
For many years, Herlihy's elegant computability based Consensus Hierarchy has
been our best explanation of the relative power of various types of
multiprocessor synchronization objects when used in deterministic algorithms.
However, key to this hierarchy is treating synchronization instructions as
distinct objects, an approach that is far from the realworld, where
multiprocessor programs apply synchronization instructions to collections of
arbitrary memory locations. We were surprised to realize that, when considering
instructions applied to memory locations, the computability based hierarchy
collapses. This leaves open the question of how to better capture the power of
various synchronization instructions.
In this paper, we provide an approach to answering this question. We present
a hierarchy of synchronization instructions, classified by their space
complexity in solving obstructionfree consensus. Our hierarchy provides a
classification of combinations of known instructions that seems to fit with our
intuition of how useful some are in practice, while questioning the
effectiveness of others. We prove an essentially tight characterization of the
power of buffered read and write instructions.Interestingly, we show a similar
result for multilocation atomic assignments.

Deep neural networks trained on large supervised datasets have led to
impressive results in image classification and other tasks. However,
wellannotated datasets can be timeconsuming and expensive to collect, lending
increased interest to larger but noisy datasets that are more easily obtained.
In this paper, we show that deep neural networks are capable of generalizing
from training data for which true labels are massively outnumbered by incorrect
labels. We demonstrate remarkably high test performance after training on
corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain
test accuracy above 90 percent even after each clean training example has been
diluted with 100 randomlylabeled examples. Such behavior holds across multiple
patterns of label noise, even when erroneous labels are biased towards
confusing classes. We show that training in this regime requires a significant
but manageable increase in dataset size that is related to the factor by which
correct labels have been diluted. Finally, we provide an analysis of our
results that shows how increasing noise decreases the effective batch size.

Deep convolutional neural networks (ConvNets) of 3dimensional kernels allow
joint modeling of spatiotemporal features. These networks have improved
performance of video and volumetric image analysis, but have been limited in
size due to the low memory ceiling of GPU hardware. Existing CPU
implementations overcome this constraint but are impractically slow. Here we
extend and optimize the faster Winogradclass of convolutional algorithms to
the $N$dimensional case and specifically for CPU hardware. First, we remove
the need to manually handcraft algorithms by exploiting the relaxed
constraints and cheap sparse access of CPU memory. Second, we maximize CPU
utilization and multicore scalability by transforming data matrices to be
cacheaware, integer multiples of AVX vector widths. Treating 2dimensional
ConvNets as a special (and the least beneficial) case of our approach, we
demonstrate a 5 to 25fold improvement in throughput compared to previous
stateoftheart.

Traditional image and video compression algorithms rely on handcrafted
encoder/decoder pairs (codecs) that lack adaptability and are agnostic to the
data being compressed. Here we describe the concept of generative compression,
the compression of data using generative models, and suggest that it is a
direction worth pursuing to produce more accurate and visually pleasing
reconstructions at much deeper compression levels for both image and video
data. We also demonstrate that generative compression is ordersofmagnitude
more resilient to bit error rates (e.g. from noisy wireless channels) than
traditional variablelength coding schemes.

Deep learning algorithms for connectomics rely upon localized classification,
rather than overall morphology. This leads to a high incidence of erroneously
merged objects. Humans, by contrast, can easily detect such errors by acquiring
intuition for the correct morphology of objects. Biological neurons have
complicated and variable shapes, which are challenging to learn, and merge
errors take a multitude of different forms. We present an algorithm, MergeNet,
that shows 3D ConvNets can, in fact, detect merge errors from highlevel
neuronal morphology. MergeNet follows unsupervised training and operates across
datasets. We demonstrate the performance of MergeNet both on a variety of
connectomics data and on a dataset created from merged MNIST images.

Connectomics is an emerging field in neuroscience that aims to reconstruct
the 3dimensional morphology of neurons from electron microscopy (EM) images.
Recent studies have successfully demonstrated the use of convolutional neural
networks (ConvNets) for segmenting cell membranes to individuate neurons.
However, there has been comparatively little success in highthroughput
identification of the intercellular synaptic connections required for deriving
connectivity graphs.
In this study, we take a compositional approach to segmenting synapses,
modeling them explicitly as an intercellular cleft colocated with an
asymmetric vesicle density along a cell membrane. Instead of requiring a deep
network to learn all natural combinations of this compositionality, we train
lighter networks to model the simpler marginal distributions of membranes,
clefts and vesicles from just 100 electron microscopy samples. These feature
maps are then combined with simple rulesbased heuristics derived from prior
biological knowledge.
Our approach to synapse detection is both more accurate than previous
stateoftheart (7% higher recall and 5% higher F1score) and yields a 20fold
speedup compared to the previous fastest implementations. We demonstrate by
reconstructing the first complete, directed connectome from the largest
available anisotropic microscopy dataset (245 GB) of mouse somatosensory cortex
(S1) in just 9.7 hours on a single sharedmemory CPU system. We believe that
this work marks an important step toward the goal of a microscopepace
streaming connectomics pipeline.

The field of connectomics faces unprecedented "big data" challenges. To
reconstruct neuronal connectivity, automated pixellevel segmentation is
required for petabytes of streaming electron microscopy data. Existing
algorithms provide relatively good accuracy but are unacceptably slow, and
would require years to extract connectivity graphs from even a single cubic
millimeter of neural tissue. Here we present a viable realtime solution, a
multipass pipeline optimized for sharedmemory multicore systems, capable of
processing data at near the terabyteperhour pace of multibeam electron
microscopes. The pipeline makes an initial fastpass over the data, and then
makes a second slowpass to iteratively correct errors in the output of the
fastpass. We demonstrate the accuracy of a sparse slowpass reconstruction
algorithm and suggest new methods for detecting morphological errors. Our
fastpass approach provided many algorithmic challenges, including the design
and implementation of novel shallow convolutional neural nets and the
parallelization of watershed and objectmerging techniques. We use it to
reconstruct, from image stack to skeletons, the full dataset of Kasthuri et al.
(463 GB capturing 120,000 cubic microns) in a matter of hours on a single
multicore machine rather than the weeks it has taken in the past on much larger
distributed systems.

Several Hybrid Transactional Memory (HyTM) schemes have recently been
proposed to complement the fast, but besteffort, nature of Hardware
Transactional Memory (HTM) with a slow, reliable software backup. However, the
fundamental limitations of building a HyTM with nontrivial concurrency between
hardware and software transactions are still not well understood.
In this paper, we propose a general model for HyTM implementations, which
captures the ability of hardware transactions to buffer memory accesses, and
allows us to formally quantify and analyze the amount of overhead
(instrumentation) of a HyTM scheme. We prove the following: (1) it is
impossible to build a strictly serializable HyTM implementation that has both
uninstrumented reads and writes, even for weak progress guarantees, and (2)
under reasonable assumptions, in any opaque progressive HyTM, a hardware
transaction must incur instrumentation costs linear in the size of its data
set. We further provide two upper bound implementations whose instrumentation
costs are optimal with respect to their progress guarantees. In sum, this paper
captures for the first time an inherent tradeoff between the degree of
concurrency a HyTM provides between hardware and software transactions, and the
amount of instrumentation overhead the implementation must incur.

JohnsonLindenstrauss (JL) matrices implemented by sparse random synaptic
connections are thought to be a prime candidate for how convergent pathways in
the brain compress information. However, to date, there is no complete
mathematical support for such implementations given the constraints of real
neural tissue. The fact that neurons are either excitatory or inhibitory
implies that every so implementable JL matrix must be signconsistent (i.e.,
all entries in a single column must be either all nonnegative or all
nonpositive), and the fact that any given neuron connects to a relatively
small subset of other neurons implies that the JL matrix had better be sparse.
We construct sparse JL matrices that are signconsistent, and prove that our
construction is essentially optimal. Our work answers a mathematical question
that was triggered by earlier work and is necessary to justify the existence of
JL compression in the brain, and emphasizes that inhibition is crucial if
neurons are to perform efficient, correlationpreserving compression.

All consensus hierarchies in the literature assume that we have, in addition
to copies of a given object, an unbounded number of registers. But why do we
really need these registers?
This paper considers what would happen if one attempts to solve consensus
using various objects but without any registers. We show that under a
reasonable assumption, objects like queues and stacks cannot emulate the
missing registers. We also show that, perhaps surprisingly, initialization,
shown to have no computational consequences when registers are readily
available, is crucial in determining the synchronization power of objects when
no registers are allowed. Finally, we show that without registers, the number
of available objects affects the level of consensus that can be solved.
Our work thus raises the question of whether consensus hierarchies which
assume an unbounded number of registers truly capture synchronization power,
and begins a line of research aimed at better understanding the interaction
between readwrite memory and the powerful synchronization operations available
on modern architectures.

The longlived renaming problem appears in sharedmemory systems where a set
of threads need to register and deregister frequently from the computation,
while concurrent operations scan the set of currently registered threads.
Instances of this problem show up in concurrent implementations of
transactional memory, flat combining, thread barriers, and memory reclamation
schemes for lockfree data structures. In this paper, we analyze a randomized
solution for longlived renaming. The algorithmic technique we consider, called
the LevelArray, has previously been used for hashing and oneshot (singleuse)
renaming. Our main contribu tion is to prove that, in longlived executions,
where processes may register and deregister polynomially many times, the
technique guarantees constant steps on average and O(log log n) steps with high
probability for registering, unit cost for deregistering, and O(n) steps for
collect queries, where n is an upper bound on the number of processes that may
be active at any point in time. We also show that the algorithm has the
surprising property that it is selfhealing: under reasonable assumptions on
the schedule, operations running while the data structure is in a degraded
state implicitly help the data structure rebalance itself. This subtle
mechanism obviates the need for expensive periodic rebuilding procedures. Our
benchmarks validate this approach, showing that, for typical use parameters,
the average number of steps a process takes to register is less than two and
the worstcase number of steps is bounded by six, even in executions with
billions of operations. We contrast this with other randomized implementations,
whose worstcase behavior we show to be unreliable, and with deterministic
implementations, whose cost is linear in n.

Lockfree concurrent algorithms guarantee that some concurrent operation will
always make progress in a finite number of steps. Yet programmers prefer to
treat concurrent code as if it were waitfree, guaranteeing that all operations
always make progress. Unfortunately, designing waitfree algorithms is
generally a very complex task, and the resulting algorithms are not always
efficient. While obtaining efficient waitfree algorithms has been a longtime
goal for the theory community, most nonblocking commercial code is only
lockfree.
This paper suggests a simple solution to this problem. We show that, for a
large class of lock free algorithms, under scheduling conditions which
approximate those found in commercial hardware architectures, lockfree
algorithms behave as if they are waitfree. In other words, programmers can
keep on designing simple lockfree algorithms instead of complex waitfree
ones, and in practice, they will get waitfree progress.
Our main contribution is a new way of analyzing a general class of lockfree
algorithms under a stochastic scheduler. Our analysis relates the individual
performance of processes with the global performance of the system using Markov
chain lifting between a complex perprocess chain and a simpler system progress
chain. We show that lockfree algorithms are not only waitfree with
probability 1, but that in fact a general subset of lockfree algorithms can be
closely bounded in terms of the average number of steps required until an
operation completes.
To the best of our knowledge, this is the first attempt to analyze progress
conditions, typically stated in relation to a worst case adversary, in a
stochastic model capturing their expected asymptotic behavior.