
The transactional conflict problem arises in transactional systems whenever
two or more concurrent transactions clash on a data item.
While the standard solution to such conflicts is to immediately abort one of
the transactions, some practical systems consider the alternative of delaying
conflict resolution for a short interval, which may allow one of the
transactions to commit. The challenge in the transactional conflict problem is
to choose the optimal length of this delay interval so as to minimize the
overall running time penalty for the conflicting transactions. In this paper,
we propose a family of optimal online algorithms for the transactional conflict
problem.
Specifically, we consider variants of this problem which arise in different
implementations of transactional systems, namely "requestor wins" and
"requestor aborts" implementations: in the former, the recipient of a coherence
request is aborted, whereas in the latter, it is the requestor which has to
abort. Both strategies are implemented by real systems.
We show that the requestor aborts case can be reduced to a classic instance
of the ski rental problem, while the requestor wins case leads to a new version
of this classical problem, for which we derive optimal deterministic and
randomized algorithms.
Moreover, we prove that, under a simplified adversarial model, our algorithms
are constantcompetitive with the offline optimum in terms of throughput.
We validate our algorithmic results empirically through a hardware simulation
of hardware transactional memory (HTM), showing that our algorithms can lead to
nontrivial performance improvements for classic concurrent data structures.

Relaxed concurrent data structures have become increasingly popular, due to
their scalability in graph processing and machine learning applications.
Despite considerable interest, there exist families of natural, high performing
randomized relaxed concurrent data structures, such as the popular MultiQueue
pattern for implementing relaxed priority queue data structures, for which no
guarantees are known in the concurrent setting. Our main contribution is in
showing for the first time that, under a set of analytic assumptions, a family
of relaxed concurrent data structures, including variants of MultiQueues, but
also a new approximate counting algorithm we call the MultiCounter, provides
strong probabilistic guarantees on the degree of relaxation with respect to the
sequential specification, in arbitrary concurrent executions. We formalize
these guarantees via a new correctness condition called distributional
linearizability, tailored to concurrent implementations with randomized
relaxations. Our result is based on a new analysis of an asynchronous variant
of the classic poweroftwochoices load balancing algorithm, in which
placement choices can be based on inconsistent, outdated information (this
result may be of independent interest). We validate our results empirically,
showing that the MultiCounter algorithm can implement scalable relaxed
timestamps, which in turn can improve the performance of the classic TL2
transactional algorithm by up to 3 times, for some settings of parameters.

Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine
learning, representing the optimization backbone for training several classic
models, from regression to neural networks. Given the recent practical focus on
distributed machine learning, significant work has been dedicated to the
convergence properties of this algorithm under the inconsistent and noisy
updates arising from execution in a distributed environment. However,
surprisingly, the convergence properties of this classic algorithm in the
standard sharedmemory model are still not wellunderstood.
In this work, we address this gap, and provide new convergence bounds for
lockfree concurrent stochastic gradient descent, executing in the classic
asynchronous shared memory model, against a strong adaptive adversary. Our
results give improved upper and lower bounds on the "price of asynchrony" when
executing the fundamental SGD algorithm in a concurrent setting. They show that
this classic optimization tool can converge faster and with a wider range of
parameters than previously known under asynchronous iterations. At the same
time, we exhibit a fundamental tradeoff between the maximum delay in the
system and the rate at which SGD can converge, which governs the set of
parameters under which this algorithm can still work efficiently.

This paper studies the problem of distributed stochastic optimization in an
adversarial setting where, out of the $m$ machines which allegedly compute
stochastic gradients every iteration, an $\alpha$fraction are Byzantine, and
can behave arbitrarily and adversarially. Our main result is a variant of
stochastic gradient descent (SGD) which finds $\varepsilon$approximate
minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m}
+ \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional
minibatch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations,
but cannot tolerate Byzantine failures. Further, we provide a lower bound
showing that, up to logarithmic factors, our algorithm is
informationtheoretically optimal both in terms of sampling complexity and time
complexity.

One of the main drivers behind the rapid recent advances in machine learning
has been the availability of efficient system support. This comes both through
hardware acceleration, but also in the form of efficient software frameworks
and programming models. Despite significant progress, scaling computeintensive
machine learning workloads to a large number of compute nodes is still a
challenging task, with significant latency and bandwidth demands. In this
paper, we address this challenge, by proposing SPARCML, a general, scalable
communication layer for machine learning applications. SPARCML is built on the
observation that many distributed machine learning algorithms either have
naturally sparse communication patters, or have updates which can be sparsified
in a structured way for improved performance, without any convergence or
accuracy loss. To exploit this insight, we design and implement a set of
communication efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute sparse input data vectors, of heterogeneous sizes. We
call these operations sparseinput collectives, and present efficient practical
algorithms with strong theoretical bounds on their running time and
communication cost. Our generic communication layer is enriched with additional
features, such support for nonblocking (asynchronous) operations, and support
for lowprecision data representations. We validate our algorithmic results
experimentally on a range of largescale machine learning applications and
target architectures, showing that we can leverage sparsity for order
ofmagnitude runtime savings, compared to stateofthe art methods and
frameworks.

Deep neural networks (DNNs) continue to make significant advances, solving
tasks from image classification to translation or reinforcement learning. One
aspect of the field receiving considerable attention is efficiently executing
deep models in resourceconstrained environments, such as mobile or embedded
devices. This paper focuses on this problem, and proposes two new compression
methods, which jointly leverage weight quantization and distillation of larger
teacher networks into smaller student networks. The first method we propose is
called quantized distillation and leverages distillation during the training
process, by incorporating distillation loss, expressed with respect to the
teacher, into the training of a student network whose weights are quantized to
a limited set of levels. The second method, differentiable quantization,
optimizes the location of quantization points through stochastic gradient
descent, to better fit the behavior of the teacher model. We validate both
methods through experiments on convolutional and recurrent architectures. We
show that quantized shallow students can reach similar accuracy levels to
fullprecision teacher models, while providing order of magnitude compression,
and inference speedup that is linear in the depth reduction. In sum, our
results enable DNNs for resourceconstrained environments to leverage
architecture and accuracy advances developed on more powerful devices.

Modern scientific instruments produce vast amounts of data, which can
overwhelm the processing ability of computer systems. Lossy compression of data
is an intriguing solution but comes with its own dangers, such as potential
signal loss, and the need for careful parameter optimization. In this work, we
focus on a setting where this problem is especially acute compressive sensing
frameworks for radio astronomy and ask: Can the precision of the data
representation be lowered for all input data, with recovery guarantees and good
practical performance? Our first contribution is a theoretical analysis of the
Iterative Hard Thresholding (IHT) algorithm when all input data, that is, the
measurement matrix and the observation, are quantized aggressively, to as
little as 2 bits per value. Under reasonable constraints, we show that there
exists a variant of low precision IHT which can still provide recovery
guarantees. The second contribution is a tailored analysis of our general
quantized framework to radio astronomy, showing that its conditions are
satisfied in this case. We evaluate our approach using an FPGA implementation,
and show that it can achieve up to 9.19x speed up with negligible loss of
recovery quality, on real telescope data

It is safe to assume that, for the foreseeable future, machine learning,
especially deep learning will remain both data and computationhungry. In this
paper, we ask: Can we build a global exchange where everyone can contribute
computation and data to train the next generation of machine learning
applications?
We present an early, but running prototype of DataBright, a system that turns
the creation of training examples and the sharing of computation into an
investment mechanism. Unlike most crowdsourcing platforms, where the
contributor gets paid when they submit their data, DataBright pays dividends
whenever a contributor's data or hardware is used by someone to train a machine
learning model. The contributor becomes a shareholder in the dataset they
created. To enable the measurement of usage, a computation platform that
contributors can trust is also necessary. DataBright thus merges both a data
market and a trusted computation market.
We illustrate that trusted computation can enable the creation of an AI
market, where each data point has an exact value that should be paid to its
creator. DataBright allows data creators to retain ownership of their
contribution and attaches to it a measurable value. The value of the data is
given by its utility in subsequent distributed computation done on the
DataBright computation market. The computation market allocates tasks and
subsequent payments to pooled hardware. This leads to the creation of a
decentralized AI cloud. Our experiments show that trusted hardware such as
Intel SGX can be added to the usual ML pipeline with no additional costs. We
use this setting to orchestrate distributed computation that enables the
creation of a computation market. DataBright is available for download at
https://github.com/ds3lab/databright.

Population protocols are a model of distributed computing, in which $n$
agents with limited local state interact randomly, and cooperate to
collectively compute global predicates. An extensive series of papers, across
different communities, has examined the computability and complexity
characteristics of this model. Majority, or consensus, is a central task, in
which agents need to collectively reach a decision as to which one of two
states $A$ or $B$ had a higher initial count. Two complexity metrics are
important: the time that a protocol requires to stabilize to an output
decision, and the state space size that each agent requires.
It is known that majority requires $\Omega(\log \log n)$ states per agent to
allow for polylogarithmic time stabilization, and that $O(\log^2 n)$ states
are sufficient. Thus, there is an exponential gap between the upper and lower
bounds.
We address this question. We provide a new lower bound of $\Omega(\log n)$
states for any protocol which stabilizes in $O( n^{1c} )$ time, for any $c >
0$ constant. This result is conditional on basic monotonicity and output
assumptions, satisfied by all known protocols. Technically, it represents a
significant departure from previous lower bounds. Instead of relying on dense
configurations, we introduce a new surgery technique to construct executions
which contradict the correctness of algorithms that stabilize too fast.
Subsequently, our lower bound applies to general initial configurations.
We give an algorithm for majority which uses $O(\log n)$ states, and
stabilizes in $O(\log^2 n)$ time. Central to the algorithm is a new leaderless
phase clock, which allows nodes to synchronize in phases of $\Theta(n \log{n})$
consecutive interactions using $O(\log n)$ states per node. We also employ our
phase clock to build a leader election algorithm with $O(\log n )$ states,
which stabilizes in $O(\log^2 n)$ time.

In contrast to electronic computation, chemical computation is noisy and
susceptible to a variety of sources of error, which has prevented the
construction of robust complex systems. To be effective, chemical algorithms
must be designed with an appropriate error model in mind. Here we consider the
model of chemical reaction networks that preserve molecular count (population
protocols), and ask whether computation can be made robust to a natural model
of unintended "leak" reactions. Our definition of leak is motivated by both the
particular spurious behavior seen when implementing chemical reaction networks
with DNA strand displacement cascades, as well as the unavoidable side
reactions in any implementation due to the basic laws of chemistry. We develop
a new "Robust Detection" algorithm for the problem of fast (logarithmic time)
single molecule detection, and prove that it is robust to this general model of
leaks. Besides potential applications in single molecule detection, the
errorcorrection ideas developed here might enable a new class of
robustbydesign chemical algorithms. Our analysis is based on a nonstandard
hybrid argument, combining ideas from discrete analysis of population protocols
with classic Markov chain techniques.

Recently there has been significant interest in training machinelearning
models at low precision: by reducing precision, one can reduce computation and
communication by one order of magnitude. We examine training at reduced
precision, both from a theoretical and practical perspective, and ask: is it
possible to train models at endtoend low precision with provable guarantees?
Can this lead to consistent orderofmagnitude speedups? We present a framework
called ZipML to answer these questions. For linear models, the answer is yes.
We develop a simple framework based on one simple but novel strategy called
double sampling. Our framework is able to execute training at low precision
with no bias, guaranteeing convergence, whereas naive quantization would
introduce significant bias. We validate our framework across a range of
applications, and show that it enables an FPGA prototype that is up to 6.5x
faster than an implementation using full 32bit precision. We further develop a
varianceoptimal stochastic quantization strategy and show that it can make a
significant difference in a variety of settings. When applied to linear models
together with double sampling, we save up to another 1.7x in data movement
compared with uniform quantization. When training deep networks with quantized
models, we achieve higher accuracy than the stateoftheart XNORNet. Finally,
we extend our framework through approximation to nonlinear models, such as
SVM. We show that, although using lowprecision data induces bias, we can
appropriately bound and control the bias. We find in practice 8bit precision
is often sufficient to converge to the correct solution. Interestingly,
however, in practice we notice that our framework does not always outperform
the naive rounding approach. We discuss this negative result in detail.

Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in stateoftheart concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the singlechoice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is $O( n )$ while the expected worstcase
cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
ballsintobins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.

Parallel implementations of stochastic gradient descent (SGD) have received
significant research attention, thanks to excellent scalability properties of
this algorithm, and to its efficiency in the context of training deep neural
networks. A fundamental barrier for parallelizing largescale SGD is the fact
that the cost of communicating the gradient updates between nodes can be very
large. Consequently, lossy compression heuristics have been proposed, by which
nodes only communicate quantized gradients. Although effective in practice,
these heuristics do not always provably converge, and it is not clear whether
they are optimal.
In this paper, we propose Quantized SGD (QSGD), a family of compression
schemes which allow the compression of gradient updates at each node, while
guaranteeing convergence under standard assumptions. QSGD allows the user to
trade off compression and convergence time: it can communicate a sublinear
number of bits per iteration in the model dimension, and can achieve
asymptotically optimal communication cost. We complement our theoretical
results with empirical data, showing that QSGD can significantly reduce
communication cost, while being competitive with standard uncompressed
techniques on a variety of real tasks.
In particular, experiments show that gradient quantization applied to
training of deep neural networks for image classification and automated speech
recognition can lead to significant reductions in communication cost, and
endtoend training time. For instance, on 16 GPUs, we are able to train a
ResNet152 network on ImageNet 1.8x faster to full accuracy. Of note, we show
that there exist generic parameter settings under which all known network
architectures preserve or slightly improve their full accuracy when using
quantization.

Population protocols are a popular model of distributed computing, in which
randomlyinteracting agents with little computational power cooperate to
jointly perform computational tasks. Inspired by developments in molecular
computation, and in particular DNA computing, recent algorithmic work has
focused on the complexity of solving simple yet fundamental tasks in the
population model, such as leader election (which requires stabilization to a
single agent in a special "leader" state), and majority (in which agents must
stabilize to a decision as to which of two possible initial states had higher
initial count). Known results point towards an inherent tradeoff between the
time complexity of such algorithms, and the space complexity, i.e. size of the
memory available to each agent.
In this paper, we explore this tradeoff and provide new upper and lower
bounds for majority and leader election. First, we prove a unified lower bound,
which relates the space available per node with the time complexity achievable
by a protocol: for instance, our result implies that any protocol solving
either of these tasks for $n$ agents using $O( \log \log n )$ states must take
$\Omega( n / \rm{polylog} n )$ expected time. This is the first result to
characterize time complexity for protocols which employ superconstant number
of states per node, and proves that fast, polylogarithmic running times
require protocols to have relatively large space costs.
On the positive side, we give algorithms showing that fast, polylogarithmic
stabilization time can be achieved using $O( \log^2 n )$ space per node, in the
case of both tasks. Overall, our results highlight a time complexity separation
between $O(\log \log n)$ and $\Theta( \log^2 n )$ state space size for both
majority and leader election in population protocols, and introduce new
techniques, which should be applicable more broadly.

Population protocols are networks of finitestate agents, interacting
randomly, and updating their states using simple rules. Despite their extreme
simplicity, these systems have been shown to cooperatively perform complex
computational tasks, such as simulating register machines to compute standard
arithmetic functions. The election of a unique leader agent is a key
requirement in such computational constructions. Yet, the fastest currently
known population protocol for electing a leader only has linear stabilization
time, and, it has recently been shown that no population protocol using a
constant number of states per node may overcome this linear bound.
In this paper, we give the first population protocol for leader election with
polylogarithmic stabilization time, using polylogarithmic memory states per
node. The protocol structure is quite simple: each node has an associated
value, and is either a leader (still in contention) or a minion (following some
leader). A leader keeps incrementing its value and "defeats" other leaders in
onetoone interactions, and will drop from contention and become a minion if
it meets a leader with higher value. Importantly, a leader also drops out if it
meets a minion with higher absolute value. While these rules are quite simple,
the proof that this algorithm achieves polylogarithmic stabilization time is
nontrivial. In particular, the argument combines careful use of concentration
inequalities with anticoncentration bounds, showing that the leaders' values
become spread apart as the execution progresses, which in turn implies that
straggling leaders get quickly eliminated. We complement our analysis with
empirical results, showing that our protocol stabilizes extremely fast, even
for large network sizes.

Several Hybrid Transactional Memory (HyTM) schemes have recently been
proposed to complement the fast, but besteffort, nature of Hardware
Transactional Memory (HTM) with a slow, reliable software backup. However, the
fundamental limitations of building a HyTM with nontrivial concurrency between
hardware and software transactions are still not well understood.
In this paper, we propose a general model for HyTM implementations, which
captures the ability of hardware transactions to buffer memory accesses, and
allows us to formally quantify and analyze the amount of overhead
(instrumentation) of a HyTM scheme. We prove the following: (1) it is
impossible to build a strictly serializable HyTM implementation that has both
uninstrumented reads and writes, even for weak progress guarantees, and (2)
under reasonable assumptions, in any opaque progressive HyTM, a hardware
transaction must incur instrumentation costs linear in the size of its data
set. We further provide two upper bound implementations whose instrumentation
costs are optimal with respect to their progress guarantees. In sum, this paper
captures for the first time an inherent tradeoff between the degree of
concurrency a HyTM provides between hardware and software transactions, and the
amount of instrumentation overhead the implementation must incur.

The problem of electing a leader from among $n$ contenders is one of the
fundamental questions in distributed computing. In its simplest formulation,
the task is as follows: given $n$ processors, all participants must eventually
return a win or lose indication, such that a single contender may win. Despite
a considerable amount of work on leader election, the following question is
still open: can we elect a leader in an asynchronous faultprone system faster
than just running a $\Theta(\log n)$time tournament, against a strong adaptive
adversary?
In this paper, we answer this question in the affirmative, improving on a
decadesold upper bound. We introduce two new algorithmic ideas to reduce the
time complexity of electing a leader to $O(\log^* n)$, using $O(n^2)$
pointtopoint messages. A nontrivial application of our algorithm is a new
upper bound for the tight renaming problem, assigning $n$ items to the $n$
participants in expected $O(\log^2 n)$ time and $O(n^2)$ messages. We
complement our results with lower bound of $\Omega(n^2)$ messages for solving
these two problems, closing the question of their message complexity.

The longlived renaming problem appears in sharedmemory systems where a set
of threads need to register and deregister frequently from the computation,
while concurrent operations scan the set of currently registered threads.
Instances of this problem show up in concurrent implementations of
transactional memory, flat combining, thread barriers, and memory reclamation
schemes for lockfree data structures. In this paper, we analyze a randomized
solution for longlived renaming. The algorithmic technique we consider, called
the LevelArray, has previously been used for hashing and oneshot (singleuse)
renaming. Our main contribu tion is to prove that, in longlived executions,
where processes may register and deregister polynomially many times, the
technique guarantees constant steps on average and O(log log n) steps with high
probability for registering, unit cost for deregistering, and O(n) steps for
collect queries, where n is an upper bound on the number of processes that may
be active at any point in time. We also show that the algorithm has the
surprising property that it is selfhealing: under reasonable assumptions on
the schedule, operations running while the data structure is in a degraded
state implicitly help the data structure rebalance itself. This subtle
mechanism obviates the need for expensive periodic rebuilding procedures. Our
benchmarks validate this approach, showing that, for typical use parameters,
the average number of steps a process takes to register is less than two and
the worstcase number of steps is bounded by six, even in executions with
billions of operations. We contrast this with other randomized implementations,
whose worstcase behavior we show to be unreliable, and with deterministic
implementations, whose cost is linear in n.

Lockfree concurrent algorithms guarantee that some concurrent operation will
always make progress in a finite number of steps. Yet programmers prefer to
treat concurrent code as if it were waitfree, guaranteeing that all operations
always make progress. Unfortunately, designing waitfree algorithms is
generally a very complex task, and the resulting algorithms are not always
efficient. While obtaining efficient waitfree algorithms has been a longtime
goal for the theory community, most nonblocking commercial code is only
lockfree.
This paper suggests a simple solution to this problem. We show that, for a
large class of lock free algorithms, under scheduling conditions which
approximate those found in commercial hardware architectures, lockfree
algorithms behave as if they are waitfree. In other words, programmers can
keep on designing simple lockfree algorithms instead of complex waitfree
ones, and in practice, they will get waitfree progress.
Our main contribution is a new way of analyzing a general class of lockfree
algorithms under a stochastic scheduler. Our analysis relates the individual
performance of processes with the global performance of the system using Markov
chain lifting between a complex perprocess chain and a simpler system progress
chain. We show that lockfree algorithms are not only waitfree with
probability 1, but that in fact a general subset of lockfree algorithms can be
closely bounded in terms of the average number of steps required until an
operation completes.
To the best of our knowledge, this is the first attempt to analyze progress
conditions, typically stated in relation to a worst case adversary, in a
stochastic model capturing their expected asymptotic behavior.