Relaxed concurrent data structures have become increasingly popular, due to
their scalability in graph processing and machine learning applications.
Despite considerable interest, there exist families of natural, high performing
randomized relaxed concurrent data structures, such as the popular MultiQueue
pattern for implementing relaxed priority queue data structures, for which no
guarantees are known in the concurrent setting. Our main contribution is in
showing for the first time that, under a set of analytic assumptions, a family
of relaxed concurrent data structures, including variants of MultiQueues, but
also a new approximate counting algorithm we call the MultiCounter, provides
strong probabilistic guarantees on the degree of relaxation with respect to the
sequential specification, in arbitrary concurrent executions. We formalize
these guarantees via a new correctness condition called distributional
linearizability, tailored to concurrent implementations with randomized
relaxations. Our result is based on a new analysis of an asynchronous variant
of the classic power-of-two-choices load balancing algorithm, in which
placement choices can be based on inconsistent, outdated information (this
result may be of independent interest). We validate our results empirically,
showing that the MultiCounter algorithm can implement scalable relaxed
timestamps, which in turn can improve the performance of the classic TL2
transactional algorithm by up to 3 times, for some settings of parameters.
We prove that path puzzles with complete row and column information--or
equivalently, 2D orthogonal discrete tomography with Hamiltonicity
constraint--are strongly NP-complete, ASP-complete, and #P-complete. Along the
way, we newly establish ASP-completeness and #P-completeness for 3-Dimensional
Matching and Numerical 3-Dimensional Matching.
Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is $O( n )$ while the expected worst-case
cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.
Several Hybrid Transactional Memory (HyTM) schemes have recently been
proposed to complement the fast, but best-effort, nature of Hardware
Transactional Memory (HTM) with a slow, reliable software backup. However, the
fundamental limitations of building a HyTM with nontrivial concurrency between
hardware and software transactions are still not well understood.
In this paper, we propose a general model for HyTM implementations, which
captures the ability of hardware transactions to buffer memory accesses, and
allows us to formally quantify and analyze the amount of overhead
(instrumentation) of a HyTM scheme. We prove the following: (1) it is
impossible to build a strictly serializable HyTM implementation that has both
uninstrumented reads and writes, even for weak progress guarantees, and (2)
under reasonable assumptions, in any opaque progressive HyTM, a hardware
transaction must incur instrumentation costs linear in the size of its data
set. We further provide two upper bound implementations whose instrumentation
costs are optimal with respect to their progress guarantees. In sum, this paper
captures for the first time an inherent trade-off between the degree of
concurrency a HyTM provides between hardware and software transactions, and the
amount of instrumentation overhead the implementation must incur.
The long-lived renaming problem appears in shared-memory systems where a set
of threads need to register and deregister frequently from the computation,
while concurrent operations scan the set of currently registered threads.
Instances of this problem show up in concurrent implementations of
transactional memory, flat combining, thread barriers, and memory reclamation
schemes for lock-free data structures. In this paper, we analyze a randomized
solution for long-lived renaming. The algorithmic technique we consider, called
the LevelArray, has previously been used for hashing and one-shot (single-use)
renaming. Our main contribu- tion is to prove that, in long-lived executions,
where processes may register and deregister polynomially many times, the
technique guarantees constant steps on average and O(log log n) steps with high
probability for registering, unit cost for deregistering, and O(n) steps for
collect queries, where n is an upper bound on the number of processes that may
be active at any point in time. We also show that the algorithm has the
surprising property that it is self-healing: under reasonable assumptions on
the schedule, operations running while the data structure is in a degraded
state implicitly help the data structure re-balance itself. This subtle
mechanism obviates the need for expensive periodic rebuilding procedures. Our
benchmarks validate this approach, showing that, for typical use parameters,
the average number of steps a process takes to register is less than two and
the worst-case number of steps is bounded by six, even in executions with
billions of operations. We contrast this with other randomized implementations,
whose worst-case behavior we show to be unreliable, and with deterministic
implementations, whose cost is linear in n.