
Computer networks have become a critical infrastructure. In fact, networks
should not only meet strict requirements in terms of correctness, availability,
and performance, but they should also be very flexible and support fast
updates, e.g., due to policy changes, increasing traffic, or failures. This
paper presents a structured survey of mechanism and protocols to update
computer networks in a fast and consistent manner. In particular, we identify
and discuss the different desirable consistency properties that should be
provided throughout a network update, the algorithmic techniques which are
needed to meet these consistency properties, and the implications on the speed
and costs at which updates can be performed. We also explain the relationship
between consistent network update problems and classic algorithmic optimization
ones. While our survey is mainly motivated by the advent of SoftwareDefined
Networks (SDNs) and their primary need for correct and efficient update
techniques, the fundamental underlying problems are not new, and we provide a
historical perspective of the subject as well.

Software Transactional Memory systems (STMs) have garnered significant
interest as an elegant alternative for addressing synchronization and
concurrency issues with multithreaded programming in multicore systems.
Client programs use STMs by issuing transactions. STM ensures that transaction
either commits or aborts. A transaction aborted due to conflicts is typically
reissued with the expectation that it will complete successfully in a
subsequent incarnation. However, many existing STMs fail to provide starvation
freedom, i.e., in these systems, it is possible that concurrency conflicts may
prevent an incarnated transaction from committing. To overcome this limitation,
we systematically derive a novel starvation free algorithm for multiversion
STM. Our algorithm can be used either with the case where the number of
versions is unbounded and garbage collection is used or where only the latest K
versions are maintained, KSFTM. We have demonstrated that our proposed
algorithm performs better than existing stateoftheart STMs.

Distributed machine learning algorithms enable learning of models from
datasets that are distributed over a network without gathering the data at a
centralized location. While efficient distributed algorithms have been
developed under the assumption of faultless networks, failures that can render
these algorithms nonfunctional occur frequently in the real world. This paper
focuses on the problem of Byzantine failures, which are the hardest to
safeguard against in distributed algorithms. While Byzantine fault tolerance
has a rich history, existing work does not translate into efficient and
practical algorithms for highdimensional learning in fully distributed (also
known as decentralized) settings. In this paper, an algorithm termed
Byzantineresilient distributed coordinate descent (ByRDiE) is developed and
analyzed that enables distributed learning in the presence of Byzantine
failures. Theoretical analysis (convex settings) and numerical experiments
(convex and nonconvex settings) highlight its usefulness for highdimensional
distributed learning in the presence of Byzantine failures.

Accelerator architectures specialize in executing SIMD (single instruction,
multiple data) in lockstep. Because the majority of CUDA applications are
parallelized loops, control flow information can provide an indepth
characterization of a kernel. CUDAflow is a tool that statically separates CUDA
binaries into basic block regions and dynamically measures instruction and
basic block frequencies. CUDAflow captures this information in a control flow
graph (CFG) and performs subgraph matching across various kernel's CFGs to gain
insights to an application's resource requirements, based on the shape and
traversal of the graph, instruction operations executed and registers
allocated, among other information. The utility of CUDAflow is demonstrated
with SHOC and Rodinia application case studies on a variety of GPU
architectures, revealing novel thread divergence characteristics that
facilitates end users, autotuners and compilers in generating high performing
code.

Analyzing massive complex networks yields promising insights about our
everyday lives. Building scalable algorithms to do so is a challenging task
that requires a careful analysis and an extensive evaluation. However,
engineering such algorithms is often hindered by the scarcity of
publicly~available~datasets.
Network generators serve as a tool to alleviate this problem by providing
synthetic instances with controllable parameters. However, many network
generators fail to provide instances on a massive scale due to their sequential
nature or resource constraints. Additionally, truly scalable network generators
are few and often limited in their realism.
In this work, we present novel generators for a variety of network models
that are frequently used as benchmarks. By making use of pseudorandomization
and divideandconquer schemes, our generators follow a communicationfree
paradigm. The resulting generators are thus embarrassingly parallel and have a
near optimal scaling behavior. This allows us to generate instances of up to
$2^{43}$ vertices and $2^{47}$ edges in less than 22 minutes on 32768 cores.
Therefore, our generators allow new graph families to be used on an
unprecedented scale.

Most of today's highspeed switches and routers adopt an inputqueued
crossbar switch architecture. Such a switch needs to compute a matching
(crossbar schedule) between the input ports and output ports during each
switching cycle (time slot). A key research challenge in designing large (in
number of input/output ports $N$) inputqueued crossbar switches is to develop
crossbar scheduling algorithms that can compute "high quality" matchings 
i.e., those that result in high switch throughput (ideally $100\%$) and low
queueing delays for packets  at line rates. SERENA is one such algorithm: it
outputs excellent matching decisions that result in $100\%$ switch throughput
and reasonably good queueing delays. However, since SERENA is a centralized
algorithm with $O(N)$ computational complexity, it cannot support switches that
both are large and have a very high line rate per port. In this work, we
propose SERENADE (SERENA, the Distributed Edition), a parallel iterative
algorithm that emulates SERENA in only $O(\log N)$ iterations between input
ports and output ports, and hence has a time complexity of only $O(\log N)$ per
port. We prove that SERENADE can exactly emulate SERENA. We also propose an
earlystop version of SERENADE, called OSERENADE, to only approximately
emulate SERENA. Through extensive simulations, we show that OSERENADE can
achieve 100\% throughput and that it has similar as or slightly better delay
performance than SERENA under various load conditions and traffic patterns.

In this paper we improve the deterministic complexity of two fundamental
communication primitives in the classical model of adhoc radio networks with
unknown topology: broadcasting and wakeup. We consider an unknown radio
network, in which all nodes have no prior knowledge about network topology, and
know only the size of the network $n$, the maximum indegree of any node
$\Delta$, and the eccentricity of the network $D$.
For such networks, we first give an algorithm for wakeup, based on the
existence of small universal synchronizers. This algorithm runs in
$O(\frac{\min\{n, D \Delta\} \log n \log \Delta}{\log\log \Delta})$ time, the
fastest known in both directed and undirected networks, improving over the
previous best $O(n \log^2n)$time result across all ranges of parameters, but
particularly when maximum indegree is small.
Next, we introduce a new combinatorial framework of block synchronizers and
prove the existence of such objects of low size. Using this framework, we
design a new deterministic algorithm for the fundamental problem of
broadcasting, running in $O(n \log D \log\log\frac{D \Delta}{n})$ time. This is
the fastest known algorithm for the problem in directed networks, improving
upon the $O(n \log n \log \log n)$time algorithm of De Marco (2010) and the
$O(n \log^2 D)$time algorithm due to Czumaj and Rytter (2003). It is also the
first to come within a loglogarithmic factor of the $\Omega(n \log D)$ lower
bound due to Clementi et al.\ (2003).
Our results also have direct implications on the fastest \emph{deterministic
leader election} and \emph{clock synchronization} algorithms in both directed
and undirected radio networks, tasks which are commonly used as building blocks
for more complex procedures.

The \emph{beep model} is a very weak communications model in which devices in
a network can communicate only via beeps and silence. As a result of its weak
assumptions, it has broad applicability to many different implementations of
communications networks. This comes at the cost of a restrictive environment
for algorithm design.
Despite being only recently introduced, the beep model has received
considerable attention, in part due to its relationship with other
communication models such as that of adhoc radio networks. However, there has
been no definitive published result for several fundamental tasks in the model.
We aim to rectify this with our paper.
We present algorithms and lower bounds for a variety of fundamental global
communications tasks in the model.

In this paper we present a framework for leader election in multihop radio
networks which yield randomized leader election algorithms taking
$O(\text{broadcasting time})$ in expectation, and another which yields
algorithms taking fixed $O(\sqrt{\log n})$times broadcasting time. Both
succeed with high probability.
We show how to implement these frameworks in radio networks without collision
detection, and in networks with collision detection (in fact in the strictly
weaker beep model). In doing so, we obtain the first optimal expectedtime
leader election algorithms in both settings, and also improve the worstcase
running time in directed networks without collision detection by an $O(\sqrt
{\log n})$ factor.

Due to the big size of data and limited data storage volume of a single
computer or a single server, data are often stored in a distributed manner.
Thus, performing largescale machine learning operations with the distributed
datasets through communication networks is often required. In this paper, we
study the convergence rate of the distributed dual coordinate ascent for
distributed machine learning problems in a general treestructured network.
Since a tree network model can be understood as the generalization of a star
network model, our algorithm can be thought of as the generalization of the
distributed dual coordinate ascent in a star network model. We provide the
convergence rate of the distributed dual coordinate ascent over a general tree
network in a recursive manner and analyze the network effect on the convergence
rate. Secondly, by considering network communication delays, we optimize the
distributed dual coordinate ascent algorithm to maximize its convergence speed.
From our analytical result, we can choose the optimal number of local
iterations depending on the communication delay severity to achieve the fastest
convergence speed. In numerical experiments, we consider machine learning
scenarios over communication networks, where local workers cannot directly
reach to a central node due to constraints in communication, and demonstrate
that the usability of our distributed dual coordinate ascent algorithm in tree
networks. Additionally, we show that adapting number of local and global
iterations to network communication delays in the distributed dual coordinated
ascent algorithm can improve its convergence speed.

In this work, we study the $k$median and $k$means clustering problems when
the data is distributed across many servers and can contain outliers. While
there has been a lot of work on these problems for worstcase instances, we
focus on gaining a finer understanding through the lens of beyond worstcase
analysis. Our main motivation is the following: for many applications such as
clustering proteins by function or clustering communities in a social network,
there is some unknown target clustering, and the hope is that running a
$k$median or $k$means algorithm will produce clusterings which are close to
matching the target clustering. Worstcase results can guarantee constant
factor approximations to the optimal $k$median or $k$means objective value,
but not closeness to the target clustering.
Our first result is a distributed algorithm which returns a nearoptimal
clustering assuming a natural notion of stability, namely, approximation
stability [Balcan et. al 2013], even when a constant fraction of the data are
outliers. The communication complexity is $\tilde O(sk+z)$ where $s$ is the
number of machines, $k$ is the number of clusters, and $z$ is the number of
outliers.
Next, we show this amount of communication cannot be improved even in the
setting when the input satisfies various nonworstcase assumptions. We give a
matching $\Omega(sk+z)$ lower bound on the communication required both for
approximating the optimal $k$means or $k$median cost up to any constant, and
for returning a clustering that is close to the target clustering in Hamming
distance. These lower bounds hold even when the data satisfies approximation
stability or other common notions of stability, and the cluster sizes are
balanced. Therefore, $\Omega(sk+z)$ is a communication bottleneck, even for
realworld instances.

Bayesian matrix factorization (BMF) is a powerful tool for producing lowrank
representations of matrices and for predicting missing values and providing
confidence intervals. Scaling up the posterior inference for massivescale
matrices is challenging and requires distributing both data and computation
over many workers, making communication the main computational bottleneck.
Embarrassingly parallel inference would remove the communication needed, by
using completely independent computations on different data subsets, but it
suffers from the inherent unidentifiability of BMF solutions. We introduce a
hierarchical decomposition of the joint posterior distribution, which couples
the subset inferences, allowing for embarrassingly parallel computations in a
sequence of at most three stages. Using an efficient approximate
implementation, we show improvements empirically on both real and simulated
data. Our distributed approach is able to achieve a speedup of almost an order
of magnitude over the full posterior, with a negligible effect on predictive
accuracy. Our method outperforms stateoftheart embarrassingly parallel MCMC
methods in accuracy, and achieves results competitive to other available
distributed and parallel implementations of BMF.

In systems of programmable matter, we are given a collection of simple
computation elements (or particles) with limited (constantsize) memory. We are
interested in when they can selforganize to solve systemwide problems of
movement, configuration and coordination. Here, we initiate a stochastic
approach to developing robust distributed algorithms for programmable matter
systems using Markov chains. We are able to leverage the wealth of prior work
in Markov chains and related areas to design and rigorously analyze our
distributed algorithms and show that they have several desirable properties.
We study the compression problem, in which a particle system must gather as
tightly together as possible, as in a sphere or its equivalent in the presence
of some underlying geometry. More specifically, we seek fully distributed,
local, and asynchronous algorithms that lead the system to converge to a
configuration with small boundary. We present a Markov chainbased algorithm
that solves the compression problem under the geometric amoebot model, for
particle systems that begin in a connected configuration. The algorithm takes
as input a bias parameter $\lambda$, where $\lambda > 1$ corresponds to
particles favoring having more neighbors. We show that for all $\lambda >
2+\sqrt{2}$, there is a constant $\alpha > 1$ such that eventually with all but
exponentially small probability the particles are $\alpha$compressed, meaning
the perimeter of the system configuration is at most $\alpha \cdot p_{min}$,
where $p_{min}$ is the minimum possible perimeter of the particle system.
Surprisingly, the same algorithm can also be used for expansion when $0 <
\lambda < 2.17$, and we prove similar results about expansion for values of
$\lambda$ in this range. This is counterintuitive as it shows that particles
preferring to be next to each other ($\lambda > 1$) is not sufficient to
guarantee compression.

Intel Xeon Phi manyintegratedcore (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multithreaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a stateofthe art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
stateoftheart graph framework. 4 The subNUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system autotuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.

A group of mobile agents is given a task to explore an edgeweighted graph
$G$, i.e., every vertex of $G$ has to be visited by at least one agent. There
is no centralized unit to coordinate their actions, but they can freely
communicate with each other. The goal is to construct a deterministic strategy
which allows agents to complete their task optimally. In this paper we are
interested in a costoptimal strategy, where the cost is understood as the
total distance traversed by agents coupled with the cost of invoking them. Two
graph classes are analyzed, rings and trees, in the offline and online
setting, i.e., when a structure of a graph is known and not known to agents in
advance. We present algorithms that compute the optimal solutions for a given
ring and tree of order $n$, in $O(n)$ time units. For rings in the online
setting, we give the $2$competitive algorithm and prove the lower bound of
$3/2$ for the competitive ratio for any online strategy. For every strategy
for trees in the online setting, we prove the competitive ratio to be no less
than $2$, which can be achieved by the $DFS$ algorithm.

Today's data analytics frameworks are computecentric, with analytics
execution almost entirely dependent on the predetermined physical structure of
the highlevel computation. Relegating intermediate data to a second class
entity in this manner hurts flexibility, performance, and efficiency. We
present Whiz, a new analytics framework that cleanly separates computation from
intermediate data. It enables runtime visibility into data via programmable
monitoring, and datadriven computation (where intermediate data values drive
when/what computation runs) via an event abstraction. Experiments with a Whiz
prototype on a large cluster using batch, streaming, and graph analytics
workloads show that its performance is 1.32x better than stateoftheart.

Given a system model where machines have distinct speeds and power ratings
but are otherwise compatible, we consider various problems of scheduling under
resource constraints on the system which place the restriction that not all
machines can be run at once. These can be power, energy, or makespan
constraints on the system. Given such constraints, there are problems with
divisible as well as nondivisible jobs. In the setting where there is a
constraint on power, we show that the problem of minimizing makespan for a set
of divisible jobs is NPhard by reduction to the knapsack problem. We then show
that scheduling to minimize energy with power constraints is also NPhard. We
then consider scheduling with energy and makespan constraints with divisible
jobs and show that these can be solved in polynomial time, and the problems
with nondivisible jobs are NPhard. We give exact and approximation algorithms
for these problems as required.

Many distributed systems work on a common shared state; in such systems,
distributed agreement is necessary for consistency. With an increasing number
of servers, these systems become more susceptible to singleserver failures,
increasing the relevance of faulttolerance. Atomic broadcast enables
faulttolerant distributed agreement, yet it is costly to solve. Most practical
algorithms entail linear work per broadcast message. AllConcur  a leaderless
approach  reduces the work, by connecting the servers via a sparse resilient
overlay network; yet, this resiliency entails redundancy, limiting the
reduction of work. In this paper, we propose AllConcur+, an atomic broadcast
algorithm that lifts this limitation: During intervals with no failures, it
achieves minimal work by using a redundancyfree overlay network. When failures
do occur, it automatically recovers by switching to a resilient overlay
network. In our performance evaluation of nonfailure scenarios, AllConcur+
achieves comparable throughput to AllGather  a nonfaulttolerant distributed
agreement algorithm  and outperforms AllConcur, LCR and Libpaxos both in
terms of throughput and latency. Furthermore, our evaluation of failure
scenarios shows that AllConcur+'s expected performance is robust with regard to
occasional failures. Thus, for realistic use cases, leveraging redundancyfree
distributed agreement during intervals with no failures improves performance
significantly.

This paper describes the application of a highlevel language and method in
developing simpler specifications of more complex variants of the Paxos
algorithm for distributed consensus. The specifications are for MultiPaxos
with preemption, replicated state machine, and reconfiguration and optimized
with state reduction and failure detection. The language is DistAlgo. The key
is to express complex control flows and synchronization conditions precisely at
a high level, using nondeterministic waits and messagehistory queries. We
obtain complete executable specifications that are almost completely
declarativeupdating only a number for the protocol round besides the sets of
messages sent and received.
We show the following results: 1.English and pseudocode descriptions of
distributed algorithms can be captured completely and precisely at a high
level, without adding, removing, or reformulating algorithm details to fit
lowerlevel, more abstract, or less direct languages. 2.We created higherlevel
control flows and synchronization conditions than all previous specifications,
and obtained specifications that are much simpler and smaller, even matching or
smaller than abstract specifications that omit many algorithm details. 3.The
simpler specifications led us to easily discover useless replies, unnecessary
delays, and liveness violations (if messages can be lost) in previous published
specifications, by just following the simplified algorithm flows. 4.The
resulting specifications can be executed directly, and we can express
optimizations cleanly, yielding drastic performance improvement over naive
execution and facilitating a general method for merging processes. 5.We
systematically translated the resulting specifications into TLA+ and developed
machinechecked safety proofs, which also allowed us to detect and fix a subtle
safety violation in an earlier unpublished specification.

In this note, we extend the algorithms Extra and subgradientpush to a new
algorithm ExtraPush for consensus optimization with convex differentiable
objective functions over a directed network. When the stationary distribution
of the network can be computed in advance}, we propose a simplified algorithm
called Normalized ExtraPush. Just like Extra, both ExtraPush and Normalized
ExtraPush can iterate with a fixed step size. But unlike Extra, they can take a
columnstochastic mixing matrix, which is not necessarily doubly stochastic.
Therefore, they remove the undirectednetwork restriction of Extra.
Subgradientpush, while also works for directed networks, is slower on the same
type of problem because it must use a sequence of diminishing step sizes.
We present preliminary analysis for ExtraPush under a bounded sequence
assumption. For Normalized ExtraPush, we show that it naturally produces a
bounded, linearly convergent sequence provided that the objective function is
strongly convex.
In our numerical experiments, ExtraPush and Normalized ExtraPush performed
similarly well. They are significantly faster than subgradientpush, even when
we handoptimize the step sizes for the latter.

This paper is a contribution to the classical cops and robber problem on a
graph, directed to twodimensional grids and toroidal grids. These studies are
generally aimed at determining the minimum number of cops needed to capture the
robber and proposing algorithms for the capture. We apply some new concepts to
propose a new solution to the problem on grids that was already solved under a
different approach, and apply these concepts to give efficient algorithms for
the capture on toroidal grids. As for grids, we show that two cops suffice even
in a semitorus (i.e. a grid with toroidal closure in one dimension) and three
cops are necessary and sufficient in a torus. Then we treat the problem in
function of any number k of cops, giving efficient algorithms for grids and
tori and computing lower and upper bounds on the capture time. Conversely we
determine the minimum value of k needed for any given capture time and study a
possible speedup phenomenon.

This paper poses a question about a simple localization problem. The question
is if an {\em oblivious} walker on a linesegment can localize the middle point
of the linesegment in {\em finite} steps observing the direction (i.e., Left
or Right) and the distance to the nearest end point. This problem is arisen
from {\em selfstabilizing} location problems by {\em autonomous mobile robots}
with {\em limited visibility}, that is a widely interested abstract model in
distributed computing. Contrary to appearances, it is far from trivial if this
simple problem is solvable or not, and unsettled yet. This paper is concerned
with three variants of the problem with a minimal relaxation, and presents
selfstabilizing algorithms for them. We also show an easy impossibility
theorem for bilaterally symmetric algorithms.

This paper severs as a user guide to the graph partitioning framework KaHIP
(Karlsruhe High Quality Partitioning). We give a rough overview of the
techniques used within the framework and describe the user interface as well as
the file formats used. Moreover, we provide a short description of the current
library functions provided within the framework. Since version 3.00 we support
multilevel partitioning, memetic algorithms, distributed and sharedmemory
parallel algorithms, node separator and ordering algorithms, edge partitioning
algorithms as well as ILP solvers.

The degree splitting problem requires coloring the edges of a graph red or
blue such that each node has almost the same number of edges in each color, up
to a small additive discrepancy. The directed variant of the problem requires
orienting the edges such that each node has almost the same number of incoming
and outgoing edges, again up to a small additive discrepancy.
We present deterministic distributed algorithms for both variants, which
improve on their counterparts presented by Ghaffari and Su [SODA'17]: our
algorithms are significantly simpler and faster, and have a much smaller
discrepancy. This also leads to a faster and simpler deterministic algorithm
for $(2+o(1))\Delta$edgecoloring, improving on that of Ghaffari and Su.

Paxos is an important algorithm for a set of distributed processes to agree
on a single value or a sequence of values, for which it is called Basic Paxos
or MultiPaxos, respectively. Consensus is critical when distributed services
are replicated for faulttolerance, because nonfaulty replicas must agree on
the state of the system or the sequence of operations that have been performed.
Unfortunately, consensus algorithms including MultiPaxos in particular are
wellknown to be difficult to understand, and their accurate specifications and
correctness proofs remain challenging, despite extensive studies ever since
Lamport introduced Paxos.
This article describes formal specification and verification of Lamport's
MultiPaxos algorithm for distributed consensus. The specification is written
in TLA+, Lamport's Temporal Logic of Actions. The proof is written and
automatically checked using TLAPS, the TLA+ Proof System. The proof is for the
safety property of the algorithm. Building on Lamport, Merz, and Doligez's
specification and proof for Basic Paxos, we aim to facilitate the understanding
of MultiPaxos and its proof by minimizing the difference from those for Basic
Paxos, and to demonstrate a general way of proving other variants of Paxos and
other sophisticated distributed algorithms. We also discuss our general
strategies and results for proving complex invariants using invariance lemmas
and increments, for proving properties about sets and tuples to help the proof
check succeed in significantly reduced time, and for overall proof improvement
leading to considerably reduced proof size.