
Analyzing massive complex networks yields promising insights about our
everyday lives. Building scalable algorithms to do so is a challenging task
that requires a careful analysis and an extensive evaluation. However,
engineering such algorithms is often hindered by the scarcity of
publicly~available~datasets.
Network generators serve as a tool to alleviate this problem by providing
synthetic instances with controllable parameters. However, many network
generators fail to provide instances on a massive scale due to their sequential
nature or resource constraints. Additionally, truly scalable network generators
are few and often limited in their realism.
In this work, we present novel generators for a variety of network models
that are frequently used as benchmarks. By making use of pseudorandomization
and divideandconquer schemes, our generators follow a communicationfree
paradigm. The resulting generators are thus embarrassingly parallel and have a
near optimal scaling behavior. This allows us to generate instances of up to
$2^{43}$ vertices and $2^{47}$ edges in less than 22 minutes on 32768 cores.
Therefore, our generators allow new graph families to be used on an
unprecedented scale.

This paper serves as a user guide to the graph partitioning framework KaHIP
(Karlsruhe High Quality Partitioning). We give a rough overview of the
techniques used within the framework and describe the user interface as well as
the file formats used. Moreover, we provide a short description of the current
library functions provided within the framework.

Distributed programs are often formulated in popular functional frameworks
like MapReduce, Spark and Thrill, but writing efficient algorithms for such
frameworks is usually a nontrivial task. As the costs of running faulty
algorithms at scale can be severe, it is highly desirable to verify their
correctness.
We propose to employ existing imperative reference implementations as
specifications for MapReduce implementations. To this end, we present a novel
verification approach in which equivalence between an imperative and a
MapReduce implementation is established by a series of program transformations.
In this paper, we present how the equivalence framework can be used to prove
equivalence between an imperative implementation of the PageRank algorithm and
its MapReduce variant. The eight individual transformation steps are
individually presented and explained.

MPI uses the concept of communicators to connect groups of processes. It
provides nonblocking collective operations on communicators to overlap
communication and computation. Flexible algorithms demand flexible
communicators. E.g., a process can work on different subproblems within
different process groups simultaneously, new process groups can be created, or
the members of a process group can change. Depending on the number of
communicators, the time for communicator creation can drastically increase the
running time of the algorithm. Furthermore, a new communicator synchronizes all
processes as communicator creation routines are blocking collective operations.
We present RBC, a communication library based on MPI, that creates
rangebased communicators in constant time without communication. These RBC
communicators support (non)blocking pointtopoint communication as well as
(non)blocking collective operations. Our experiments show that the library
reduces the time to create a new communicator by a factor of more than 400
whereas the running time of collective operations remains about the same. We
propose Janus Quicksort, a distributed sorting algorithm that avoids any load
imbalances. We improved the performance of this algorithm by a factor of 15 for
moderate inputs by using RBC communicators. Finally, we discuss different
approaches to bring nonblocking (local) communicator creation of lightweight
(rangebased) communicators into MPI.

We propose fast probabilistic algorithms with low (i.e., sublinear in the
input size) communication volume to check the correctness of operations in Big
Data processing frameworks and distributed databases. Our checkers cover many
of the commonly used operations, including sum, average, median, and minimum
aggregation, as well as sorting, union, merge, and zip. An experimental
evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the
low overhead and high failure detection rate predicted by theoretical analysis.

Partitioning graphs into blocks of roughly equal size such that few edges run
between blocks is a frequently needed operation in processing graphs. Recently,
size, variety, and structural complexity of these networks has grown
dramatically. Unfortunately, previous approaches to parallel graph partitioning
have problems in this context since they often show a negative tradeoff
between speed and quality. We present an approach to multilevel sharedmemory
parallel graph partitioning that guarantees balanced solutions, shows high
speedups for a variety of large graphs and yields very good quality
independently of the number of cores used. For example, on 31 cores, our
algorithm partitions our largest test instance into 16 blocks cutting less than
half the number of edges than our main competitor when both algorithms are
given the same amount of time. Important ingredients include parallel label
propagation for both coarsening and improvement, parallel initial partitioning,
a simple yet effective approach to parallel localized local search, and fast
locality preserving hash tables.

We present a refinement framework for multilevel hypergraph partitioning that
uses maxflow computations on pairs of blocks to improve the solution quality
of a $k$way partition. The framework generalizes the flowbased improvement
algorithm of KaFFPa from graphs to hypergraphs and is integrated into the
hypergraph partitioner KaHyPar. By reducing the size of hypergraph flow
networks, improving the flow model used in KaFFPa, and developing techniques to
improve the running time of our algorithm, we obtain a partitioner that
computes the best solutions for a wide range of benchmark hypergraphs from
different application areas while still having a running time comparable to
that of hMetis.

MapReduce frameworks are widely used for the implementation of distributed
algorithms. However, translating imperative algorithms into these frameworks
requires significant structural changes to the algorithm. As the costs of
running faulty algorithms at scale can be severe, it is highly desirable to
verify the correctness of the translation, i.e., to prove that the MapReduce
version is equivalent to the imperative original. We present a novel approach
for proving equivalence between imperative and MapReduce algorithms based on
partitioning the equivalence proof into a sequence of equivalence proofs
between intermediate programs with smaller differences. Our approach is based
on the insight that two kinds of subproofs are required: (1) uniform
transformations changing the controlflow structure that are mostly independent
of the particular context in which they are applied; and (2) contextdependent
transformations that are not uniform but that preserve the overall structure
and can be proved correct using coupling invariants. We demonstrate the
feasibility of our approach by evaluating it on two prototypical algorithms
commonly used as examples in MapReduce frameworks: kmeans and PageRank. To
carry out the proofs, we use the interactive theorem prover Coq with partial
proof automation. The results show that our approach and its prototypical
implementation based on Coq enables equivalence proofs of nontrivial
algorithms and could be automated to a large degree.

Main memory columnstores have proven to be efficient for processing
analytical queries. Still, there has been much less work in the context of
clusters. Using only a single machine poses several restrictions: Processing
power and data volume are bounded to the number of cores and main memory
fitting on one tightly coupled system. To enable the processing of larger data
sets, switching to a cluster becomes necessary. In this work, we explore
techniques for efficient execution of analytical SQL queries on large amounts
of data in a parallel database cluster while making maximal use of the
available hardware. This includes precompiled query plans for efficient CPU
utilization, full parallelization on single nodes and across the cluster, and
efficient internode communication. We implement all features in a prototype
for running a subset of TPCH benchmark queries. We evaluate our implementation
using a 128 node cluster running TPCH queries with 30 000 gigabyte of
uncompressed data.

We present a sorting algorithm that works inplace, executes in parallel, is
cacheefficient, avoids branchmispredictions, and performs work O(n log n) for
arbitrary inputs with high probability. The main algorithmic contributions are
new ways to make distributionbased algorithms inplace: On the practical side,
by using coarsegrained blockbased permutations, and on the theoretical side,
we show how to eliminate the recursion stack. Extensive experiments show that
our algorithm IPS$^4$o scales well on a variety of multicore machines. We
outperform our closest inplace competitor by a factor of up to 3. Even as a
sequential algorithm, we are up to 1.5 times faster than the closest sequential
competitor, BlockQuicksort.

We consider space efficient hash tables that can grow and shrink dynamically
and are always highly space efficient, i.e., their space consumption is always
close to the lower bound even while growing and when taking into account
storage that is only needed temporarily. None of the traditionally used hash
tables have this property. We show how known approaches like linear probing and
bucket cuckoo hashing can be adapted to this scenario by subdividing them into
many subtables or using virtual memory overcommitting. However, these rather
straightforward solutions suffer from slow amortized insertion times due to
frequent reallocation in small increments.
Our main result is DySECT ({\bf Dy}namic {\bf S}pace {\bf E}fficient {\bf
C}uckoo {\bf T}able) which avoids these problems. DySECT consists of many
subtables which grow by doubling their size. The resulting inhomogeneity in
subtable sizes is equalized by the flexibility available in bucket cuckoo
hashing where each element can go to several buckets each of which containing
several cells. Experiments indicate that DySECT works well with load factors up
to 98\%. With up to 2.7 times better performance than the next best solution.

Depthfirst search (DFS) is the basis for many efficient graph algorithms. We
introduce general techniques for the efficient implementation of DFSbased
graph algorithms and exemplify them on three algorithms for computing strongly
connected components. The techniques lead to speedups by a factor of two to
three compared to the implementations provided by LEDA and BOOST.
We have obtained similar speedups for biconnected components algorithms. We
also compare the graph data types of LEDA and BOOST.

Cell nuclei segmentation is one of the most important tasks in the analysis
of biomedical images. With evergrowing sizes and amounts of threedimensional
images to be processed, there is a need for better and faster segmentation
methods. Graphbased image segmentation has seen a rise in popularity in recent
years, but is seen as very costly with regard to computational demand. We
propose a new segmentation algorithm which overcomes these limitations. Our
method uses recursive balanced graph partitioning to segment foreground
components of a fast and efficient binarization. We construct a model for the
cell nuclei to guide the partitioning process. Our algorithm is compared to
other stateoftheart segmentation algorithms in an experimental evaluation on
two sets of realistically simulated inputs. Our method is faster, has similar
or better quality and an acceptable memory overhead.

Computing high quality node separators in large graphs is necessary for a
variety of applications, ranging from divideandconquer algorithms to VLSI
design. In this work, we present a novel distributed evolutionary algorithm
tackling the kway node separator problem. A key component of our contribution
includes new kway local search algorithms based on maximum flows. We combine
our local search with a multilevel approach to compute an initial population
for our evolutionary algorithm, and further show how to modify the coarsening
stage of our multilevel algorithm to create effective combine and mutation
operations. Lastly, we combine these techniques with a scalable communication
protocol, producing a system that is able to compute high quality solutions in
a short amount of time. Our experiments against competing algorithms show that
our advanced evolutionary algorithm computes the best result on 94% of the
chosen benchmark instances.

We present a distributed fulltext index for big data applications in a
distributed environment. Our index can answer different types of pattern
matching queries (existential, counting and enumeration). We perform
experiments on inputs up to 100 GiB using up to 512 processors, and compare our
index with the distributed suffix array by Arroyuelo et al. [Parall. Comput.
40(9): 471495, 2014]. The result is that our index answers counting queries
up to 5.5 times faster than the distributed suffix array, while using about the
same space. We also provide a succinct variant of our index that uses only one
third of the memory compared with our nonsuccinct variant, at the expense of
only 20% slower query times.

We investigate distributed memory parallel sorting algorithms that scale to
the largest available machines and are robust with respect to input size and
distribution of the input elements. The main outcome is that four sorting
algorithms cover the entire range of possible input sizes. For three algorithms
we devise new low overhead mechanisms to make them robust with respect to
duplicate keys and skewed input distributions. One of these, designed for
medium sized inputs, is a new variant of quicksort with fast highquality pivot
selection.
At the same time asymptotic analysis provides performance guarantees and
guides the selection and configuration of the algorithms. We validate these
hypotheses using extensive experiments on 7 algorithms, 10 input distributions,
up to 262144 cores, and varying input sizes over 9 orders of magnitude. For
difficult input distributions, our algorithms are the only ones that work at
all. For all but the largest input sizes, we are the first to perform
experiments on such large machines at all and our algorithms significantly
outperform the ones one would conventionally have considered.

We consider the problem of sampling $n$ numbers from the range
$\{1,\ldots,N\}$ without replacement on modern architectures. The main result
is a simple divideandconquer scheme that makes sequential algorithms more
cache efficient and leads to a parallel algorithm running in expected time
$\mathcal{O}\left(n/p+\log p\right)$ on $p$ processors. The amount of
communication between the processors is very small and independent of the
sample size. We also discuss modifications needed for load balancing, reservoir
sampling, online sampling, sampling with replacement, Bernoulli sampling, and
vectorization on SIMD units or GPUs.

Concurrent hash tables are one of the most important concurrent data
structures with numerous applications. Since hash table accesses can dominate
the execution time of the overall application, we need implementations that
achieve good speedup. Unfortunately, currently available concurrent hashing
libraries turn out to be far away from this requirement in particular when
contention on some elements occurs.
Our starting point for better performing data structures is a fast and simple
lockfree concurrent hash table based on linear probing that is limited to
wordsized keyvalue types and does not support dynamic size adaptation. We
explain how to lift these limitations in a provably scalable way and
demonstrate that dynamic growing has a performance overhead comparable to the
same generalization in sequential hash tables.
We perform extensive experiments comparing the performance of our
implementations with six of the most widely used concurrent hash tables. Ours
are considerably faster than the best algorithms with similar restrictions and
an order of magnitude faster than the best more general tables. In some extreme
cases, the difference even approaches four orders of magnitude.

We present the design and a first performance evaluation of Thrill  a
prototype of a general purpose big data processing framework with a convenient
dataflow style programming interface. Thrill is somewhat similar to Apache
Spark and Apache Flink with at least two main differences. First, Thrill is
based on C++ which enables performance advantages due to direct native code
compilation, a more cachefriendly memory layout, and explicit memory
management. In particular, Thrill uses template metaprogramming to compile
chains of subsequent local operations into a single binary routine without
intermediate buffering and with minimal indirections. Second, Thrill uses
arrays rather than multisets as its primary data structure which enables
additional operations like sorting, prefix sums, window scans, or combining
corresponding fields of several arrays (zipping). We compare Thrill with Apache
Spark and Apache Flink using five kernels from the HiBench suite. Thrill is
consistently faster and often several times faster than the other frameworks.
At the same time, the source codes have a similar level of simplicity and
abstraction

Using (a,b)trees as an example, we show how to perform a parallel split with
logarithmic latency and parallel join, bulk updates, intersection, union (or
merge), and (symmetric) set difference with logarithmic latency and with
information theoretically optimal work. We present both asymptotically optimal
solutions and simplified versions that perform well in practice  they are
several times faster than previous implementations.

Systematic validation is an essential part of algorithm development. The
enormous dataset sizes and the complexity observed in many recent timeresolved
3D fluorescence microscopy imaging experiments, however, prohibit a
comprehensive manual ground truth generation. Moreover, existing simulated
benchmarks in this field are often too simple or too specialized to
sufficiently validate the observed image analysis problems. We present a new
semisynthetic approach to generate realistic 3D+t benchmarks that combines
challenging cellular movement dynamics of real embryos with simulated
fluorescent nuclei and artificial image distortions including various
parametrizable options like cell numbers, acquisition deficiencies or multiview
simulations. We successfully applied the approach to simulate the development
of a zebrafish embryo with thousands of cells over 14 hours of its early
existence.

We explain how massive instances of scalefree graphs following the
BarabasiAlbert model can be generated very quickly in an embarrassingly
parallel way. This makes this popular model available for studying big data
graph problems. As a demonstration, we generated a Petaedge graph in less than
an hour.

Computing highquality independent sets quickly is an important problem in
combinatorial optimization. Several recent algorithms have shown that
kernelization techniques can be used to find exact maximum independent sets in
mediumsized sparse graphs, as well as highquality independent sets in huge
sparse graphs that are intractable for exact (exponentialtime) algorithms.
However, a major drawback of these algorithms is that they require significant
preprocessing overhead, and therefore cannot be used to find a highquality
independent set quickly.
In this paper, we show that performing simple kernelization techniques in an
online fashion significantly boosts the performance of local search, and is
much faster than precomputing a kernel using advanced techniques. In addition,
we show that cutting highdegree vertices can boost local search performance
even further, especially on huge (sparse) complex networks. Our experiments
show that we can drastically speed up the computation of large independent sets
compared to other stateoftheart algorithms, while also producing results
that are very close to the best known solutions.

We develop a multilevel algorithm for hypergraph partitioning that contracts
the vertices one at a time. Using several caching and lazyevaluation
techniques during coarsening and refinement, we reduce the running time by up
to twoorders of magnitude compared to a naive $n$level algorithm that would
be adequate for ordinary graph partitioning. The overall performance is even
better than the widely used hMetis hypergraph partitioner that uses a classical
multilevel algorithm with few levels. Aided by a portfoliobased approach to
initial partitioning and adaptive budgeting of imbalance within recursive
bipartitioning, we achieve very high quality. We assembled a large benchmark
set with 310 hypergraphs stemming from application areas such VLSI, SAT
solving, social networks, and scientific computing. We achieve significantly
smaller cuts than hMetis and PaToH, while being faster than hMetis.
Considerably larger improvements are observed for some instance classes like
social networks, for bipartitioning, and for partitions with an allowed
imbalance of 10%. The algorithm presented in this work forms the basis of our
hypergraph partitioning framework KaHyPar (Karlsruhe Hypergraph Partitioning).

We present scalable parallel algorithms with sublinear perprocessor
communication volume and low latency for several fundamental problems related
to finding the most relevant elements in a set, for various notions of
relevance: We begin with the classical selection problem with unsorted input.
We present generalizations with locally sorted inputs, dynamic content
(bulkparallel priority queues), and multiple criteria. Then we move on to
finding frequent objects and topk sum aggregation. Since it is unavoidable
that the output of these algorithms might be unevenly distributed over the
processors, we also explain how to redistribute this data with minimal
communication.