The $k$-center problem is a classical combinatorial optimization problem
which asks to find $k$ centers such that the maximum distance of any input
point in a set $P$ to its assigned center is minimized. The problem allows for
elegant $2$-approximations. However, the situation becomes significantly more
difficult when constraints are added to the problem. We raise the question
whether general methods can be derived to turn an approximation algorithm for a
clustering problem with some constraints into an approximation algorithm that
respects one constraint more. Our constraint of choice is privacy: Here, we are
asked to only open a center when at least $\ell$ clients will be assigned to
it. We show how to combine privacy with several other constraints.
In the Steiner Forest problem, we are given a graph and a collection of
source-sink pairs, and the goal is to find a subgraph of minimum total length
such that all pairs are connected. The problem is APX-Hard and can be
2-approximated by, e.g., the elegant primal-dual algorithm of Agrawal, Klein,
and Ravi from 1995.
We give a local-search-based constant-factor approximation for the problem.
Local search brings in new techniques to an area that has for long not seen any
improvements and might be a step towards a combinatorial algorithm for the more
general survivable network design problem. Moreover, local search was an
essential tool to tackle the dynamic MST/Steiner Tree problem, whereas dynamic
Steiner Forest is still wide open.
It is easy to see that any constant factor local search algorithm requires
steps that add/drop many edges together. We propose natural local moves which,
at each step, either (a) add a shortest path in the current graph and then drop
a bunch of inessential edges, or (b) add a set of edges to the current
solution. This second type of moves is motivated by the potential function we
use to measure progress, combining the cost of the solution with a penalty for
each connected component. Our carefully-chosen local moves and potential
function work in tandem to eliminate bad local minima that arise when using
more traditional local moves.
The $k$-means algorithm is one of the most widely used clustering heuristics.
Despite its simplicity, analyzing its running time and quality of approximation
is surprisingly difficult and can lead to deep insights that can be used to
improve the algorithm. In this paper we survey the recent results in this
direction as well as several extension of the basic $k$-means method.
The k-means problem consists of finding k centers in the d-dimensional
Euclidean space that minimize the sum of the squared distances of all points in
an input set P to their closest respective center. Awasthi et. al. recently
showed that there exists a constant c > 1 such that it is NP-hard to
approximate the k-means objective within a factor of c. We establish that the
constant c is at least 1.0013.
In recent years, there have been major efforts to develop data stream
algorithms that process inputs in one pass over the data with little memory
requirement. For the $k$-means problem, this has led to the development of
several $(1+\varepsilon)$-approximations (under the assumption that $k$ is a
constant), but also to the design of algorithms that are extremely fast in
practice and compute solutions of high accuracy. However, when not only the
length of the stream is high but also the dimensionality of the input points,
then current methods reach their limits.
We propose two algorithms, piecy and piecy-mr that are based on the recently
developed data stream algorithm BICO that can process high dimensional data in
one pass and output a solution of high quality. While piecy is suited for high
dimensional data with a medium number of points, piecy-mr is meant for high
dimensional data that comes in a very long stream. We provide an extensive
experimental study to evaluate piecy and piecy-mr that shows the strength of
the new algorithms.
We develop the heuristic PROBI for the probabilistic Euclidean k-median
problem based on a coreset construction by Lammersen et al. Our algorithm
computes a summary of the data and then uses an adapted version of k-means++
(Arthur and Vassilvitskii, 2007) to compute a good solution on the summary. The
summary is maintained in a data stream, so PROBI can be used in a data stream
setting on very large data sets. We experimentally evaluate the quality of the
summary and of the computed solution and compare the running time to state of
the art data stream clustering algorithms.