
We study highdimensional distribution learning in an agnostic setting where
an adversary is allowed to arbitrarily corrupt an $\varepsilon$fraction of the
samples. Such questions have a rich history spanning statistics, machine
learning and theoretical computer science. Even in the most basic settings, the
only known approaches are either computationally inefficient or lose
dimensiondependent factors in their error guarantees. This raises the
following question:Is highdimensional agnostic distribution learning even
possible, algorithmically?
In this work, we obtain the first computationally efficient algorithms with
dimensionindependent error guarantees for agnostically learning several
fundamental classes of highdimensional distributions: (1) a single Gaussian,
(2) a product distribution on the hypercube, (3) mixtures of two product
distributions (under a natural balancedness condition), and (4) mixtures of
spherical Gaussians. Our algorithms achieve error that is independent of the
dimension, and in many cases scales nearlylinearly with the fraction of
adversarially corrupted samples. Moreover, we develop a general recipe for
detecting and correcting corruptions in highdimensions, that may be applicable
to many other problems.

While Generative Adversarial Networks (GANs) have demonstrated promising
performance on multiple vision tasks, their learning dynamics are not yet well
understood, both in theory and in practice. To address this issue, we study GAN
dynamics in a simple yet rich parametric model that exhibits several of the
common problematic convergence behaviors such as vanishing gradients, mode
collapse, and diverging or oscillatory behavior. In spite of the nonconvex
nature of our model, we are able to perform a rigorous theoretical analysis of
its convergence behavior. Our analysis reveals an interesting dichotomy: a GAN
with an optimal discriminator provably converges, while first order
approximations of the discriminator steps lead to unstable GAN dynamics and
mode collapse. Our result suggests that using first order discriminator steps
(the defacto standard in most existing GAN setups) might be one of the factors
that makes GAN training challenging in practice.

We design nearly optimal differentially private algorithms for learning two
fundamental families of highdimensional distributions in total variation
distance: multivariate Gaussians in $\mathbb{R}^{d}$ and product distributions
on the hypercube. The sample complexity of both our algorithms approaches the
sample complexity of nonprivate learners up to a small multiplicative factor
and an additional additive term that is lower order for a wide range of
parameters, showing that privacy comes essentially for free for these problems.
Our algorithms use a novel technical approach to reducing the sensitivity of
the estimation procedure that we call recursive private preconditioning and may
find additional applications.

This paper studies the problem of distributed stochastic optimization in an
adversarial setting where, out of the $m$ machines which allegedly compute
stochastic gradients every iteration, an $\alpha$fraction are Byzantine, and
can behave arbitrarily and adversarially. Our main result is a variant of
stochastic gradient descent (SGD) which finds $\varepsilon$approximate
minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m}
+ \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional
minibatch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations,
but cannot tolerate Byzantine failures. Further, we provide a lower bound
showing that, up to logarithmic factors, our algorithm is
informationtheoretically optimal both in terms of sampling complexity and time
complexity.

Robust estimation is much more challenging in high dimensions than it is in
one dimension: Most techniques either lead to intractable optimization problems
or estimators that can tolerate only a tiny fraction of errors. Recent work in
theoretical computer science has shown that, in appropriate distributional
models, it is possible to robustly estimate the mean and covariance with
polynomial time algorithms that can tolerate a constant fraction of
corruptions, independent of the dimension. However, the sample and time
complexity of these algorithms is prohibitively large for highdimensional
applications. In this work, we address both of these issues by establishing
sample complexity bounds that are optimal, up to logarithmic factors, as well
as giving various refinements that allow the algorithms to tolerate a much
larger fraction of corruptions. Finally, we show on both synthetic and real
data that our algorithms have stateoftheart performance and suddenly make
highdimensional robust estimation a realistic possibility.

In high dimensions, most machine learning methods are brittle to even a small
fraction of structured outliers. To address this, we introduce a new
metaalgorithm that can take in a base learner such as least squares or
stochastic gradient descent, and harden the learner to be resistant to
outliers. Our method, Sever, possesses strong theoretical guarantees yet is
also highly scalable  beyond running the base learner itself, it only
requires computing the top singular vector of a certain $n \times d$ matrix. We
apply Sever on a drug design dataset and a spam classification dataset, and
find that in both cases it has substantially greater robustness than several
baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$
test error, compared to $13.4\%20.5\%$ for the baselines, and $3\%$ error on
the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$
corruptions, we achieved $1.42$ meansquared error test error, compared to
$1.51$$2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.

We study the problem of robustly learning multidimensional histograms. A
$d$dimensional function $h: D \rightarrow \mathbb{R}$ is called a
$k$histogram if there exists a partition of the domain $D \subseteq
\mathbb{R}^d$ into $k$ axisaligned rectangles such that $h$ is constant within
each such rectangle. Let $f: D \rightarrow \mathbb{R}$ be a $d$dimensional
probability density function and suppose that $f$ is $\mathrm{OPT}$close, in
$L_1$distance, to an unknown $k$histogram (with unknown partition). Our goal
is to output a hypothesis that is $O(\mathrm{OPT}) + \epsilon$ close to $f$, in
$L_1$distance. We give an algorithm for this learning problem that uses $n =
\tilde{O}_d(k/\epsilon^2)$ samples and runs in time $\tilde{O}_d(n)$. For any
fixed dimension, our algorithm has optimal sample complexity, up to logarithmic
factors, and runs in nearlinear time. Prior to our work, the time complexity
of the $d=1$ case was wellunderstood, but significant gaps in our
understanding remained even for $d=2$.

Recently there has been significant interest in training machinelearning
models at low precision: by reducing precision, one can reduce computation and
communication by one order of magnitude. We examine training at reduced
precision, both from a theoretical and practical perspective, and ask: is it
possible to train models at endtoend low precision with provable guarantees?
Can this lead to consistent orderofmagnitude speedups? We present a framework
called ZipML to answer these questions. For linear models, the answer is yes.
We develop a simple framework based on one simple but novel strategy called
double sampling. Our framework is able to execute training at low precision
with no bias, guaranteeing convergence, whereas naive quantization would
introduce significant bias. We validate our framework across a range of
applications, and show that it enables an FPGA prototype that is up to 6.5x
faster than an implementation using full 32bit precision. We further develop a
varianceoptimal stochastic quantization strategy and show that it can make a
significant difference in a variety of settings. When applied to linear models
together with double sampling, we save up to another 1.7x in data movement
compared with uniform quantization. When training deep networks with quantized
models, we achieve higher accuracy than the stateoftheart XNORNet. Finally,
we extend our framework through approximation to nonlinear models, such as
SVM. We show that, although using lowprecision data induces bias, we can
appropriately bound and control the bias. We find in practice 8bit precision
is often sufficient to converge to the correct solution. Interestingly,
however, in practice we notice that our framework does not always outperform
the naive rounding approach. We discuss this negative result in detail.

Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in stateoftheart concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the singlechoice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is $O( n )$ while the expected worstcase
cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
ballsintobins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.

Parallel implementations of stochastic gradient descent (SGD) have received
significant research attention, thanks to excellent scalability properties of
this algorithm, and to its efficiency in the context of training deep neural
networks. A fundamental barrier for parallelizing largescale SGD is the fact
that the cost of communicating the gradient updates between nodes can be very
large. Consequently, lossy compression heuristics have been proposed, by which
nodes only communicate quantized gradients. Although effective in practice,
these heuristics do not always provably converge, and it is not clear whether
they are optimal.
In this paper, we propose Quantized SGD (QSGD), a family of compression
schemes which allow the compression of gradient updates at each node, while
guaranteeing convergence under standard assumptions. QSGD allows the user to
trade off compression and convergence time: it can communicate a sublinear
number of bits per iteration in the model dimension, and can achieve
asymptotically optimal communication cost. We complement our theoretical
results with empirical data, showing that QSGD can significantly reduce
communication cost, while being competitive with standard uncompressed
techniques on a variety of real tasks.
In particular, experiments show that gradient quantization applied to
training of deep neural networks for image classification and automated speech
recognition can lead to significant reductions in communication cost, and
endtoend training time. For instance, on 16 GPUs, we are able to train a
ResNet152 network on ImageNet 1.8x faster to full accuracy. Of note, we show
that there exist generic parameter settings under which all known network
architectures preserve or slightly improve their full accuracy when using
quantization.

We study the fundamental problem of learning the parameters of a
highdimensional Gaussian in the presence of noise  where an
$\varepsilon$fraction of our samples were chosen by an adversary. We give
robust estimators that achieve estimation error $O(\varepsilon)$ in the total
variation distance, which is optimal up to a universal constant that is
independent of the dimension.
In the case where just the mean is unknown, our robustness guarantee is
optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$
and $1/\epsilon$. When both the mean and covariance are unknown, the running
time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all
of our algorithms require only a polynomial number of samples. Our work shows
that the same sorts of error guarantees that were established over fifty years
ago in the onedimensional setting can also be achieved by efficient algorithms
in highdimensional settings.

In this paper we initiate the study of whether or not sparse estimation tasks
can be performed efficiently in high dimensions, in the robust setting where an
$\eps$fraction of samples are corrupted adversarially. We study the natural
robust version of two classical sparse estimation problems, namely, sparse mean
estimation and sparse PCA in the spiked covariance model. For both of these
problems, we provide the first efficient algorithms that provide nontrivial
error guarantees in the presence of noise, using only a number of samples which
is similar to the number required for these problems without noise. In
particular, our sample complexities are sublinear in the ambient dimension $d$.
Our work also suggests evidence for new computationalvsstatistical gaps for
these problems (similar to those for sparse PCA without noise) which only arise
in the presence of noise.

We study the fixed design segmented regression problem: Given noisy samples
from a piecewise linear function $f$, we want to recover $f$ up to a desired
accuracy in meansquared error.
Previous rigorous approaches for this problem rely on dynamic programming
(DP) and, while sample efficient, have running time quadratic in the sample
size. As our main contribution, we provide new sample nearlinear time
algorithms for the problem that  while not being minimax optimal  achieve a
significantly better sampletime tradeoff on large datasets compared to the DP
approach. Our experimental evaluation shows that, compared with the DP
approach, our algorithms provide a convergence rate that is only off by a
factor of $2$ to $4$, while achieving speedups of three orders of magnitude.

Learning a Gaussian mixture model (GMM) is a fundamental problem in machine
learning, learning theory, and statistics. One notion of learning a GMM is
proper learning: here, the goal is to find a mixture of $k$ Gaussians
$\mathcal{M}$ that is close to the density $f$ of the unknown distribution from
which we draw samples. The distance between $\mathcal{M}$ and $f$ is typically
measured in the total variation or $L_1$norm.
We give an algorithm for learning a mixture of $k$ univariate Gaussians that
is nearly optimal for any fixed $k$. The sample complexity of our algorithm is
$\tilde{O}(\frac{k}{\epsilon^2})$ and the running time is $(k \cdot
\log\frac{1}{\epsilon})^{O(k^4)} + \tilde{O}(\frac{k}{\epsilon^2})$. It is
wellknown that this sample complexity is optimal (up to logarithmic factors),
and it was already achieved by prior work. However, the best known time
complexity for proper learning a $k$GMM was
$\tilde{O}(\frac{1}{\epsilon^{3k1}})$. In particular, the dependence between
$\frac{1}{\epsilon}$ and $k$ was exponential. We significantly improve this
dependence by replacing the $\frac{1}{\epsilon}$ term with a $\log
\frac{1}{\epsilon}$ while only increasing the exponent moderately. Hence, for
any fixed $k$, the $\tilde{O} (\frac{k}{\epsilon^2})$ term dominates our
running time, and thus our algorithm runs in time which is nearlylinear in the
number of samples drawn. Achieving a running time of $\textrm{poly}(k,
\frac{1}{\epsilon})$ for proper learning of $k$GMMs has recently been stated
as an open problem by multiple researchers, and we make progress on this
question.
Moreover, our approach offers an agnostic learning guarantee: our algorithm
returns a good GMM even if the distribution we are sampling from is not a
mixture of Gaussians. To the best of our knowledge, our algorithm is the first
agnostic proper learning algorithm for GMMs.

We design a new, fast algorithm for agnostically learning univariate
probability distributions whose densities are well approximated by piecewise
polynomial functions. Let $f$ be the density function of an arbitrary
univariate distribution, and suppose that $f$ is $\mathrm{OPT}$close in
$L_1$distance to an unknown piecewise polynomial function with $t$ interval
pieces and degree $d$. Our algorithm draws $n = O(t(d+1)/\epsilon^2)$ samples
from $f$, runs in time $\tilde{O}(n \cdot \mathrm{poly}(d))$, and with
probability at least $9/10$ outputs an $O(t)$piecewise degree$d$ hypothesis
$h$ that is $4 \cdot \mathrm{OPT} +\epsilon$ close to $f$.
Our general algorithm yields (nearly) sampleoptimal and nearlylinear time
estimators for a wide range of structured distribution families over both
continuous and discrete domains in a unified way. For most of our applications,
these are the first sampleoptimal and nearlylinear time estimators in the
literature. As a consequence, our work resolves the sample and computational
complexities of a broad class of inference tasks via a single "metaalgorithm".
Moreover, we experimentally demonstrate that our algorithm performs very well
in practice.
Our algorithm consists of three "levels": (i) At the top level, we employ an
iterative greedy algorithm for finding a good partition of the real line into
the pieces of a piecewise polynomial. (ii) For each piece, we show that the
subproblem of finding a good polynomial fit on the current interval can be
solved efficiently with a separation oracle method. (iii) We reduce the task of
finding a separating hyperplane to a combinatorial problem and give an
efficient algorithm for this problem. Combining these three procedures gives a
density estimation algorithm with the claimed guarantees.

A Fibonacci heap is a deterministic data structure implementing a priority
queue with optimal amortized operation costs. An unfortunate aspect of
Fibonacci heaps is that they must maintain a "mark bit" which serves only to
ensure efficiency of heap operations, not correctness. Karger proposed a simple
randomized variant of Fibonacci heaps in which mark bits are replaced by coin
flips. This variant still has expected amortized cost $O(1)$ for insert,
decreasekey, and merge. Karger conjectured that this data structure has
expected amortized cost $O(\log s)$ for deletemin, where $s$ is the number of
heap operations.
We give a tight analysis of Karger's randomized Fibonacci heaps, resolving
Karger's conjecture. Specifically, we obtain matching upper and lower bounds of
$\Theta(\log^2 s / \log \log s)$ for the runtime of deletemin. We also prove a
tight lower bound of $\Omega(\sqrt{n})$ on deletemin in terms of the number of
heap elements $n$. The request sequence used to prove this bound also solves an
open problem of Fredman on whether cascading cuts are necessary. Finally, we
give a simple additional modification to these heaps which yields a tight
runtime $O(\log^2 n / \log \log n)$ for deletemin.

All consensus hierarchies in the literature assume that we have, in addition
to copies of a given object, an unbounded number of registers. But why do we
really need these registers?
This paper considers what would happen if one attempts to solve consensus
using various objects but without any registers. We show that under a
reasonable assumption, objects like queues and stacks cannot emulate the
missing registers. We also show that, perhaps surprisingly, initialization,
shown to have no computational consequences when registers are readily
available, is crucial in determining the synchronization power of objects when
no registers are allowed. Finally, we show that without registers, the number
of available objects affects the level of consensus that can be solved.
Our work thus raises the question of whether consensus hierarchies which
assume an unbounded number of registers truly capture synchronization power,
and begins a line of research aimed at better understanding the interaction
between readwrite memory and the powerful synchronization operations available
on modern architectures.

Query evaluation in tupleindependent probabilistic databases is the problem
of computing the probability of an answer to a query given independent
probabilities of the individual tuples in a database instance. There are two
main approaches to this problem: (1) in `grounded inference' one first obtains
the lineage for the query and database instance as a Boolean formula, then
performs weighted model counting on the lineage (i.e., computes the probability
of the lineage given probabilities of its independent Boolean variables); (2)
in methods known as `lifted inference' or `extensional query evaluation', one
exploits the highlevel structure of the query as a firstorder formula.
Although it is widely believed that lifted inference is strictly more powerful
than grounded inference on the lineage alone, no formal separation has
previously been shown for query evaluation. In this paper we show such a formal
separation for the first time.
We exhibit a class of queries for which model counting can be done in
polynomial time using extensional query evaluation, whereas the algorithms used
in stateoftheart exact model counters on their lineages provably require
exponential time. Our lower bounds on the running times of these exact model
counters follow from new exponential size lower bounds on the kinds of dDNNF
representations of the lineages that these model counters (either explicitly or
implicitly) produce. Though some of these queries have been studied before, no
nontrivial lower bounds on the sizes of these representations for these
queries were previously known.

The best current methods for exactly computing the number of satisfying
assignments, or the satisfying probability, of Boolean formulas can be seen,
either directly or indirectly, as building 'decisionDNNF' (decision
decomposable negation normal form) representations of the input Boolean
formulas. DecisionDNNFs are a special case of 'dDNNF's where 'd' stands for
'deterministic'. We show that any decisionDNNF can be converted into an
equivalent 'FBDD' (free binary decision diagram)  also known as a 'readonce
branching program' (ROBP or 1BP)  with only a quasipolynomial increase in
representation size in general, and with only a polynomial increase in size in
the special case of monotone kDNF formulas. Leveraging known exponential lower
bounds for FBDDs, we then obtain similar exponential lower bounds for
decisionDNNFs which provide lower bounds for the recent algorithms. We also
separate the power of decisionDNNFs from dDNNFs and a generalization of
decisionDNNFs known as ANDFBDDs. Finally we show how these imply exponential
lower bounds for natural problems associated with probabilistic databases.