• We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an $\varepsilon$-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question:Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions, that may be applicable to many other problems.
  • While Generative Adversarial Networks (GANs) have demonstrated promising performance on multiple vision tasks, their learning dynamics are not yet well understood, both in theory and in practice. To address this issue, we study GAN dynamics in a simple yet rich parametric model that exhibits several of the common problematic convergence behaviors such as vanishing gradients, mode collapse, and diverging or oscillatory behavior. In spite of the non-convex nature of our model, we are able to perform a rigorous theoretical analysis of its convergence behavior. Our analysis reveals an interesting dichotomy: a GAN with an optimal discriminator provably converges, while first order approximations of the discriminator steps lead to unstable GAN dynamics and mode collapse. Our result suggests that using first order discriminator steps (the de-facto standard in most existing GAN setups) might be one of the factors that makes GAN training challenging in practice.
  • We design nearly optimal differentially private algorithms for learning two fundamental families of high-dimensional distributions in total variation distance: multivariate Gaussians in $\mathbb{R}^{d}$ and product distributions on the hypercube. The sample complexity of both our algorithms approaches the sample complexity of non-private learners up to a small multiplicative factor and an additional additive term that is lower order for a wide range of parameters, showing that privacy comes essentially for free for these problems. Our algorithms use a novel technical approach to reducing the sensitivity of the estimation procedure that we call recursive private preconditioning and may find additional applications.
  • This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.
  • Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.
  • In high dimensions, most machine learning methods are brittle to even a small fraction of structured outliers. To address this, we introduce a new meta-algorithm that can take in a base learner such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. Our method, Sever, possesses strong theoretical guarantees yet is also highly scalable -- beyond running the base learner itself, it only requires computing the top singular vector of a certain $n \times d$ matrix. We apply Sever on a drug design dataset and a spam classification dataset, and find that in both cases it has substantially greater robustness than several baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$ test error, compared to $13.4\%-20.5\%$ for the baselines, and $3\%$ error on the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$ corruptions, we achieved $1.42$ mean-squared error test error, compared to $1.51$-$2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.
  • We study the problem of robustly learning multi-dimensional histograms. A $d$-dimensional function $h: D \rightarrow \mathbb{R}$ is called a $k$-histogram if there exists a partition of the domain $D \subseteq \mathbb{R}^d$ into $k$ axis-aligned rectangles such that $h$ is constant within each such rectangle. Let $f: D \rightarrow \mathbb{R}$ be a $d$-dimensional probability density function and suppose that $f$ is $\mathrm{OPT}$-close, in $L_1$-distance, to an unknown $k$-histogram (with unknown partition). Our goal is to output a hypothesis that is $O(\mathrm{OPT}) + \epsilon$ close to $f$, in $L_1$-distance. We give an algorithm for this learning problem that uses $n = \tilde{O}_d(k/\epsilon^2)$ samples and runs in time $\tilde{O}_d(n)$. For any fixed dimension, our algorithm has optimal sample complexity, up to logarithmic factors, and runs in near-linear time. Prior to our work, the time complexity of the $d=1$ case was well-understood, but significant gaps in our understanding remained even for $d=2$.
  • Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by one order of magnitude. We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to consistent order-of-magnitude speedups? We present a framework called ZipML to answer these questions. For linear models, the answer is yes. We develop a simple framework based on one simple but novel strategy called double sampling. Our framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias. We validate our framework across a range of applications, and show that it enables an FPGA prototype that is up to 6.5x faster than an implementation using full 32-bit precision. We further develop a variance-optimal stochastic quantization strategy and show that it can make a significant difference in a variety of settings. When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization. When training deep networks with quantized models, we achieve higher accuracy than the state-of-the-art XNOR-Net. Finally, we extend our framework through approximation to non-linear models, such as SVM. We show that, although using low-precision data induces bias, we can appropriately bound and control the bias. We find in practice 8-bit precision is often sufficient to converge to the correct solution. Interestingly, however, in practice we notice that our framework does not always outperform the naive rounding approach. We discuss this negative result in detail.
  • Consider the following random process: we are given $n$ queues, into which elements of increasing labels are inserted uniformly at random. To remove an element, we pick two queues at random, and remove the element of lower label (higher priority) among the two. The cost of a removal is the rank of the label removed, among labels still present in any of the queues, that is, the distance from the optimal choice at each step. Variants of this strategy are prevalent in state-of-the-art concurrent priority queue implementations. Nonetheless, it is not known whether such implementations provide any rank guarantees, even in a sequential model. We answer this question, showing that this strategy provides surprisingly strong guarantees: Although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two choice process, the expected rank of a removed element is $O( n )$ while the expected worst-case cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the number of steps for which we run the process. The argument is based on a new technical connection between "heavily loaded" balls-into-bins processes and priority scheduling. Our analytic results inspire a new concurrent priority queue implementation, which improves upon the state of the art in terms of practical performance.
  • Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always provably converge, and it is not clear whether they are optimal. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions. QSGD allows the user to trade off compression and convergence time: it can communicate a sublinear number of bits per iteration in the model dimension, and can achieve asymptotically optimal communication cost. We complement our theoretical results with empirical data, showing that QSGD can significantly reduce communication cost, while being competitive with standard uncompressed techniques on a variety of real tasks. In particular, experiments show that gradient quantization applied to training of deep neural networks for image classification and automated speech recognition can lead to significant reductions in communication cost, and end-to-end training time. For instance, on 16 GPUs, we are able to train a ResNet-152 network on ImageNet 1.8x faster to full accuracy. Of note, we show that there exist generic parameter settings under which all known network architectures preserve or slightly improve their full accuracy when using quantization.
  • We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise -- where an $\varepsilon$-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error $O(\varepsilon)$ in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$ and $1/\epsilon$. When both the mean and covariance are unknown, the running time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings.
  • In this paper we initiate the study of whether or not sparse estimation tasks can be performed efficiently in high dimensions, in the robust setting where an $\eps$-fraction of samples are corrupted adversarially. We study the natural robust version of two classical sparse estimation problems, namely, sparse mean estimation and sparse PCA in the spiked covariance model. For both of these problems, we provide the first efficient algorithms that provide non-trivial error guarantees in the presence of noise, using only a number of samples which is similar to the number required for these problems without noise. In particular, our sample complexities are sublinear in the ambient dimension $d$. Our work also suggests evidence for new computational-vs-statistical gaps for these problems (similar to those for sparse PCA without noise) which only arise in the presence of noise.
  • We study the fixed design segmented regression problem: Given noisy samples from a piecewise linear function $f$, we want to recover $f$ up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample near-linear time algorithms for the problem that -- while not being minimax optimal -- achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of $2$ to $4$, while achieving speedups of three orders of magnitude.
  • Learning a Gaussian mixture model (GMM) is a fundamental problem in machine learning, learning theory, and statistics. One notion of learning a GMM is proper learning: here, the goal is to find a mixture of $k$ Gaussians $\mathcal{M}$ that is close to the density $f$ of the unknown distribution from which we draw samples. The distance between $\mathcal{M}$ and $f$ is typically measured in the total variation or $L_1$-norm. We give an algorithm for learning a mixture of $k$ univariate Gaussians that is nearly optimal for any fixed $k$. The sample complexity of our algorithm is $\tilde{O}(\frac{k}{\epsilon^2})$ and the running time is $(k \cdot \log\frac{1}{\epsilon})^{O(k^4)} + \tilde{O}(\frac{k}{\epsilon^2})$. It is well-known that this sample complexity is optimal (up to logarithmic factors), and it was already achieved by prior work. However, the best known time complexity for proper learning a $k$-GMM was $\tilde{O}(\frac{1}{\epsilon^{3k-1}})$. In particular, the dependence between $\frac{1}{\epsilon}$ and $k$ was exponential. We significantly improve this dependence by replacing the $\frac{1}{\epsilon}$ term with a $\log \frac{1}{\epsilon}$ while only increasing the exponent moderately. Hence, for any fixed $k$, the $\tilde{O} (\frac{k}{\epsilon^2})$ term dominates our running time, and thus our algorithm runs in time which is nearly-linear in the number of samples drawn. Achieving a running time of $\textrm{poly}(k, \frac{1}{\epsilon})$ for proper learning of $k$-GMMs has recently been stated as an open problem by multiple researchers, and we make progress on this question. Moreover, our approach offers an agnostic learning guarantee: our algorithm returns a good GMM even if the distribution we are sampling from is not a mixture of Gaussians. To the best of our knowledge, our algorithm is the first agnostic proper learning algorithm for GMMs.
  • We design a new, fast algorithm for agnostically learning univariate probability distributions whose densities are well approximated by piecewise polynomial functions. Let $f$ be the density function of an arbitrary univariate distribution, and suppose that $f$ is $\mathrm{OPT}$-close in $L_1$-distance to an unknown piecewise polynomial function with $t$ interval pieces and degree $d$. Our algorithm draws $n = O(t(d+1)/\epsilon^2)$ samples from $f$, runs in time $\tilde{O}(n \cdot \mathrm{poly}(d))$, and with probability at least $9/10$ outputs an $O(t)$-piecewise degree-$d$ hypothesis $h$ that is $4 \cdot \mathrm{OPT} +\epsilon$ close to $f$. Our general algorithm yields (nearly) sample-optimal and nearly-linear time estimators for a wide range of structured distribution families over both continuous and discrete domains in a unified way. For most of our applications, these are the first sample-optimal and nearly-linear time estimators in the literature. As a consequence, our work resolves the sample and computational complexities of a broad class of inference tasks via a single "meta-algorithm". Moreover, we experimentally demonstrate that our algorithm performs very well in practice. Our algorithm consists of three "levels": (i) At the top level, we employ an iterative greedy algorithm for finding a good partition of the real line into the pieces of a piecewise polynomial. (ii) For each piece, we show that the sub-problem of finding a good polynomial fit on the current interval can be solved efficiently with a separation oracle method. (iii) We reduce the task of finding a separating hyperplane to a combinatorial problem and give an efficient algorithm for this problem. Combining these three procedures gives a density estimation algorithm with the claimed guarantees.
  • A Fibonacci heap is a deterministic data structure implementing a priority queue with optimal amortized operation costs. An unfortunate aspect of Fibonacci heaps is that they must maintain a "mark bit" which serves only to ensure efficiency of heap operations, not correctness. Karger proposed a simple randomized variant of Fibonacci heaps in which mark bits are replaced by coin flips. This variant still has expected amortized cost $O(1)$ for insert, decrease-key, and merge. Karger conjectured that this data structure has expected amortized cost $O(\log s)$ for delete-min, where $s$ is the number of heap operations. We give a tight analysis of Karger's randomized Fibonacci heaps, resolving Karger's conjecture. Specifically, we obtain matching upper and lower bounds of $\Theta(\log^2 s / \log \log s)$ for the runtime of delete-min. We also prove a tight lower bound of $\Omega(\sqrt{n})$ on delete-min in terms of the number of heap elements $n$. The request sequence used to prove this bound also solves an open problem of Fredman on whether cascading cuts are necessary. Finally, we give a simple additional modification to these heaps which yields a tight runtime $O(\log^2 n / \log \log n)$ for delete-min.
  • All consensus hierarchies in the literature assume that we have, in addition to copies of a given object, an unbounded number of registers. But why do we really need these registers? This paper considers what would happen if one attempts to solve consensus using various objects but without any registers. We show that under a reasonable assumption, objects like queues and stacks cannot emulate the missing registers. We also show that, perhaps surprisingly, initialization, shown to have no computational consequences when registers are readily available, is crucial in determining the synchronization power of objects when no registers are allowed. Finally, we show that without registers, the number of available objects affects the level of consensus that can be solved. Our work thus raises the question of whether consensus hierarchies which assume an unbounded number of registers truly capture synchronization power, and begins a line of research aimed at better understanding the interaction between read-write memory and the powerful synchronization operations available on modern architectures.
  • Query evaluation in tuple-independent probabilistic databases is the problem of computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. There are two main approaches to this problem: (1) in `grounded inference' one first obtains the lineage for the query and database instance as a Boolean formula, then performs weighted model counting on the lineage (i.e., computes the probability of the lineage given probabilities of its independent Boolean variables); (2) in methods known as `lifted inference' or `extensional query evaluation', one exploits the high-level structure of the query as a first-order formula. Although it is widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this paper we show such a formal separation for the first time. We exhibit a class of queries for which model counting can be done in polynomial time using extensional query evaluation, whereas the algorithms used in state-of-the-art exact model counters on their lineages provably require exponential time. Our lower bounds on the running times of these exact model counters follow from new exponential size lower bounds on the kinds of d-DNNF representations of the lineages that these model counters (either explicitly or implicitly) produce. Though some of these queries have been studied before, no non-trivial lower bounds on the sizes of these representations for these queries were previously known.
  • The best current methods for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas can be seen, either directly or indirectly, as building 'decision-DNNF' (decision decomposable negation normal form) representations of the input Boolean formulas. Decision-DNNFs are a special case of 'd-DNNF's where 'd' stands for 'deterministic'. We show that any decision-DNNF can be converted into an equivalent 'FBDD' (free binary decision diagram) -- also known as a 'read-once branching program' (ROBP or 1-BP) -- with only a quasipolynomial increase in representation size in general, and with only a polynomial increase in size in the special case of monotone k-DNF formulas. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs which provide lower bounds for the recent algorithms. We also separate the power of decision-DNNFs from d-DNNFs and a generalization of decision-DNNFs known as AND-FBDDs. Finally we show how these imply exponential lower bounds for natural problems associated with probabilistic databases.