
We study highdimensional distribution learning in an agnostic setting where
an adversary is allowed to arbitrarily corrupt an $\varepsilon$fraction of the
samples. Such questions have a rich history spanning statistics, machine
learning and theoretical computer science. Even in the most basic settings, the
only known approaches are either computationally inefficient or lose
dimensiondependent factors in their error guarantees. This raises the
following question:Is highdimensional agnostic distribution learning even
possible, algorithmically?
In this work, we obtain the first computationally efficient algorithms with
dimensionindependent error guarantees for agnostically learning several
fundamental classes of highdimensional distributions: (1) a single Gaussian,
(2) a product distribution on the hypercube, (3) mixtures of two product
distributions (under a natural balancedness condition), and (4) mixtures of
spherical Gaussians. Our algorithms achieve error that is independent of the
dimension, and in many cases scales nearlylinearly with the fraction of
adversarially corrupted samples. Moreover, we develop a general recipe for
detecting and correcting corruptions in highdimensions, that may be applicable
to many other problems.

Here we study the problem of sampling random proper colorings of a bounded
degree graph. Let $k$ be the number of colors and let $d$ be the maximum
degree. In 1999, Vigoda showed that the Glauber dynamics is rapidly mixing for
any $k > \frac{11}{6} d$. It turns out that there is a natural barrier at
$\frac{11}{6}$, below which there is no onestep coupling that is contractive,
even for the flip dynamics.
We use linear programming and duality arguments to guide our construction of
a better coupling. We fully characterize the obstructions to going beyond
$\frac{11}{6}$. These examples turn out to be quite brittle, and even starting
from one, they are likely to break apart before the flip dynamics changes the
distance between two neighboring colorings. We use this intuition to design a
variable length coupling that shows that the Glauber dynamics is rapidly mixing
for any $k\ge \left(\frac{11}{6}  \epsilon_0\right)d$ where $\epsilon_0 \geq
9.4 \cdot 10^{5}$. This is the first improvement to Vigoda's analysis that
holds for general graphs.

Learning mixtures of $k$ binary product distributions is a central problem in
computational learning theory, but one where there are wide gaps between the
best known algorithms and lower bounds (even for restricted families of
algorithms). We narrow many of these gaps by developing novel insights about
how to reason about higher order multilinear moments. Our results include:
1) An $n^{O(k^2)}$ time algorithm for learning mixtures of binary product
distributions, giving the first improvement on the $n^{O(k^3)}$ time algorithm
of Feldman, O'Donnell and Servedio
2) An $n^{\Omega(\sqrt{k})}$ statistical query lower bound, improving on the
$n^{\Omega(\log k)}$ lower bound that is based on connections to sparse parity
with noise
3) An $n^{O(\log k)}$ time algorithm for learning mixtures of $k$ subcubes.
This special case can still simulate many other hard learning problems, but is
much richer than any of them alone. As a corollary, we obtain more flexible
algorithms for learning decision trees under the uniform distribution, that
work with stochastic transitions, when we are only given positive examples and
with a polylogarithmic number of samples for any fixed $k$.
Our algorithms are based on a winwin analysis where we either build a basis
for the moments or locate a degeneracy that can be used to simplify the
problem, which we believe will have applications to other learning problems
over discrete domains.

Robust estimation is much more challenging in high dimensions than it is in
one dimension: Most techniques either lead to intractable optimization problems
or estimators that can tolerate only a tiny fraction of errors. Recent work in
theoretical computer science has shown that, in appropriate distributional
models, it is possible to robustly estimate the mean and covariance with
polynomial time algorithms that can tolerate a constant fraction of
corruptions, independent of the dimension. However, the sample and time
complexity of these algorithms is prohibitively large for highdimensional
applications. In this work, we address both of these issues by establishing
sample complexity bounds that are optimal, up to logarithmic factors, as well
as giving various refinements that allow the algorithms to tolerate a much
larger fraction of corruptions. Finally, we show on both synthetic and real
data that our algorithms have stateoftheart performance and suddenly make
highdimensional robust estimation a realistic possibility.

Determinantal point processes (DPPs) have wideranging applications in
machine learning, where they are used to enforce the notion of diversity in
subset selection problems. Many estimators have been proposed, but surprisingly
the basic properties of the maximum likelihood estimator (MLE) have received
little attention. In this paper, we study the local geometry of the expected
loglikelihood function to prove several rates of convergence for the MLE. We
also give a complete characterization of the case where the MLE converges at a
parametric rate. Even in the latter case, we also exhibit a potential curse of
dimensionality where the asymptotic variance of the MLE is exponentially large
in the dimension of the problem.

Determinantal point processes (DPPs) have wideranging applications in
machine learning, where they are used to enforce the notion of diversity in
subset selection problems. Many estimators have been proposed, but surprisingly
the basic properties of the maximum likelihood estimator (MLE) have received
little attention. The difficulty is that it is a nonconcave maximization
problem, and such functions are notoriously difficult to understand in high
dimensions, despite their importance in modern machine learning. Here we study
both the local and global geometry of the expected loglikelihood function. We
prove several rates of convergence for the MLE and give a complete
characterization of the case where these are parametric. We also exhibit a
potential curse of dimensionality where the asymptotic variance of the MLE
scales exponentially with the dimension of the problem. Moreover, we exhibit an
exponential number of saddle points, and give evidence that these may be the
only critical points.

Markov random fields area popular model for highdimensional probability
distributions. Over the years, many mathematical, statistical and algorithmic
problems on them have been studied. Until recently, the only known algorithms
for provably learning them relied on exhaustive search, correlation decay or
various incoherence assumptions. Bresler gave an algorithm for learning general
Ising models on bounded degree graphs. His approach was based on a structural
result about mutual information in Ising models.
Here we take a more conceptual approach to proving lower bounds on the mutual
information through setting up an appropriate zerosum game. Our proof
generalizes well beyond Ising models, to arbitrary Markov random fields with
higher order interactions. As an application, we obtain algorithms for learning
Markov random fields on bounded degree graphs on $n$ nodes with $r$order
interactions in $n^r$ time and $\log n$ sample complexity. The sample
complexity is information theoretically optimal up to the dependence on the
maximum degree. The running time is nearly optimal under standard conjectures
about the hardness of learning parity with noise.

We study the fundamental problem of learning the parameters of a
highdimensional Gaussian in the presence of noise  where an
$\varepsilon$fraction of our samples were chosen by an adversary. We give
robust estimators that achieve estimation error $O(\varepsilon)$ in the total
variation distance, which is optimal up to a universal constant that is
independent of the dimension.
In the case where just the mean is unknown, our robustness guarantee is
optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$
and $1/\epsilon$. When both the mean and covariance are unknown, the running
time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all
of our algorithms require only a polynomial number of samples. Our work shows
that the same sorts of error guarantees that were established over fifty years
ago in the onedimensional setting can also be achieved by efficient algorithms
in highdimensional settings.

In this paper we introduce a new approach for approximately counting in
bounded degree systems with higherorder constraints. Our main result is an
algorithm to approximately count the number of solutions to a CNF formula
$\Phi$ when the width is logarithmic in the maximum degree. This closes an
exponential gap between the known upper and lower bounds.
Moreover our algorithm extends straightforwardly to approximate sampling,
which shows that under Lov\'asz Local Lemmalike conditions it is not only
possible to find a satisfying assignment, it is also possible to generate one
approximately uniformly at random from the set of all satisfying assignments.
Our approach is a significant departure from earlier techniques in approximate
counting, and is based on a framework to bootstrap an oracle for computing
marginal probabilities on individual variables. Finally, we give an application
of our results to show that it is algorithmically possible to sample from the
posterior distribution in an interesting class of graphical models.

Determinantal Point Processes (DPPs) are a family of probabilistic models
that have a repulsive behavior, and lend themselves naturally to many tasks in
machine learning where returning a diverse set of objects is important. While
there are fast algorithms for sampling, marginalization and conditioning, much
less is known about learning the parameters of a DPP. Our contribution is
twofold: (i) we establish the optimal sample complexity achievable in this
problem and show that it is governed by a natural parameter, which we call the
\emph{cycle sparsity}; (ii) we propose a provably fast combinatorial algorithm
that implements the method of moments efficiently and achieves optimal sample
complexity. Finally, we give experimental results that confirm our theoretical
findings.

A central problem of random matrix theory is to understand the eigenvalues of
spiked random matrix models, in which a prominent eigenvector is planted into a
random matrix. These distributions form natural statistical models for
principal component analysis (PCA) problems throughout the sciences. Baik, Ben
Arous and P\'ech\'e showed that the spiked Wishart ensemble exhibits a sharp
phase transition asymptotically: when the signal strength is above a critical
threshold, it is possible to detect the presence of a spike based on the top
eigenvalue, and below the threshold the top eigenvalue provides no information.
Such results form the basis of our understanding of when PCA can detect a
lowrank signal in the presence of noise.
However, not all the information about the spike is necessarily contained in
the spectrum. We study the fundamental limitations of statistical methods,
including nonspectral ones. Our results include:
I) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal
detection threshold for a variety of benign priors for the spike. We extend
previous work on the spherically symmetric and i.i.d. Rademacher priors through
an elementary, unified analysis.
II) For any nonGaussian Wigner ensemble, we show that PCA is always
suboptimal for detection. However, a variant of PCA achieves the optimal
threshold (for benign priors) by pretransforming the matrix entries according
to a carefully designed function. This approach has been stated before, and we
give a rigorous and general analysis.
III) For both the Gaussian Wishart ensemble and various synchronization
problems over groups, we show that inefficient procedures can work below the
threshold where PCA succeeds, whereas no known efficient algorithm achieves
this. This conjectural gap between what is statistically possible and what can
be done efficiently remains open.

Various alignment problems arising in cryoelectron microscopy, community
detection, time synchronization, computer vision, and other fields fall into a
common framework of synchronization problems over compact groups such as Z/L,
U(1), or SO(3). The goal of such problems is to estimate an unknown vector of
group elements given noisy relative observations. We present an efficient
iterative algorithm to solve a large class of these problems, allowing for any
compact group, with measurements on multiple 'frequency channels' (Fourier
modes, or more generally, irreducible representations of the group). Our
algorithm is a highly efficient iterative method following the blueprint of
approximate message passing (AMP), which has recently arisen as a central
technique for inference problems such as structured lowrank estimation and
compressed sensing. We augment the standard ideas of AMP with ideas from
representation theory so that the algorithm can work with distributions over
compact groups. Using standard but nonrigorous methods from statistical
physics we analyze the behavior of our algorithm on a Gaussian noise model,
identifying phases where the problem is easy, (computationally) hard, and
(statistically) impossible. In particular, such evidence predicts that our
algorithm is informationtheoretically optimal in many cases, and that the
remaining cases show evidence of statisticaltocomputational gaps.

Recently, there has been considerable progress on designing algorithms with
provable guarantees  typically using linear algebraic methods  for
parameter learning in latent variable models. But designing provable algorithms
for inference has proven to be more challenging. Here we take a first step
towards provable inference in topic models. We leverage a property of topic
models that enables us to construct simple linear estimators for the unknown
topic proportions that have small variance, and consequently can work with
short documents. Our estimators also correspond to finding an estimate around
which the posterior is wellconcentrated. We show lower bounds that for shorter
documents it can be information theoretically impossible to find the hidden
topics. Finally, we give empirical results that demonstrate that our algorithm
works on realistic topic models. It yields good solutions on synthetic data and
runs in time comparable to a {\em single} iteration of Gibbs sampling.

We prove that with high probability over the choice of a random graph $G$
from the Erd\H{o}sR\'enyi distribution $G(n,1/2)$, the $n^{O(d)}$time degree
$d$ SumofSquares semidefinite programming relaxation for the clique problem
will give a value of at least $n^{1/2c(d/\log n)^{1/2}}$ for some constant
$c>0$. This yields a nearly tight $n^{1/2  o(1)}$ bound on the value of this
program for any degree $d = o(\log n)$. Moreover we introduce a new framework
that we call \emph{pseudocalibration} to construct Sum of Squares lower
bounds. This framework is inspired by taking a computational analog of Bayesian
probability theory. It yields a general recipe for constructing good
pseudodistributions (i.e., dual certificates for the SumofSquares
semidefinite program), and sheds further light on the ways in which this
hierarchy differs from others.

The stochastic block model is one of the oldest and most ubiquitous models
for studying clustering and community detection. In an exciting sequence of
developments, motivated by deep but nonrigorous ideas from statistical
physics, Decelle et al. conjectured a sharp threshold for when community
detection is possible in the sparse regime. Mossel, Neeman and Sly and
Massoulie proved the conjecture and gave matching algorithms and lower bounds.
Here we revisit the stochastic block model from the perspective of semirandom
models where we allow an adversary to make `helpful' changes that strengthen
ties within each community and break ties between them. We show a surprising
result that these `helpful' changes can shift the informationtheoretic
threshold, making the community detection problem strictly harder. We
complement this by showing that an algorithm based on semidefinite programming
(which was known to get close to the threshold) continues to work in the
semirandom model (even for partial recovery). This suggests that algorithms
based on semidefinite programming are robust in ways that any algorithm meeting
the informationtheoretic threshold cannot be.
These results point to an interesting new direction: Can we find robust,
semirandom analogues to some of the classical, averagecase thresholds in
statistics? We also explore this question in the broadcast tree model, and we
show that the viewpoint of semirandom models can help explain why some
algorithms are preferred to others in practice, in spite of the gaps in their
statistical performance on random models.

In the noisy tensor completion problem we observe $m$ entries (whose location
is chosen uniformly at random) from an unknown $n_1 \times n_2 \times n_3$
tensor $T$. We assume that $T$ is entrywise close to being rank $r$. Our goal
is to fill in its missing entries using as few observations as possible. Let $n
= \max(n_1, n_2, n_3)$. We show that if $m = n^{3/2} r$ then there is a
polynomial time algorithm based on the sixth level of the sumofsquares
hierarchy for completing it. Our estimate agrees with almost all of $T$'s
entries almost exactly and works even when our observations are corrupted by
noise. This is also the first algorithm for tensor completion that works in the
overcomplete case when $r > n$, and in fact it works all the way up to $r =
n^{3/2\epsilon}$.
Our proofs are short and simple and are based on establishing a new
connection between noisy tensor completion (through the language of Rademacher
complexity) and the task of refuting random constant satisfaction problems.
This connection seems to have gone unnoticed even in the context of matrix
completion. Furthermore, we use this connection to show matching lower bounds.
Our main technical result is in characterizing the Rademacher complexity of the
sequence of norms that arise in the sumofsquares relaxations to the tensor
nuclear norm. These results point to an interesting new direction: Can we
explore computational vs. sample complexity tradeoffs through the
sumofsquares hierarchy?

We show that for any odd $k$ and any instance of the MaxkXOR constraint
satisfaction problem, there is an efficient algorithm that finds an assignment
satisfying at least a $\frac{1}{2} + \Omega(1/\sqrt{D})$ fraction of
constraints, where $D$ is a bound on the number of constraints that each
variable occurs in. This improves both qualitatively and quantitatively on the
recent work of Farhi, Goldstone, and Gutmann (2014), which gave a
\emph{quantum} algorithm to find an assignment satisfying a $\frac{1}{2} +
\Omega(D^{3/4})$ fraction of the equations.
For arbitrary constraint satisfaction problems, we give a similar result for
"trianglefree" instances; i.e., an efficient algorithm that finds an
assignment satisfying at least a $\mu + \Omega(1/\sqrt{D})$ fraction of
constraints, where $\mu$ is the fraction that would be satisfied by a uniformly
random assignment.

Superresolution is a fundamental task in imaging, where the goal is to
extract finegrained structure from coarsegrained measurements. Here we are
interested in a popular mathematical abstraction of this problem that has been
widely studied in the statistics, signal processing and machine learning
communities. We exactly resolve the threshold at which noisy superresolution
is possible. In particular, we establish a sharp phase transition for the
relationship between the cutoff frequency ($m$) and the separation ($\Delta$).
If $m > 1/\Delta + 1$, our estimator converges to the true values at an inverse
polynomial rate in terms of the magnitude of the noise. And when $m <
(1\epsilon) /\Delta$ no estimator can distinguish between a particular pair of
$\Delta$separated signals even if the magnitude of the noise is exponentially
small.
Our results involve making novel connections between {\em extremal functions}
and the spectral properties of Vandermonde matrices. We establish a sharp phase
transition for their condition number which in turn allows us to give the first
noise tolerance bounds for the matrix pencil method. Moreover we show that our
methods can be interpreted as giving preconditioners for Vandermonde matrices,
and we use this observation to design faster algorithms for superresolution.
We believe that these ideas may have other applications in designing faster
algorithms for other basic tasks in signal processing.

Sparse coding is a basic task in many fields including signal processing,
neuroscience and machine learning where the goal is to learn a basis that
enables a sparse representation of a given set of data, if one exists. Its
standard formulation is as a nonconvex optimization problem which is solved in
practice by heuristics based on alternating minimization. Re cent work has
resulted in several algorithms for sparse coding with provable guarantees, but
somewhat surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for understanding
alternating minimization which we leverage to analyze existing heuristics and
to design new ones also with provable guarantees. Some of these algorithms seem
implementable on simple neural architectures, which was the original motivation
of Olshausen and Field (1997a) in introducing sparse coding. We also give the
first efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent dictionaries. All
previous algorithms that approached or surpassed this limit run in time
exponential in some natural parameter. Finally, our algorithms improve upon the
sample complexity of existing approaches. We believe that our analysis
framework will have applications in other settings where simple iterative
algorithms are used.

In sparse recovery we are given a matrix $A$ (the dictionary) and a vector of
the form $A X$ where $X$ is sparse, and the goal is to recover $X$. This is a
central notion in signal processing, statistics and machine learning. But in
applications such as sparse coding, edge detection, compression and super
resolution, the dictionary $A$ is unknown and has to be learned from random
examples of the form $Y = AX$ where $X$ is drawn from an appropriate
distribution  this is the dictionary learning problem. In most settings, $A$
is overcomplete: it has more columns than rows. This paper presents a
polynomialtime algorithm for learning overcomplete dictionaries; the only
previously known algorithm with provable guarantees is the recent work of
Spielman, Wang and Wright who gave an algorithm for the fullrank case, which
is rarely the case in applications. Our algorithm applies to incoherent
dictionaries which have been a central object of study since they were
introduced in seminal work of Donoho and Huo. In particular, a dictionary is
$\mu$incoherent if each pair of columns has inner product at most $\mu /
\sqrt{n}$.
The algorithm makes natural stochastic assumptions about the unknown sparse
vector $X$, which can contain $k \leq c \min(\sqrt{n}/\mu \log n, m^{1/2
\eta})$ nonzero entries (for any $\eta > 0$). This is close to the best $k$
allowable by the best sparse recovery algorithms even if one knows the
dictionary $A$ exactly. Moreover, both the running time and sample complexity
depend on $\log 1/\epsilon$, where $\epsilon$ is the target accuracy, and so
our algorithms converge very quickly to the true dictionary. Our algorithm can
also tolerate substantial amounts of noise provided it is incoherent with
respect to the dictionary (e.g., Gaussian). In the noisy setting, our running
time and sample complexity depend polynomially on $1/\epsilon$, and this is
necessary.

Low rank tensor decompositions are a powerful tool for learning generative
models, and uniqueness results give them a significant advantage over matrix
decomposition methods. However, tensors pose significant algorithmic challenges
and tensors analogs of much of the matrix algebra toolkit are unlikely to exist
because of hardness results. Efficient decomposition in the overcomplete case
(where rank exceeds dimension) is particularly challenging. We introduce a
smoothed analysis model for studying these questions and develop an efficient
algorithm for tensor decomposition in the highly overcomplete case (rank
polynomial in the dimension). In this setting, we show that our algorithm is
robust to inverse polynomial error  a crucial property for applications in
learning since we are only allowed a polynomial number of samples. While
algorithms are known for exact tensor decomposition in some overcomplete
settings, our main contribution is in analyzing their stability in the
framework of smoothed analysis.
Our main technical contribution is to show that tensor products of perturbed
vectors are linearly independent in a robust sense (i.e. the associated matrix
has singular values that are at least an inverse polynomial). This key result
paves the way for applying tensor methods to learning problems in the smoothed
setting. In particular, we use it to obtain results for learning multiview
models and mixtures of axisaligned Gaussians where there are many more
"components" than dimensions. The assumption here is that the model is not
adversarially chosen, formalized by a perturbation of model parameters. We
believe this an appealing way to analyze realistic instances of learning
problems, since this framework allows us to overcome many of the usual
limitations of using tensor methods.

We consider a fundamental problem in unsupervised learning called
\emph{subspace recovery}: given a collection of $m$ points in $\mathbb{R}^n$,
if many but not necessarily all of these points are contained in a
$d$dimensional subspace $T$ can we find it? The points contained in $T$ are
called {\em inliers} and the remaining points are {\em outliers}. This problem
has received considerable attention in computer science and in statistics. Yet
efficient algorithms from computer science are not robust to {\em adversarial}
outliers, and the estimators from robust statistics are hard to compute in high
dimensions.
Are there algorithms for subspace recovery that are both robust to outliers
and efficient? We give an algorithm that finds $T$ when it contains more than a
$\frac{d}{n}$ fraction of the points. Hence, for say $d = n/2$ this estimator
is both easy to compute and wellbehaved when there are a constant fraction of
outliers. We prove that it is Small Set Expansion hard to find $T$ when the
fraction of errors is any larger, thus giving evidence that our estimator is an
{\em optimal} compromise between efficiency and robustness.
As it turns out, this basic problem has a surprising number of connections to
other areas including small set expansion, matroid theory and functional
analysis that we make use of here.

We consider a problem which has received considerable attention in systems
literature because of its applications to routing in delay tolerant networks
and replica placement in distributed storage systems. In abstract terms the
problem can be stated as follows: Given a random variable $X$ generated by a
known product distribution over $\{0,1\}^n$ and a target value $0 \leq \theta
\leq 1$, output a nonnegative vector $w$, with $\w\_1 \le 1$, which
maximizes the probability of the event $w \cdot X \ge \theta$. This is a
challenging nonconvex optimization problem for which even computing the value
$\Pr[w \cdot X \ge \theta]$ of a proposed solution vector $w$ is #Phard.
We provide an additive EPTAS for this problem which, for constantbounded
product distributions, runs in $ \poly(n) \cdot 2^{\poly(1/\eps)}$ time and
outputs an $\eps$approximately optimal solution vector $w$ for this problem.
Our approach is inspired by, and extends, recent structural results from the
complexitytheoretic study of linear threshold functions. Furthermore, in spite
of the objective function being nonsmooth, we give a \emph{unicriterion} PTAS
while previous work for such objective functions has typically led to a
\emph{bicriterion} PTAS. We believe our techniques may be applicable to get
unicriterion PTAS for other nonsmooth objective functions.

We give a polynomial time algorithm for the lossy population recovery
problem. In this problem, the goal is to approximately learn an unknown
distribution on binary strings of length $n$ from lossy samples: for some
parameter $\mu$ each coordinate of the sample is preserved with probability
$\mu$ and otherwise is replaced by a `?'. The running time and number of
samples needed for our algorithm is polynomial in $n$ and $1/\varepsilon$ for
each fixed $\mu>0$. This improves on algorithm of Wigderson and Yehudayoff that
runs in quasipolynomial time for any $\mu > 0$ and the polynomial time
algorithm of Dvir et al which was shown to work for $\mu \gtrapprox 0.30$ by
Batman et al. In fact, our algorithm also works in the more general framework
of Batman et al. in which there is no a priori bound on the size of the support
of the distribution. The algorithm we analyze is implicit in previous work; our
main contribution is to analyze the algorithm by showing (via linear
programming duality and connections to complex analysis) that a certain matrix
associated with the problem has a robust local inverse even though its
condition number is exponentially small. A corollary of our result is the first
polynomial time algorithm for learning DNFs in the restriction access model of
Dvir et al.

Topic models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to topic model
inference have been based on a maximum likelihood objective. Efficient
algorithms exist that approximate this objective, but they have no provable
guarantees. Recently, algorithms have been introduced that provide provable
bounds, but these algorithms are not practical because they are inefficient and
not robust to violations of model assumptions. In this paper we present an
algorithm for topic model inference that is both provable and practical. The
algorithm produces results comparable to the best MCMC implementations while
running orders of magnitude faster.