
We study highdimensional distribution learning in an agnostic setting where
an adversary is allowed to arbitrarily corrupt an $\varepsilon$fraction of the
samples. Such questions have a rich history spanning statistics, machine
learning and theoretical computer science. Even in the most basic settings, the
only known approaches are either computationally inefficient or lose
dimensiondependent factors in their error guarantees. This raises the
following question:Is highdimensional agnostic distribution learning even
possible, algorithmically?
In this work, we obtain the first computationally efficient algorithms with
dimensionindependent error guarantees for agnostically learning several
fundamental classes of highdimensional distributions: (1) a single Gaussian,
(2) a product distribution on the hypercube, (3) mixtures of two product
distributions (under a natural balancedness condition), and (4) mixtures of
spherical Gaussians. Our algorithms achieve error that is independent of the
dimension, and in many cases scales nearlylinearly with the fraction of
adversarially corrupted samples. Moreover, we develop a general recipe for
detecting and correcting corruptions in highdimensions, that may be applicable
to many other problems.

Given samples from an unknown multivariate distribution $p$, is it possible
to distinguish whether $p$ is the product of its marginals versus $p$ being far
from every product distribution? Similarly, is it possible to distinguish
whether $p$ equals a given distribution $q$ versus $p$ and $q$ being far from
each other? These problems of testing independence and goodnessoffit have
received enormous attention in statistics, information theory, and theoretical
computer science, with sampleoptimal algorithms known in several interesting
regimes of parameters. Unfortunately, it has also been understood that these
problems become intractable in large dimensions, necessitating exponential
sample complexity.
Motivated by the exponential lower bounds for general distributions as well
as the ubiquity of Markov Random Fields (MRFs) in the modeling of
highdimensional distributions, we initiate the study of distribution testing
on structured multivariate distributions, and in particular the prototypical
example of MRFs: the Ising Model. We demonstrate that, in this structured
setting, we can avoid the curse of dimensionality, obtaining sample and time
efficient testers for independence and goodnessoffit. One of the key
technical challenges we face along the way is bounding the variance of
functions of the Ising model.

A recent model for property testing of probability distributions (Chakraborty
et al., ITCS 2013, Canonne et al., SICOMP 2015) enables tremendous savings in
the sample complexity of testing algorithms, by allowing them to condition the
sampling on subsets of the domain. In particular, Canonne, Ron, and Servedio
(SICOMP 2015) showed that, in this setting, testing identity of an unknown
distribution $D$ (whether $D=D^\ast$ for an explicitly known $D^\ast$) can be
done with a constant number of queries, independent of the support size $n$ 
in contrast to the required $\Omega(\sqrt{n})$ in the standard sampling model.
It was unclear whether the same stark contrast exists for the case of testing
equivalence, where both distributions are unknown. While Canonne et al.
established a $\mathrm{poly}(\log n)$query upper bound for equivalence
testing, very recently brought down to $\tilde O(\log\log n)$ by Falahatgar et
al. (COLT 2015), whether a dependence on the domain size $n$ is necessary was
still open, and explicitly posed by Fischer at the Bertinoro Workshop on
Sublinear Algorithms (2014). We show that any testing algorithm for equivalence
must make $\Omega(\sqrt{\log\log n})$ queries in the conditional sampling
model. This demonstrates a gap between identity and equivalence testing, absent
in the standard sampling model (where both problems have sampling complexity
$n^{\Theta(1)}$).
We also obtain results on the query complexity of uniformity testing and
supportsize estimation with conditional samples. We answer a question of
Chakraborty et al. (ITCS 2013) showing that nonadaptive uniformity testing
indeed requires $\Omega(\log n)$ queries in the conditional model. For the
related problem of supportsize estimation, we provide both adaptive and
nonadaptive algorithms, with query complexities $\mathrm{poly}(\log\log n)$
and $\mathrm{poly}(\log n)$, respectively.

We design nearly optimal differentially private algorithms for learning two
fundamental families of highdimensional distributions in total variation
distance: multivariate Gaussians in $\mathbb{R}^{d}$ and product distributions
on the hypercube. The sample complexity of both our algorithms approaches the
sample complexity of nonprivate learners up to a small multiplicative factor
and an additional additive term that is lower order for a wide range of
parameters, showing that privacy comes essentially for free for these problems.
Our algorithms use a novel technical approach to reducing the sensitivity of
the estimation procedure that we call recursive private preconditioning and may
find additional applications.

Robust estimation is much more challenging in high dimensions than it is in
one dimension: Most techniques either lead to intractable optimization problems
or estimators that can tolerate only a tiny fraction of errors. Recent work in
theoretical computer science has shown that, in appropriate distributional
models, it is possible to robustly estimate the mean and covariance with
polynomial time algorithms that can tolerate a constant fraction of
corruptions, independent of the dimension. However, the sample and time
complexity of these algorithms is prohibitively large for highdimensional
applications. In this work, we address both of these issues by establishing
sample complexity bounds that are optimal, up to logarithmic factors, as well
as giving various refinements that allow the algorithms to tolerate a much
larger fraction of corruptions. Finally, we show on both synthetic and real
data that our algorithms have stateoftheart performance and suddenly make
highdimensional robust estimation a realistic possibility.

In high dimensions, most machine learning methods are brittle to even a small
fraction of structured outliers. To address this, we introduce a new
metaalgorithm that can take in a base learner such as least squares or
stochastic gradient descent, and harden the learner to be resistant to
outliers. Our method, Sever, possesses strong theoretical guarantees yet is
also highly scalable  beyond running the base learner itself, it only
requires computing the top singular vector of a certain $n \times d$ matrix. We
apply Sever on a drug design dataset and a spam classification dataset, and
find that in both cases it has substantially greater robustness than several
baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$
test error, compared to $13.4\%20.5\%$ for the baselines, and $3\%$ error on
the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$
corruptions, we achieved $1.42$ meansquared error test error, compared to
$1.51$$2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.

We develop differentially private methods for estimating various
distributional properties. Given a sample from a discrete distribution $p$,
some functional $f$, and accuracy and privacy parameters $\alpha$ and
$\varepsilon$, the goal is to estimate $f(p)$ up to accuracy $\alpha$, while
maintaining $\varepsilon$differential privacy of the sample.
We prove almosttight bounds on the sample size required for this problem for
several functionals of interest, including support size, support coverage, and
entropy. We show that the cost of privacy is negligible in a variety of
settings, both theoretically and experimentally. Our methods are based on a
sensitivity analysis of several stateoftheart methods for estimating these
properties with sublinear sample complexities.

A generative model may generate utter nonsense when it is fit to maximize the
likelihood of observed data. This happens due to "model error," i.e., when the
true data generating distribution does not fit within the class of generative
models being learned. To address this, we propose a model of active
distribution learning using a binary invalidity oracle that identifies some
examples as clearly invalid, together with random positive examples sampled
from the true distribution. The goal is to maximize the likelihood of the
positive examples subject to the constraint of (almost) never generating
examples labeled invalid by the oracle. Guarantees are agnostic compared to a
class of probability distributions. We show that, while proper learning often
requires exponentially many queries to the invalidity oracle, improper
distribution learning can be done using polynomially many queries.

We prove neartight concentration of measure for polynomial functions of the
Ising model under high temperature. For any degree $d$, we show that a
degree$d$ polynomial of a $n$spin Ising model exhibits exponential tails that
scale as $\exp(r^{2/d})$ at radius $r=\tilde{\Omega}_d(n^{d/2})$. Our
concentration radius is optimal up to logarithmic factors for constant $d$,
improving known results by polynomial factors in the number of spins. We
demonstrate the efficacy of polynomial functions as statistics for testing the
strength of interactions in social networks in both synthetic and real world
data.

Given samples from an unknown distribution $p$ and a description of a
distribution $q$, are $p$ and $q$ close or far? This question of "identity
testing" has received significant attention in the case of testing whether $p$
and $q$ are equal or far in total variation distance. However, in recent work,
the following questions have been been critical to solving problems at the
frontiers of distribution testing:
Alternative Distances: Can we test whether $p$ and $q$ are far in other
distances, say Hellinger?
Robustness: Can we test when $p$ and $q$ are close, rather than equal? And
if so, close in which distances?
Motivated by these questions, we characterize the complexity of distribution
testing under a variety of distances, including total variation, $\ell_2$,
Hellinger, KullbackLeibler, and chisquared. For each pair of distances $d_1$
and $d_2$, we study the complexity of testing if $p$ and $q$ are close in $d_1$
versus far in $d_2$, with a focus on identifying which problems allow strongly
sublinear testers (i.e., those with complexity $O(n^{1  \gamma})$ for some
$\gamma > 0$ where $n$ is the size of the support of the distributions $p$ and
$q$). We provide matching upper and lower bounds for each case. We also study
these questions in the case where we only have samples from $q$ (equivalence
testing), showing qualitative differences from identity testing in terms of
when robustness can be achieved. Our algorithms fall into the classical
paradigm of chisquared statistics, but require crucial changes to handle the
challenges introduced by each distance we consider. Finally, we survey other
recent results in an attempt to serve as a reference for the complexity of
various distribution testing problems.

We develop differentially private hypothesis testing methods for the small
sample regime. Given a sample $\cal D$ from a categorical distribution $p$ over
some domain $\Sigma$, an explicitly described distribution $q$ over $\Sigma$,
some privacy parameter $\varepsilon$, accuracy parameter $\alpha$, and
requirements $\beta_{\rm I}$ and $\beta_{\rm II}$ for the type I and type II
errors of our test, the goal is to distinguish between $p=q$ and
$d_{\rm{TV}}(p,q) \geq \alpha$.
We provide theoretical bounds for the sample size ${\cal D}$ so that our
method both satisfies $(\varepsilon,0)$differential privacy, and guarantees
$\beta_{\rm I}$ and $\beta_{\rm II}$ type I and type II errors. We show that
differential privacy may come for free in some regimes of parameters, and we
always beat the sample complexity resulting from running the $\chi^2$test with
noisy counts, or standard approaches such as repetition for endowing
nonprivate $\chi^2$style statistics with differential privacy guarantees. We
experimentally compare the sample complexity of our method to that of recently
proposed methods for private hypothesis testing.

We study the fundamental problem of learning the parameters of a
highdimensional Gaussian in the presence of noise  where an
$\varepsilon$fraction of our samples were chosen by an adversary. We give
robust estimators that achieve estimation error $O(\varepsilon)$ in the total
variation distance, which is optimal up to a universal constant that is
independent of the dimension.
In the case where just the mean is unknown, our robustness guarantee is
optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$
and $1/\epsilon$. When both the mean and covariance are unknown, the running
time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all
of our algorithms require only a polynomial number of samples. Our work shows
that the same sorts of error guarantees that were established over fifty years
ago in the onedimensional setting can also be achieved by efficient algorithms
in highdimensional settings.

An $(n,k)$Poisson Multinomial Distribution (PMD) is the distribution of the
sum of $n$ independent random vectors supported on the set ${\cal
B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We show
that any $(n,k)$PMD is ${\rm poly}\left({k\over \sigma}\right)$close in total
variation distance to the (appropriately discretized) multidimensional
Gaussian with the same first two moments, removing the dependence on $n$ from
the Central Limit Theorem of Valiant and Valiant. Interestingly, our CLT is
obtained by bootstrapping the ValiantValiant CLT itself through the structural
characterization of PMDs shown in recent work by Daskalakis, Kamath, and
Tzamos. In turn, our stronger CLT can be leveraged to obtain an efficient PTAS
for approximate Nash equilibria in anonymous games, significantly improving the
state of the art, and matching qualitatively the running time dependence on $n$
and $1/\varepsilon$ of the best known algorithm for twostrategy anonymous
games. Our new CLT also enables the construction of covers for the set of
$(n,k)$PMDs, which are proper and whose size is shown to be essentially
optimal. Our cover construction combines our CLT with the ShapleyFolkman
theorem and recent sparsification results for Laplacian matrices by Batson,
Spielman, and Srivastava. Our cover size lower bound is based on an algebraic
geometric construction. Finally, leveraging the structural properties of the
Fourier spectrum of PMDs we show that these distributions can be learned from
$O_k(1/\varepsilon^2)$ samples in ${\rm poly}_k(1/\varepsilon)$time, removing
the quasipolynomial dependence of the running time on $1/\varepsilon$ from the
algorithm of Daskalakis, Kamath, and Tzamos.

Given samples from an unknown distribution $p$, is it possible to distinguish
whether $p$ belongs to some class of distributions $\mathcal{C}$ versus $p$
being far from every distribution in $\mathcal{C}$? This fundamental question
has received tremendous attention in statistics, focusing primarily on
asymptotic analysis, and more recently in information theory and theoretical
computer science, where the emphasis has been on small sample size and
computational complexity. Nevertheless, even for basic properties of
distributions such as monotonicity, logconcavity, unimodality, independence,
and monotonehazard rate, the optimal sample complexity is unknown.
We provide a general approach via which we obtain sampleoptimal and
computationally efficient testers for all these distribution families. At the
core of our approach is an algorithm which solves the following problem: Given
samples from an unknown distribution $p$, and a known distribution $q$, are $p$
and $q$ close in $\chi^2$distance, or far in total variation distance?
The optimality of our testers is established by providing matching lower
bounds with respect to both $n$ and $\varepsilon$. Finally, a necessary
building block for our testers and an important byproduct of our work are the
first known computationally efficient proper learners for discrete logconcave
and monotone hazard rate distributions.

An $(n,k)$Poisson Multinomial Distribution (PMD) is the distribution of the
sum of $n$ independent random vectors supported on the set ${\cal
B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We prove
a structural characterization of these distributions, showing that, for all
$\varepsilon >0$, any $(n, k)$Poisson multinomial random vector is
$\varepsilon$close, in total variation distance, to the sum of a discretized
multidimensional Gaussian and an independent $(\text{poly}(k/\varepsilon),
k)$Poisson multinomial random vector. Our structural characterization extends
the multidimensional CLT of Valiant and Valiant, by simultaneously applying to
all approximation requirements $\varepsilon$. In particular, it overcomes
factors depending on $\log n$ and, importantly, the minimum eigenvalue of the
PMD's covariance matrix from the distance to a multidimensional Gaussian random
variable.
We use our structural characterization to obtain an $\varepsilon$cover, in
total variation distance, of the set of all $(n, k)$PMDs, significantly
improving the cover size of Daskalakis and Papadimitriou, and obtaining the
same qualitative dependence of the cover size on $n$ and $\varepsilon$ as the
$k=2$ cover of Daskalakis and Papadimitriou. We further exploit this structure
to show that $(n,k)$PMDs can be learned to within $\varepsilon$ in total
variation distance from $\tilde{O}_k(1/\varepsilon^2)$ samples, which is
nearoptimal in terms of dependence on $\varepsilon$ and independent of $n$. In
particular, our result generalizes the singledimensional result of Daskalakis,
Diakonikolas, and Servedio for Poisson Binomials to arbitrary dimension.

We provide an algorithm for properly learning mixtures of two
singledimensional Gaussians without any separability assumptions. Given
$\tilde{O}(1/\varepsilon^2)$ samples from an unknown mixture, our algorithm
outputs a mixture that is $\varepsilon$close in total variation distance, in
time $\tilde{O}(1/\varepsilon^5)$. Our sample complexity is optimal up to
logarithmic factors, and significantly improves upon both Kalai et al., whose
algorithm has a prohibitive dependence on $1/\varepsilon$, and Feldman et al.,
whose algorithm requires bounds on the mixture parameters and depends
pseudopolynomially in these parameters.
One of our main contributions is an improved and generalized algorithm for
selecting a good candidate distribution from among competing hypotheses.
Namely, given a collection of $N$ hypotheses containing at least one candidate
that is $\varepsilon$close to an unknown distribution, our algorithm outputs a
candidate which is $O(\varepsilon)$close to the distribution. The algorithm
requires ${O}(\log{N}/\varepsilon^2)$ samples from the unknown distribution and
${O}(N \log N/\varepsilon^2)$ time, which improves previous such results (such
as the Scheff\'e estimator) from a quadratic dependence of the running time on
$N$ to quasilinear. Given the wide use of such results for the purpose of
hypothesis selection, our improved algorithm implies immediate improvements to
any such use.

We analyze the Schelling model of segregation in which a society of n
individuals live in a ring. Each individual is one of two races and is only
satisfied with his location so long as at least half his 2w nearest neighbors
are of the same race as him. In the dynamics, randomlychosen unhappy
individuals successively swap locations. We consider the average size of
monochromatic neighborhoods in the final stable state. Our analysis is the
first rigorous analysis of the Schelling dynamics. We note that, in contrast to
prior approximate analyses, the final state is nearly integrated: the average
size of monochromatic neighborhoods is independent of n and polynomial in w.