
We prove that stochastic gradient descent efficiently converges to the global
optimizer of the maximum likelihood objective of an unknown linear
timeinvariant dynamical system from a sequence of noisy observations generated
by the system. Even though the objective function is nonconvex, we provide
polynomial running time and sample complexity bounds under strong but natural
assumptions. Linear systems identification has been studied for many decades,
yet, to the best of our knowledge, these are the first polynomial guarantees
for the problem we consider.

Word embeddings are ubiquitous in NLP and information retrieval, but it is
unclear what they represent when the word is polysemous. Here it is shown that
multiple word senses reside in linear superposition within the word embedding
and simple sparse coding can recover vectors that approximately capture the
senses. The success of our approach, which applies to several embedding
methods, is mathematically explained using a variant of the random walk on
discourses model (Arora et al., 2016). A novel aspect of our technique is that
each extracted word sense is accompanied by one of about 2000 "discourse atoms"
that gives a succinct description of which other words cooccur with that word
sense. Discourse atoms can be of independent interest, and make the method
potentially more useful. Empirical tests are used to verify and support the
theory.

Matrix completion is a basic machine learning problem that has wide
applications, especially in collaborative filtering and recommender systems.
Simple nonconvex optimization algorithms are popular and effective in
practice. Despite recent progress in proving various nonconvex algorithms
converge from a good initial point, it remains unclear why random or arbitrary
initialization suffices in practice. We prove that the commonly used nonconvex
objective function for \textit{positive semidefinite} matrix completion has no
spurious local minima  all local minima must also be global. Therefore, many
popular optimization algorithms such as (stochastic) gradient descent can
provably solve positive semidefinite matrix completion with \textit{arbitrary}
initialization in polynomial time. The result can be generalized to the setting
when the observed entries contain noise. We believe that our main proof
strategy can be useful for understanding geometric properties of other
statistical problems involving partial or noisy observations.

An emerging design principle in deep learning is that each layer of a deep
artificial neural network should be able to easily express the identity
transformation. This idea not only motivated various normalization techniques,
such as \emph{batch normalization}, but was also key to the immense success of
\emph{residual networks}.
In this work, we put the principle of \emph{identity parameterization} on a
more solid theoretical footing alongside further empirical progress. We first
give a strikingly simple proof that arbitrarily deep linear residual networks
have no spurious local optima. The same result for linear feedforward networks
in their standard parameterization is substantially more delicate. Second, we
show that residual networks with ReLu activations have universal finitesample
expressivity in the sense that the network can represent any function of its
sample provided that the model has more parameters than the sample size.
Directly inspired by our theory, we experiment with a radically simple
residual architecture consisting of only residual convolutional layers and ReLu
activations, but no batch normalization, dropout, or max pool. Our model
improves significantly on previous allconvolutional networks on the CIFAR10,
CIFAR100, and ImageNet classification benchmarks.

We show that the gradient descent algorithm provides an implicit
regularization effect in the learning of overparameterized matrix
factorization models and onehiddenlayer neural networks with quadratic
activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear
measurements of a rank $r$ positive semidefinite matrix $X^{\star}$, we can
recover $X^{\star}$ by parameterizing it by $UU^\top$ with $U\in \mathbb
R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove
that starting from a small initialization, gradient descent recovers
$X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results
solve the conjecture of Gunasekar et al.'17 under the restricted isometry
property. The technique can be applied to analyzing neural networks with
onehiddenlayer quadratic activations with some technical modifications.

We show that training of generative adversarial network (GAN) may not have
good generalization properties; e.g., training may appear successful but the
trained distribution may be far from target distribution in standard metrics.
However, generalization does occur for a weaker metric called neural net
distance. It is also shown that an approximate pure equilibrium exists in the
discriminator/generator game for a special class of generators with natural
training objectives when generator capacity and training set sizes are
moderate.
This existence of equilibrium inspires MIX+GAN protocol, which can be
combined with any existing GAN training, and empirically shown to improve some
of them.

Nonconvex optimization with local search heuristics has been widely used in
machine learning, achieving many stateofart results. It becomes increasingly
important to understand why they can work for these NPhard problems on typical
data. The landscape of many objective functions in learning has been
conjectured to have the geometric property that "all local optima are
(approximately) global optima", and thus they can be solved efficiently by
local search algorithms. However, establishing such property can be very
difficult.
In this paper, we analyze the optimization landscape of the random
overcomplete tensor decomposition problem, which has many applications in
unsupervised learning, especially in learning latent variable models. In
practice, it can be efficiently solved by gradient ascent on a nonconvex
objective. We show that for any small constant $\epsilon > 0$, among the set of
points with function values $(1+\epsilon)$factor larger than the expectation
of the function, all the local maxima are approximate global maxima.
Previously, the bestknown result only characterizes the geometry in small
neighborhoods around the true components. Our result implies that even with an
initialization that is barely better than the random guess, the gradient ascent
algorithm is guaranteed to solve this problem.
Our main technique uses KacRice formula and random matrix theory. To our
best knowledge, this is the first time when KacRice formula is successfully
applied to counting the number of local minima of a highlystructured random
polynomial with dependent coefficients.

Deep neural nets have caused a revolution in many classification tasks. A
related ongoing revolutionalso theoretically not understoodconcerns their
ability to serve as generative models for complicated types of data such as
images and texts. These models are trained using ideas like variational
autoencoders and Generative Adversarial Networks.
We take a first cut at explaining the expressivity of multilayer nets by
giving a sufficient criterion for a function to be approximable by a neural
network with $n$ hidden layers. A key ingredient is Barron's Theorem
\cite{Barron1993}, which gives a Fourier criterion for approximability of a
function by a neural network with 1 hidden layer. We show that a composition of
$n$ functions which satisfy certain Fourier conditions ("Barron functions") can
be approximated by a $n+1$layer neural network.
For probability distributions, this translates into a criterion for a
probability distribution to be approximable in Wasserstein distancea natural
metric on probability distributionsby a neural network applied to a fixed
base distribution (e.g., multivariate gaussian).
Building up recent lower bound work, we also give an example function that
shows that composition of Barron functions is more expressive than Barron
functions alone.

We design a nonconvex secondorder optimization algorithm that is guaranteed
to return an approximate local minimum in time which scales linearly in the
underlying dimension and the number of training examples. The time complexity
of our algorithm to find an approximate local minimum is even faster than that
of gradient descent to find a critical point. Our algorithm applies to a
general class of optimization problems including training a neural network and
other nonconvex objectives arising in machine learning.

Many machine learning applications use latent variable models to explain
structure in data, whereby visible variables (= coordinates of the given
datapoint) are explained as a probabilistic function of some hidden variables.
Finding parameters with the maximum likelihood is NPhard even in very simple
settings. In recent years, provably efficient algorithms were nevertheless
developed for models with linear structures: topic models, mixture models,
hidden markov models, etc. These algorithms use matrix or tensor decomposition,
and make some reasonable assumptions about the parameters of the underlying
model.
But matrix or tensor decomposition seems of little use when the latent
variable model has nonlinearities. The current paper shows how to make
progress: tensor decomposition is applied for learning the singlelayer {\em
noisy or} network, which is a textbook example of a Bayes net, and used for
example in the classic QMRDT software for diagnosing which disease(s) a
patient may have by observing the symptoms he/she exhibits.
The technical novelty here, which should be useful in other settings in
future, is analysis of tensor decomposition in presence of systematic error
(i.e., where the noise/error is correlated with the signal, and doesn't
decrease as number of samples goes to infinity). This requires rethinking all
steps of tensor decomposition methods from the ground up.
For simplicity our analysis is stated assuming that the network parameters
were chosen from a probability distribution but the method seems more generally
applicable.

We give a novel formal theoretical framework for unsupervised learning with
two distinctive characteristics. First, it does not assume any generative model
and based on a worstcase performance metric. Second, it is comparative, namely
performance is measured with respect to a given hypothesis class. This allows
to avoid known computational hardness results and improper algorithms based on
convex relaxations. We show how several families of unsupervised learning
models, which were previously only analyzed under probabilistic assumptions and
are otherwise provably intractable, can be efficiently learned in our framework
by convex optimization.

We give new algorithms based on the sumofsquares method for tensor
decomposition. Our results improve the best known running times from
quasipolynomial to polynomial for several problems, including decomposing
random overcomplete 3tensors and learning overcomplete dictionaries with
constant relative sparsity. We also give the first robust analysis for
decomposing overcomplete 4tensors in the smoothed analysis model. A key
ingredient of our analysis is to establish small spectral gaps in moment
matrices derived from solutions to sumofsquares relaxations. To enable this
analysis we augment sumofsquares relaxations with spectral analogs of maximum
entropy constraints.

Semantic word embeddings represent the meaning of a word via a vector, and
are created by diverse methods. Many use nonlinear operations on cooccurrence
statistics, and have handtuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the
loglinear topic model of~\citet{mnih2007three}. The methodological novelty is
to use the prior to compute closed form expressions for word statistics. This
provides a theoretical justification for nonlinear models like PMI, word2vec,
and GloVe, as well as some hyperparameter choices. It also helps explain why
lowdimensional semantic embeddings contain linear algebraic structure that
allows solution of word analogies, as shown by~\citet{mikolov2013efficient} and
many subsequent papers.
Experimental support is provided for the generative model assumptions, the
most important of which is that latent word vectors are fairly uniformly
dispersed in space.

Recently, there has been considerable progress on designing algorithms with
provable guarantees  typically using linear algebraic methods  for
parameter learning in latent variable models. But designing provable algorithms
for inference has proven to be more challenging. Here we take a first step
towards provable inference in topic models. We leverage a property of topic
models that enables us to construct simple linear estimators for the unknown
topic proportions that have small variance, and consequently can work with
short documents. Our estimators also correspond to finding an estimate around
which the posterior is wellconcentrated. We show lower bounds that for shorter
documents it can be information theoretically impossible to find the hidden
topics. Finally, we give empirical results that demonstrate that our algorithm
works on realistic topic models. It yields good solutions on synthetic data and
runs in time comparable to a {\em single} iteration of Gibbs sampling.

We study the tradeoff between the statistical error and communication cost of
distributed statistical estimation problems in high dimensions. In the
distributed sparse Gaussian mean estimation problem, each of the $m$ machines
receives $n$ data points from a $d$dimensional Gaussian distribution with
unknown mean $\theta$ which is promised to be $k$sparse. The machines
communicate by message passing and aim to estimate the mean $\theta$. We
provide a tight (up to logarithmic factors) tradeoff between the estimation
error and the number of bits communicated between the machines. This directly
leads to a lower bound for the distributed \textit{sparse linear regression}
problem: to achieve the statistical minimax error, the total communication is
at least $\Omega(\min\{n,d\}m)$, where $n$ is the number of observations that
each machine receives and $d$ is the ambient dimension. These lower results
improve upon [Sha14,SD'14] by allowing multiround iterative communication
model. We also give the first optimal simultaneous protocol in the dense case
for mean estimation.
As our main technique, we prove a \textit{distributed data processing
inequality}, as a generalization of usual data processing inequalities, which
might be of independent interest and useful for other problems.

We study distributed optimization algorithms for minimizing the average of
convex functions. The applications include empirical risk minimization problems
in statistical machine learning where the datasets are large and have to be
stored on different machines. We design a distributed stochastic variance
reduced gradient algorithm that, under certain conditions on the condition
number, simultaneously achieves the optimal parallel runtime, amount of
communication and rounds of communication among all distributed firstorder
methods up to constant factors. Our method and its accelerated extension also
outperform existing distributed algorithms in terms of the rounds of
communication as long as the condition number is not too large compared to the
size of data in each machine. We also prove a lower bound for the number of
rounds of communication for a broad class of distributed firstorder methods
including the proposed algorithms in this paper. We show that our accelerated
distributed stochastic variance reduced gradient algorithm achieves this lower
bound so that it uses the fewest rounds of communication among all distributed
firstorder algorithms.

Generative models for deep learning are promising both to improve
understanding of the model, and yield training methods requiring fewer labeled
samples.
Recent works use generative model approaches to produce the deep net's input
given the value of a hidden layer several levels above. However, there is no
accompanying "proof of correctness" for the generative model, showing that the
feedforward deep net is the correct inference method for recovering the hidden
layer given the input. Furthermore, these models are complicated.
The current paper takes a more theoretical tack. It presents a very simple
generative model for RELU deep nets, with the following characteristics: (i)
The generative model is just the reverse of the feedforward net: if the forward
transformation at a layer is $A$ then the reverse transformation is $A^T$.
(This can be seen as an explanation of the old weight tying idea for denoising
autoencoders.) (ii) Its correctness can be proven under a clean theoretical
assumption: the edge weights in reallife deep nets behave like random numbers.
Under this assumption which is experimentally tested on reallife nets like
AlexNet it is formally proved that feed forward net is a correct inference
method for recovering the hidden layer.
The generative model suggests a simple modification for training: use the
generative model to produce synthetic data with labels and include it in the
training set. Experiments are shown to support this theory of randomlike deep
nets; and that it helps the training.

This paper establishes a statistical versus computational tradeoff for
solving a basic highdimensional machine learning problem via a basic convex
relaxation method. Specifically, we consider the {\em Sparse Principal
Component Analysis} (Sparse PCA) problem, and the family of {\em
SumofSquares} (SoS, aka Lasserre/Parillo) convex relaxations. It was well
known that in large dimension $p$, a planted $k$sparse unit vector can be {\em
in principle} detected using only $n \approx k\log p$ (Gaussian or Bernoulli)
samples, but all {\em efficient} (polynomial time) algorithms known require $n
\approx k^2$ samples. It was also known that this quadratic gap cannot be
improved by the the most basic {\em semidefinite} (SDP, aka spectral)
relaxation, equivalent to a degree2 SoS algorithms. Here we prove that also
degree4 SoS algorithms cannot improve this quadratic gap. This averagecase
lower bound adds to the small collection of hardness results in machine
learning for this powerful family of convex relaxation algorithms. Moreover,
our design of moments (or "pseudoexpectations") for this lower bound is quite
different than previous lower bounds. Establishing lower bounds for higher
degree SoS algorithms for remains a challenging problem.

Tensor rank and lowrank tensor decompositions have many applications in
learning and complexity theory. Most known algorithms use unfoldings of tensors
and can only handle rank up to $n^{\lfloor p/2 \rfloor}$ for a $p$th order
tensor in $\mathbb{R}^{n^p}$. Previously no efficient algorithm can decompose
3rd order tensors when the rank is superlinear in the dimension. Using ideas
from sumofsquares hierarchy, we give the first quasipolynomial time
algorithm that can decompose a random 3rd order tensor decomposition when the
rank is as large as $n^{3/2}/\textrm{polylog} n$.
We also give a polynomial time algorithm for certifying the injective norm of
random low rank tensors. Our tensor decomposition algorithm exploits the
relationship between injective norm and the tensor components. The proof relies
on interesting tools for decoupling random variables to prove better matrix
concentration bounds, which can be useful in other settings.

Sparse coding is a basic task in many fields including signal processing,
neuroscience and machine learning where the goal is to learn a basis that
enables a sparse representation of a given set of data, if one exists. Its
standard formulation is as a nonconvex optimization problem which is solved in
practice by heuristics based on alternating minimization. Re cent work has
resulted in several algorithms for sparse coding with provable guarantees, but
somewhat surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for understanding
alternating minimization which we leverage to analyze existing heuristics and
to design new ones also with provable guarantees. Some of these algorithms seem
implementable on simple neural architectures, which was the original motivation
of Olshausen and Field (1997a) in introducing sparse coding. We also give the
first efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent dictionaries. All
previous algorithms that approached or surpassed this limit run in time
exponential in some natural parameter. Finally, our algorithms improve upon the
sample complexity of existing approaches. We believe that our analysis
framework will have applications in other settings where simple iterative
algorithms are used.

We explore the connection between dimensionality and communication cost in
distributed learning problems. Specifically we study the problem of estimating
the mean $\vec{\theta}$ of an unknown $d$ dimensional gaussian distribution in
the distributed setting. In this problem, the samples from the unknown
distribution are distributed among $m$ different machines. The goal is to
estimate the mean $\vec{\theta}$ at the optimal minimax rate while
communicating as few bits as possible. We show that in this setting, the
communication cost scales linearly in the number of dimensions i.e. one needs
to deal with different dimensions individually. Applying this result to
previous lower bounds for one dimension in the interactive setting
\cite{ZDJW13} and to our improved bounds for the simultaneous setting, we prove
new lower bounds of $\Omega(md/\log(m))$ and $\Omega(md)$ for the bits of
communication needed to achieve the minimax squared loss, in the interactive
and simultaneous settings respectively. To complement, we also demonstrate an
interactive protocol achieving the minimax squared loss with $O(md)$ bits of
communication, which improves upon the simple simultaneous protocol by a
logarithmic factor. Given the strong lower bounds in the general setting, we
initiate the study of the distributed parameter estimation problems with
structured parameters. Specifically, when the parameter is promised to be
$s$sparse, we show a simple thresholding based protocol that achieves the same
squared loss while saving a $d/s$ factor of communication. We conjecture that
the tradeoff between communication and squared loss demonstrated by this
protocol is essentially optimal up to logarithmic factor.

In dictionary learning, also known as sparse coding, the algorithm is given
samples of the form $y = Ax$ where $x\in \mathbb{R}^m$ is an unknown random
sparse vector and $A$ is an unknown dictionary matrix in $\mathbb{R}^{n\times
m}$ (usually $m > n$, which is the overcomplete case). The goal is to learn $A$
and $x$. This problem has been studied in neuroscience, machine learning,
visions, and image processing. In practice it is solved by heuristic algorithms
and provable algorithms seemed hard to find. Recently, provable algorithms were
found that work if the unknown feature vector $x$ is $\sqrt{n}$sparse or even
sparser. Spielman et al. \cite{DBLP:journals/jmlr/SpielmanWW12} did this for
dictionaries where $m=n$; Arora et al. \cite{AGM} gave an algorithm for
overcomplete ($m >n$) and incoherent matrices $A$; and Agarwal et al.
\cite{DBLP:journals/corr/AgarwalAN13} handled a similar case but with weaker
guarantees.
This raised the problem of designing provable algorithms that allow sparsity
$\gg \sqrt{n}$ in the hidden vector $x$. The current paper designs algorithms
that allow sparsity up to $n/poly(\log n)$. It works for a class of matrices
where features are individually recoverable, a new notion identified in this
paper that may motivate further work.
The algorithm runs in quasipolynomial time because they use limited
enumeration.

We give algorithms with provable guarantees that learn a class of deep nets
in the generative model view popularized by Hinton and others. Our generative
model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$
for some $\gamma <1$ and each edge has a random edge weight in $[1,1]$. Our
algorithm learns {\em almost all} networks in this class with polynomial
running time. The sample complexity is quadratic or cubic depending upon the
details of the model.
The algorithm uses layerwise learning. It is based upon a novel idea of
observing correlations among features and using these to infer the underlying
edge structure via a global graph recovery procedure. The analysis of the
algorithm reveals interesting structure of neural networks with random edge
weights.

We study the matroid secretary problems with submodular valuation functions.
In these problems, the elements arrive in random order. When one element
arrives, we have to make an immediate and irrevocable decision on whether to
accept it or not. The set of accepted elements must form an {\em independent
set} in a predefined matroid. Our objective is to maximize the value of the
accepted elements. In this paper, we focus on the case that the valuation
function is a nonnegative and monotonically nondecreasing submodular
function.
We introduce a general algorithm for such {\em submodular matroid secretary
problems}. In particular, we obtain constant competitive algorithms for the
cases of laminar matroids and transversal matroids. Our algorithms can be
further applied to any independent set system defined by the intersection of a
{\em constant} number of laminar matroids, while still achieving constant
competitive ratios. Notice that laminar matroids generalize uniform matroids
and partition matroids.
On the other hand, when the underlying valuation function is linear, our
algorithm achieves a competitive ratio of 9.6 for laminar matroids, which
significantly improves the previous result.

Motivated by a hat guessing problem proposed by Iwasawa \cite{Iwasawa10},
Butler and Graham \cite{Butler11} made the following conjecture on the
existence of certain way of marking the {\em coordinate lines} in $[k]^n$:
there exists a way to mark one point on each {\em coordinate line} in $[k]^n$,
so that every point in $[k]^n$ is marked exactly $a$ or $b$ times as long as
the parameters $(a,b,n,k)$ satisfies that there are nonnegative integers $s$
and $t$ such that $s+t = k^n$ and $as+bt = nk^{n1}$. In this paper we prove
this conjecture for any prime number $k$. Moreover, we prove the conjecture for
the case when $a=0$ for general $k$.