
Matrix completion is a basic machine learning problem that has wide
applications, especially in collaborative filtering and recommender systems.
Simple nonconvex optimization algorithms are popular and effective in
practice. Despite recent progress in proving various nonconvex algorithms
converge from a good initial point, it remains unclear why random or arbitrary
initialization suffices in practice. We prove that the commonly used nonconvex
objective function for \textit{positive semidefinite} matrix completion has no
spurious local minima  all local minima must also be global. Therefore, many
popular optimization algorithms such as (stochastic) gradient descent can
provably solve positive semidefinite matrix completion with \textit{arbitrary}
initialization in polynomial time. The result can be generalized to the setting
when the observed entries contain noise. We believe that our main proof
strategy can be useful for understanding geometric properties of other
statistical problems involving partial or noisy observations.

Matrix completion is a wellstudied problem with many machine learning
applications. In practice, the problem is often solved by nonconvex
optimization algorithms. However, the current theoretical analysis for
nonconvex algorithms relies heavily on the assumption that every entry is
observed with exactly the same probability $p$, which is not realistic in
practice.
In this paper, we investigate a more realistic semirandom model, where the
probability of observing each entry is at least $p$. Even with this mild
semirandom perturbation, we can construct counterexamples where existing
nonconvex algorithms get stuck in bad local optima.
In light of the negative results, we propose a preprocessing step that tries
to reweight the semirandom input, so that it becomes "similar" to a random
input. We give a nearlylinear time algorithm for this problem, and show that
after our preprocessing, all the local minima of the nonconvex objective can
be used to approximately recover the underlying groundtruth matrix.

Population riskthe expectation of the loss over the sampling
mechanismis always of primary interest in machine learning. However,
learning algorithms only have access to empirical risk, which is the average
loss over training examples. Although the two risks are typically guaranteed to
be pointwise close, for applications with nonconvex nonsmooth losses (such as
modern deep networks), the effects of sampling can transform a wellbehaved
population risk into an empirical risk with a landscape that is problematic for
optimization. The empirical risk can be nonsmooth, and it may have many
additional local minima.
This paper considers a general optimization framework which aims to find
approximate local minima of a smooth nonconvex function $F$ (population risk)
given only access to the function value of another function $f$ (empirical
risk), which is pointwise close to $F$ (i.e., $\Ff\_{\infty} \le \nu$). We
propose a simple algorithm based on stochastic gradient descent (SGD) on a
smoothed version of $f$ which is guaranteed to find an $\epsilon$secondorder
stationary point if $\nu \le O(\epsilon^{1.5}/d)$, thus escaping all saddle
points of $F$ and all the additional local minima introduced by $f$. We also
provide an almost matching lower bound showing that our SGDbased approach
achieves the optimal tradeoff between $\nu$ and $\epsilon$, as well as the
optimal dependence on problem dimension $d$, among all algorithms making a
polynomial number of queries. As a concrete example, we show that our results
can be directly used to give sample complexities for learning a ReLU unit,
whose empirical risk is nonsmooth.

Deep nets generalize well despite having more parameters than the number of
training samples. Recent works try to give an explanation using PACBayes and
Marginbased analyses, but do not as yet result in sample complexity bounds
better than naive parameter counting. The current paper shows generalization
bounds that're orders of magnitude better in practice. These rely upon new
succinct reparametrizations of the trained net  a compression that is
explicit and efficient. These yield generalization bounds via a simple
compressionbased framework introduced here. Our results also provide some
theoretical justification for widespread empirical success in compressing deep
nets. Analysis of correctness of our compression relies upon some newly
identified \textquotedblleft noise stability\textquotedblright properties of
trained deep nets, which are also experimentally verified. The study of these
properties and resulting generalization bounds are also extended to
convolutional nets, which had eluded earlier attempts on proving
generalization.

Direct policy gradient methods for reinforcement learning and continuous
control problems are a popular approach for a variety of reasons: 1) they are
easy to implement without explicit knowledge of the underlying model 2) they
are an "endtoend" approach, directly optimizing the performance metric of
interest 3) they inherently allow for richly parameterized policies. A notable
drawback is that even in the most basic continuous control problem (that of
linear quadratic regulators), these methods must solve a nonconvex
optimization problem, where little is understood about their efficiency from
both computational and statistical perspectives. In contrast, system
identification and model based planning in optimal control theory have a much
more solid theoretical footing, where much is known with regards to their
computational and statistical properties. This work bridges this gap showing
that (model free) policy gradient methods globally converge to the optimal
solution and are efficient (polynomially so in relevant problem dependent
quantities) with regards to their sample and computational complexities.

A key task in Bayesian statistics is sampling from distributions that are
only specified up to a partition function (i.e., constant of proportionality).
However, without any assumptions, sampling (even approximately) can be #Phard,
and few works have provided "beyond worstcase" guarantees for such settings.
For logconcave distributions, classical results going back to Bakry and
\'Emery (1985) show that natural continuoustime Markov chains called Langevin
diffusions mix in polynomial time. The most salient feature of logconcavity
violated in practice is unimodality: commonly, the distributions we wish to
sample from are multimodal. In the presence of multiple deep and
wellseparated modes, Langevin diffusion suffers from torpid mixing.
We address this problem by combining Langevin diffusion with simulated
tempering. The result is a Markov chain that mixes more rapidly by
transitioning between different temperatures of the distribution. We analyze
this Markov chain for the canonical multimodal distribution: a mixture of
gaussians (of equal variance). The algorithm based on our Markov chain provably
samples from distributions that are close to mixtures of gaussians, given
access to the gradient of the logpdf. For the analysis, we use a spectral
decomposition theorem for graphs (Gharan and Trevisan, 2014) and a Markov chain
decomposition technique (Madras and Randall, 2002).

In this paper, we introduce the online service with delay problem. In this
problem, there are $n$ points in a metric space that issue service requests
over time, and a server that serves these requests. The goal is to minimize the
sum of distance traveled by the server and the total delay in serving the
requests. This problem models the fundamental tradeoff between batching
requests to improve locality and reducing delay to improve response time, that
has many applications in operations management, operating systems, logistics,
supply chain management, and scheduling.
Our main result is to show a polylogarithmic competitive ratio for the
online service with delay problem. This result is obtained by an algorithm that
we call the preemptive service algorithm. The salient feature of this algorithm
is a process called preemptive service, which uses a novel combination of
(recursive) time forwarding and spatial exploration on a metric space. We hope
this technique will be useful for related problems such as reordering buffer
management, online TSP, vehicle routing, etc. We also generalize our results to
$k > 1$ servers.

We show that training of generative adversarial network (GAN) may not have
good generalization properties; e.g., training may appear successful but the
trained distribution may be far from target distribution in standard metrics.
However, generalization does occur for a weaker metric called neural net
distance. It is also shown that an approximate pure equilibrium exists in the
discriminator/generator game for a special class of generators with natural
training objectives when generator capacity and training set sizes are
moderate.
This existence of equilibrium inspires MIX+GAN protocol, which can be
combined with any existing GAN training, and empirically shown to improve some
of them.

Nonconvex optimization with local search heuristics has been widely used in
machine learning, achieving many stateofart results. It becomes increasingly
important to understand why they can work for these NPhard problems on typical
data. The landscape of many objective functions in learning has been
conjectured to have the geometric property that "all local optima are
(approximately) global optima", and thus they can be solved efficiently by
local search algorithms. However, establishing such property can be very
difficult.
In this paper, we analyze the optimization landscape of the random
overcomplete tensor decomposition problem, which has many applications in
unsupervised learning, especially in learning latent variable models. In
practice, it can be efficiently solved by gradient ascent on a nonconvex
objective. We show that for any small constant $\epsilon > 0$, among the set of
points with function values $(1+\epsilon)$factor larger than the expectation
of the function, all the local maxima are approximate global maxima.
Previously, the bestknown result only characterizes the geometry in small
neighborhoods around the true components. Our result implies that even with an
initialization that is barely better than the random guess, the gradient ascent
algorithm is guaranteed to solve this problem.
Our main technique uses KacRice formula and random matrix theory. To our
best knowledge, this is the first time when KacRice formula is successfully
applied to counting the number of local minima of a highlystructured random
polynomial with dependent coefficients.

Developing efficient and guaranteed nonconvex algorithms has been an
important challenge in modern machine learning. Algorithms with good empirical
performance such as stochastic gradient descent often lack theoretical
guarantees. In this paper, we analyze the class of homotopy or continuation
methods for global optimization of nonconvex functions. These methods start
from an objective function that is efficient to optimize (e.g. convex), and
progressively modify it to obtain the required objective, and the solutions are
passed along the homotopy path. For the challenging problem of tensor PCA, we
prove global convergence of the homotopy method in the "high noise" regime. The
signaltonoise requirement for our algorithm is tight in the sense that it
matches the recovery guarantee for the best degree4 sumofsquares algorithm.
In addition, we prove a phase transition along the homotopy path for tensor
PCA. This allows to simplify the homotopy method to a local search algorithm,
viz., tensor power iterations, with a specific initialization and a noise
injection procedure, while retaining the theoretical guarantees.

Deep neural nets have caused a revolution in many classification tasks. A
related ongoing revolutionalso theoretically not understoodconcerns their
ability to serve as generative models for complicated types of data such as
images and texts. These models are trained using ideas like variational
autoencoders and Generative Adversarial Networks.
We take a first cut at explaining the expressivity of multilayer nets by
giving a sufficient criterion for a function to be approximable by a neural
network with $n$ hidden layers. A key ingredient is Barron's Theorem
\cite{Barron1993}, which gives a Fourier criterion for approximability of a
function by a neural network with 1 hidden layer. We show that a composition of
$n$ functions which satisfy certain Fourier conditions ("Barron functions") can
be approximated by a $n+1$layer neural network.
For probability distributions, this translates into a criterion for a
probability distribution to be approximable in Wasserstein distancea natural
metric on probability distributionsby a neural network applied to a fixed
base distribution (e.g., multivariate gaussian).
Building up recent lower bound work, we also give an example function that
shows that composition of Barron functions is more expressive than Barron
functions alone.

In this paper we develop a new framework that captures the common landscape
underlying the common nonconvex lowrank matrix problems including matrix
sensing, matrix completion and robust PCA. In particular, we show for all above
problems (including asymmetric cases): 1) all local minima are also globally
optimal; 2) no highorder saddle points exists. These results explain why
simple algorithms such as stochastic gradient descent have global converge, and
efficiently optimize these nonconvex objective functions in practice. Our
framework connects and simplifies the existing analyses on optimization
landscapes for matrix sensing and symmetric matrix completion. The framework
naturally leads to new results for asymmetric matrix completion and robust PCA.

This paper shows that a perturbed form of gradient descent converges to a
secondorder stationary point in a number iterations which depends only
polylogarithmically on dimension (i.e., it is almost "dimensionfree"). The
convergence rate of this procedure matches the wellknown convergence rate of
gradient descent to firstorder stationary points, up to log factors. When all
saddle points are nondegenerate, all secondorder stationary points are local
minima, and our result thus shows that perturbed gradient descent can escape
saddle points almost for free. Our results can be directly applied to many
machine learning applications, including deep learning. As a particular
concrete example of such an application, we show that our results can be used
directly to establish sharp global convergence rates for matrix factorization.
Our results rely on a novel characterization of the geometry around saddle
points, which may be of independent interest to the nonconvex optimization
community.

Many machine learning applications use latent variable models to explain
structure in data, whereby visible variables (= coordinates of the given
datapoint) are explained as a probabilistic function of some hidden variables.
Finding parameters with the maximum likelihood is NPhard even in very simple
settings. In recent years, provably efficient algorithms were nevertheless
developed for models with linear structures: topic models, mixture models,
hidden markov models, etc. These algorithms use matrix or tensor decomposition,
and make some reasonable assumptions about the parameters of the underlying
model.
But matrix or tensor decomposition seems of little use when the latent
variable model has nonlinearities. The current paper shows how to make
progress: tensor decomposition is applied for learning the singlelayer {\em
noisy or} network, which is a textbook example of a Bayes net, and used for
example in the classic QMRDT software for diagnosing which disease(s) a
patient may have by observing the symptoms he/she exhibits.
The technical novelty here, which should be useful in other settings in
future, is analysis of tensor decomposition in presence of systematic error
(i.e., where the noise/error is correlated with the signal, and doesn't
decrease as number of samples goes to infinity). This requires rethinking all
steps of tensor decomposition methods from the ground up.
For simplicity our analysis is stated assuming that the network parameters
were chosen from a probability distribution but the method seems more generally
applicable.

In order to boost the performance of dataintensive computing on HPC systems,
inmemory computing frameworks, such as Apache Spark and Flink, use local DRAM
for data storage. Optimizing the memory allocation to data storage is critical
to delivering performance to traditional HPC compute jobs and throughput to
dataintensive applications sharing the HPC resources. Current practices that
statically configure inmemory storage may leave inadequate space for compute
jobs or lose the opportunity to utilize more available space for dataintensive
applications. In this paper, we explore techniques to dynamically adjust
inmemory storage and make the right amount of space for compute jobs. We have
developed a dynamic memory controller, DynIMS, which infers memory demands of
compute tasks online and employs a feedbackbased control model to adapt the
capacity of inmemory storage. We test DynIMS using mixed HPCC and Spark
workloads on a HPC cluster. Experimental results show that DynIMS can achieve
up to 5X performance improvement compared to systems with static memory
allocations.

Recently, there has been considerable progress on designing algorithms with
provable guarantees  typically using linear algebraic methods  for
parameter learning in latent variable models. But designing provable algorithms
for inference has proven to be more challenging. Here we take a first step
towards provable inference in topic models. We leverage a property of topic
models that enables us to construct simple linear estimators for the unknown
topic proportions that have small variance, and consequently can work with
short documents. Our estimators also correspond to finding an estimate around
which the posterior is wellconcentrated. We show lower bounds that for shorter
documents it can be information theoretically impossible to find the hidden
topics. Finally, we give empirical results that demonstrate that our algorithm
works on realistic topic models. It yields good solutions on synthetic data and
runs in time comparable to a {\em single} iteration of Gibbs sampling.

This paper considers the problem of canonicalcorrelation analysis (CCA)
(Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a
pair of symmetric matrices. These are two fundamental problems in data analysis
and scientific computing with numerous applications in machine learning and
statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009).
We provide simple iterative algorithms, with improved runtimes, for solving
these problems that are globally linearly convergent with moderate dependencies
on the condition numbers and eigenvalue gaps of the matrices involved.
We obtain our results by reducing CCA to the top$k$ generalized eigenvector
problem. We solve this problem through a general framework that simply requires
black box access to an approximate linear system solver. Instantiating this
framework with accelerated gradient descent we obtain a running time of
$O(\frac{z k \sqrt{\kappa}}{\rho} \log(1/\epsilon) \log
\left(k\kappa/\rho\right))$ where $z$ is the total number of nonzero entries,
$\kappa$ is the condition number and $\rho$ is the relative eigenvalue gap of
the appropriate matrices.
Our algorithm is linear in the input size and the number of components $k$ up
to a $\log(k)$ factor. This is essential for handling largescale matrices that
appear in practice. To the best of our knowledge this is the first such
algorithm with global linear convergence. We hope that our results prompt
further research and ultimately improve the practical running time for
performing these important data analysis procedures on large data sets.

Local search heuristics for nonconvex optimizations are popular in applied
machine learning. However, in general it is hard to guarantee that such
algorithms even converge to a local minimum, due to the existence of
complicated saddle point structures in high dimensions. Many functions have
degenerate saddle points such that the first and second order derivatives
cannot distinguish them with local optima. In this paper we use higher order
derivatives to escape these saddle points: we design the first efficient
algorithm guaranteed to converge to a third order local optimum (while existing
techniques are at most second order). We also show that it is NPhard to extend
this further to finding fourth order local optima.

Consider a stationary discrete random process with alphabet size d, which is
assumed to be the output process of an unknown stationary Hidden Markov Model
(HMM). Given the joint probabilities of finite length strings of the process,
we are interested in finding a finite state generative model to describe the
entire process. In particular, we focus on two classes of models: HMMs and
quasiHMMs, which is a strictly larger class of models containing HMMs. In the
main theorem, we show that if the random process is generated by an HMM of
order less or equal than k, and whose transition and observation probability
matrix are in general position, namely almost everywhere on the parameter
space, both the minimal quasiHMM realization and the minimal HMM realization
can be efficiently computed based on the joint probabilities of all the length
N strings, for N > 4 lceil log_d(k) rceil +1. In this paper, we also aim to
compare and connect the two lines of literature: realization theory of HMMs,
and the recent development in learning latent variable models with tensor
decomposition techniques.

Dataintensive computing has become one of the major workloads on traditional
highperformance computing (HPC) clusters. Currently, deploying dataintensive
computing software framework on HPC clusters still faces performance and
scalability issues. In this paper, we develop a new twolevel storage system by
integrating Tachyon, an inmemory file system with OrangeFS, a parallel file
system. We model the I/O throughputs of four storage structures: HDFS,
OrangeFS, Tachyon and twolevel storage. We conduct computational experiments
to characterize I/O throughput behavior of twolevel storage and compare its
performance to that of HDFS and OrangeFS, using TeraSort benchmark. Theoretical
models and experimental tests both show that the twolevel storage system can
increase the aggregate I/O throughputs. This work lays a solid foundation for
future work in designing and building HPC systems that can provide a better
support on I/O intensive workloads with preserving existing computing
resources.

We present a novel analysis of the dynamics of tensor power iterations in the
overcomplete regime where the tensor CP rank is larger than the input
dimension. Finding the CP decomposition of an overcomplete tensor is NPhard in
general. We consider the case where the tensor components are randomly drawn,
and show that the simple power iteration recovers the components with bounded
error under mild initialization conditions. We apply our analysis to
unsupervised learning of latent variable models, such as multiview mixture
models and spherical Gaussian mixtures. Given the third order moment tensor, we
learn the parameters using tensor power iterations. We prove it can correctly
learn the model parameters when the number of hidden components $k$ is much
larger than the data dimension $d$, up to $k = o(d^{1.5})$. We initialize the
power iterations with data samples and prove its success under mild conditions
on the signaltonoise ratio of the samples. Our analysis significantly expands
the class of latent variable models where spectral methods are applicable. Our
analysis also deals with noise in the input tensor leading to sample complexity
result in the application to learning latent variable models.

In many settings, we have multiple data sets (also called views) that capture
different and overlapping aspects of the same phenomenon. We are often
interested in finding patterns that are unique to one or to a subset of the
views. For example, we might have one set of molecular observations and one set
of physiological observations on the same group of individuals, and we want to
quantify molecular patterns that are uncorrelated with physiology. Despite
being a common problem, this is highly challenging when the correlations come
from complex distributions. In this paper, we develop the general framework of
Rich Component Analysis (RCA) to model settings where the observations from
different views are driven by different sets of latent components, and each
component can be a complex, highdimensional distribution. We introduce
algorithms based on cumulant extraction that provably learn each of the
components without having to model the other components. We show how to
integrate RCA with stochastic gradient descent into a metaalgorithm for
learning general models, and demonstrate substantial improvement in accuracy on
several synthetic and real datasets in both supervised and unsupervised tasks.
Our method makes it possible to learn latent variable models when we don't have
samples from the true model but only samples after complex perturbations.

Nonnegative matrix factorization (NMF) is a natural model of admixture and
is widely used in science and engineering. A plethora of algorithms have been
developed to tackle NMF, but due to the nonconvex nature of the problem, there
is little guarantee on how well these methods work. Recently a surge of
research have focused on a very restricted class of NMFs, called separable NMF,
where provably correct algorithms have been developed. In this paper, we
propose the notion of subsetseparable NMF, which substantially generalizes the
property of separability. We show that subsetseparability is a natural
necessary condition for the factorization to be unique or to have minimum
volume. We developed the FaceIntersect algorithm which provably and
efficiently solves subsetseparable NMF under natural conditions, and we prove
that our algorithm is robust to small noise. We explored the performance of
FaceIntersect on simulations and discuss settings where it empirically
outperformed the stateofart methods. Our work is a step towards finding
provably correct algorithms that solve large classes of NMF problems.

We develop a family of accelerated stochastic algorithms that minimize sums
of convex functions. Our algorithms improve upon the fastest running time for
empirical risk minimization (ERM), and in particular linear leastsquares
regression, across a wide range of problem settings. To achieve this, we
establish a framework based on the classical proximal point algorithm. Namely,
we provide several algorithms that reduce the minimization of a strongly convex
function to approximate minimizations of regularizations of the function. Using
these results, we accelerate recent fast stochastic algorithms in a blackbox
fashion. Empirically, we demonstrate that the resulting algorithms exhibit
notions of stability that are advantageous in practice. Both in theory and in
practice, the provided algorithms reap the computational benefits of adding a
large strongly convex regularization term, without incurring a corresponding
bias to the original problem.

Tensor rank and lowrank tensor decompositions have many applications in
learning and complexity theory. Most known algorithms use unfoldings of tensors
and can only handle rank up to $n^{\lfloor p/2 \rfloor}$ for a $p$th order
tensor in $\mathbb{R}^{n^p}$. Previously no efficient algorithm can decompose
3rd order tensors when the rank is superlinear in the dimension. Using ideas
from sumofsquares hierarchy, we give the first quasipolynomial time
algorithm that can decompose a random 3rd order tensor decomposition when the
rank is as large as $n^{3/2}/\textrm{polylog} n$.
We also give a polynomial time algorithm for certifying the injective norm of
random low rank tensors. Our tensor decomposition algorithm exploits the
relationship between injective norm and the tensor components. The proof relies
on interesting tools for decoupling random variables to prove better matrix
concentration bounds, which can be useful in other settings.