
Given a nonconvex function that is an average of $n$ smooth functions, we
design stochastic firstorder methods to find its approximate stationary
points. The convergence of our new methods depends on the smallest (negative)
eigenvalue $\sigma$ of the Hessian, a parameter that describes how nonconvex
the function is.
Our methods outperform known results for a range of parameter $\sigma$, and
can be used to find approximate local minima. Our result implies an interesting
dichotomy: there exists a threshold $\sigma_0$ so that the currently fastest
methods for $\sigma>\sigma_0$ and for $\sigma<\sigma_0$ have different
behaviors: the former scales with $n^{2/3}$ and the latter scales with
$n^{3/4}$.

Nesterov's momentum trick is famously known for accelerating gradient
descent, and has been proven useful in building fast iterative algorithms.
However, in the stochastic setting, counterexamples exist and prevent
Nesterov's momentum from providing similar acceleration, even if the underlying
problem is convex and finitesum.
We introduce $\mathtt{Katyusha}$, a direct, primalonly stochastic gradient
method to fix this issue. In convex finitesum stochastic optimization,
$\mathtt{Katyusha}$ has an optimal accelerated convergence rate, and enjoys an
optimal parallel linear speedup in the minibatch setting.
The main ingredient is $\textit{Katyusha momentum}$, a novel "negative
momentum" on top of Nesterov's momentum. It can be incorporated into a
variancereduction based algorithm and speed it up, both in terms of
$\textit{sequential and parallel}$ performance. Since variance reduction has
been successfully applied to a growing list of practical problems, our paper
suggests that in each of such cases, one could potentially try to give Katyusha
a hug.

We design a stochastic algorithm to train any smooth neural network to
$\varepsilon$approximate local minima, using $O(\varepsilon^{3.25})$
backpropagations. The best result was essentially $O(\varepsilon^{4})$ by SGD.
More broadly, it finds $\varepsilon$approximate local minima of any smooth
nonconvex function in rate $O(\varepsilon^{3.25})$, with only oracle access to
stochastic gradients.

We propose a reduction for nonconvex optimization that can (1) turn an
stationarypoint finding algorithm into an localminimum finding one, and (2)
replace the Hessianvector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a firstorder method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some best known
results.

We propose a new secondorder method for geodesically convex optimization on
the natural hyperbolic metric over positive definite matrices. We apply it to
solve the operator scaling problem in time polynomial in the input size and
logarithmic in the error. This is an exponential improvement over previous
algorithms which were analyzed in the usual Euclidean, "commutative" metric
(for which the above problem is not convex). Our method is general and
applicable to other settings.
As a consequence, we solve the equivalence problem for the leftright group
action underlying the operator scaling problem. This yields a deterministic
polynomialtime algorithm for a new class of Polynomial Identity Testing (PIT)
problems, which was the original motivation for studying operator scaling.

This paper studies the problem of distributed stochastic optimization in an
adversarial setting where, out of the $m$ machines which allegedly compute
stochastic gradients every iteration, an $\alpha$fraction are Byzantine, and
can behave arbitrarily and adversarially. Our main result is a variant of
stochastic gradient descent (SGD) which finds $\varepsilon$approximate
minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m}
+ \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional
minibatch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations,
but cannot tolerate Byzantine failures. Further, we provide a lower bound
showing that, up to logarithmic factors, our algorithm is
informationtheoretically optimal both in terms of sampling complexity and time
complexity.

Packing and covering linear programs (PCLPs) form an important class of
linear programs (LPs) across computer science, operations research, and
optimization. In 1993, Luby and Nisan constructed an iterative algorithm for
approximately solving PCLPs in nearly linear time, where the time complexity
scales nearly linearly in $N$, the number of nonzero entries of the matrix, and
polynomially in $\varepsilon$, the (multiplicative) approximation error.
Unfortunately, all existing nearly lineartime algorithms for solving PCLPs
require time at least proportional to $\varepsilon^{2}$.
In this paper, we break this longstanding barrier by designing a packing
solver that runs in time $\tilde{O}(N \varepsilon^{1})$ and covering LP solver
that runs in time $\tilde{O}(N \varepsilon^{1.5})$. Our packing solver can be
extended to run in time $\tilde{O}(N \varepsilon^{1})$ for a class of
wellbehaved covering programs. In a followup work, Wang et al. showed that
all covering LPs can be converted into wellbehaved ones by a reduction that
blows up the problem size only logarithmically.
At high level, these two algorithms can be described as linear couplings of
several firstorder descent steps. This is an application of our linear
coupling technique to problems that are not amenable to blackbox applications
known iterative algorithms in convex optimization.

The problem of minimizing sumofnonconvex functions (i.e., convex functions
that are average of nonconvex ones) is becoming increasingly important in
machine learning, and is the core machinery for PCA, SVD, regularized Newton's
method, accelerated nonconvex optimization, and more.
We show how to provably obtain an accelerated stochastic algorithm for
minimizing sumofnonconvex functions, by $\textit{adding one additional line}$
to the wellknown SVRG method. This line corresponds to momentum, and shows how
to directly apply momentum to the finitesum stochastic minimization of
sumofnonconvex functions. As a side result, our method enjoys linear parallel
speedup using minibatch.

Regret bounds in online learning compare the player's performance to $L^*$,
the optimal performance in hindsight with a fixed strategy. Typically such
bounds scale with the square root of the time horizon $T$. The more refined
concept of firstorder regret bound replaces this with a scaling $\sqrt{L^*}$,
which may be much smaller than $\sqrt{T}$. It is well known that minor variants
of standard algorithms satisfy firstorder regret bounds in the full
information and multiarmed bandit settings. In a COLT 2017 open problem,
Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that
existing techniques do not seem sufficient to obtain firstorder regret bounds
for the contextual bandit problem. In the present paper, we resolve this open
problem by presenting a new strategy based on augmenting the policy space.

In convex stochastic optimization, convergence rates in terms of minimizing
the objective have been wellestablished. However, in terms of making the
gradients small, the best known convergence rate was $O(\varepsilon^{8/3})$
and it was left open how to improve it.
In this paper, we improve this rate to $\tilde{O}(\varepsilon^{2})$, which
is optimal up to log factors.

We propose a rank$k$ variant of the classical FrankWolfe algorithm to solve
convex optimization over a tracenorm ball. Our algorithm replaces the top
singularvector computation ($1$SVD) in FrankWolfe with a top$k$
singularvector computation ($k$SVD), which can be done by repeatedly applying
$1$SVD $k$ times. Alternatively, our algorithm can be viewed as a rank$k$
restricted version of projected gradient descent. We show that our algorithm
has a linear convergence rate when the objective function is smooth and
strongly convex, and the optimal solution has rank at most $k$. This improves
the convergence rate and the total time complexity of the FrankWolfe method
and its variants.

The online problem of computing the top eigenvector is fundamental to machine
learning. In both adversarial and stochastic settings, previous results (such
as matrix multiplicative weight update, follow the regularized leader, follow
the compressed leader, block power method) either achieve optimal regret but
run slow, or run fast at the expense of loosing a $\sqrt{d}$ factor in total
regret where $d$ is the matrix dimension.
We propose a $\textit{followthecompressedleader (FTCL)}$ framework which
achieves optimal regret without sacrificing the running time. Our idea is to
"compress" the matrix strategy to dimension 3 in the adversarial setting, or
dimension 1 in the stochastic setting. These respectively resolve two open
questions regarding the design of optimal and efficient algorithms for the
online eigenvector problem.

We design a nonconvex secondorder optimization algorithm that is guaranteed
to return an approximate local minimum in time which scales linearly in the
underlying dimension and the number of training examples. The time complexity
of our algorithm to find an approximate local minimum is even faster than that
of gradient descent to find a critical point. Our algorithm applies to a
general class of optimization problems including training a neural network and
other nonconvex objectives arising in machine learning.

We solve principal component regression (PCR), up to a multiplicative
accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{1})$
blackbox calls of ridge regression. Therefore, our algorithm does not require
any explicit construction of the top principal components, and is suitable for
largescale PCR instances. In contrast, previous result requires
$\tilde{O}(\gamma^{2})$ such blackbox calls.
We obtain this result by developing a general stable recurrence formula for
matrix Chebyshev polynomials, and a degreeoptimal polynomial approximation to
the matrix sign function. Our techniques may be of independent interests,
especially when designing iterative methods.

We study streaming principal component analysis (PCA), that is to find, in
$O(dk)$ space, the top $k$ eigenvectors of a $d\times d$ hidden matrix $\bf
\Sigma$ with online vectors drawn from covariance matrix $\bf \Sigma$.
We provide $\textit{global}$ convergence for Oja's algorithm which is
popularly used in practice but lacks theoretical understanding for $k>1$. We
also provide a modified variant $\mathsf{Oja}^{++}$ that runs $\textit{even
faster}$ than Oja's. Our results match the information theoretic lower bound in
terms of dependency on error, on eigengap, on rank $k$, and on dimension $d$,
up to polylog factors. In addition, our convergence rate can be made gapfree,
that is proportional to the approximation error and independent of the
eigengap.
In contrast, for general rank $k$, before our work (1) it was open to design
any algorithm with efficient global convergence rate; and (2) it was open to
design any algorithm with (even local) gapfree convergence rate in $O(dk)$
space.

We develop several efficient algorithms for the classical \emph{Matrix
Scaling} problem, which is used in many diverse areas, from preconditioning
linear systems to approximation of the permanent. On an input $n\times n$
matrix $A$, this problem asks to find diagonal (scaling) matrices $X$ and $Y$
(if they exist), so that $X A Y$ $\varepsilon$approximates a doubly
stochastic, or more generally a matrix with prescribed row and column sums.
We address the general scaling problem as well as some important special
cases. In particular, if $A$ has $m$ nonzero entries, and if there exist $X$
and $Y$ with polynomially large entries such that $X A Y$ is doubly stochastic,
then we can solve the problem in total complexity $\tilde{O}(m + n^{4/3})$.
This greatly improves on the best known previous results, which were either
$\tilde{O}(n^4)$ or $O(m n^{1/2}/\varepsilon)$.
Our algorithms are based on tailormade first and second order techniques,
combined with other recent advances in continuous optimization, which may be of
independent interest for solving similar problems.

We study $k$SVD that is to obtain the first $k$ singular vectors of a matrix
$A$. Recently, a few breakthroughs have been discovered on $k$SVD: Musco and
Musco [1] proved the first gapfree convergence result using the block Krylov
method, Shamir [2] discovered the first variancereduction stochastic method,
and Bhojanapalli et al. [3] provided the fastest $O(\mathsf{nnz}(A) +
\mathsf{poly}(1/\varepsilon))$time algorithm using alternating minimization.
In this paper, we put forward a new and simple LazySVD framework to improve
the above breakthroughs. This framework leads to a faster gapfree method
outperforming [1], and the first accelerated and stochastic method
outperforming [2]. In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$
runningtime regime, LazySVD outperforms [3] in certain parameter regimes
without even using alternating minimization.

We study $k$GenEV, the problem of finding the top $k$ generalized
eigenvectors, and $k$CCA, the problem of finding the top $k$ vectors in
canonicalcorrelation analysis. We propose algorithms $\mathtt{LazyEV}$ and
$\mathtt{LazyCCA}$ to solve the two problems with running times linearly
dependent on the input size and on $k$.
Furthermore, our algorithms are DOUBLYACCELERATED: our running times depend
only on the square root of the matrix condition number, and on the square root
of the eigengap. This is the first such result for both $k$GenEV or $k$CCA.
We also provide the first gapfree results, which provide running times that
depend on $1/\sqrt{\varepsilon}$ rather than the eigengap.

Positive linear programs (LP), also known as packing and covering linear
programs, are an important class of problems that bridges computer science,
operations research, and optimization. Despite the consistent efforts on this
problem, all known nearlylineartime algorithms require
$\tilde{O}(\varepsilon^{4})$ iterations to converge to $1\pm \varepsilon$
approximate solutions. This $\varepsilon^{4}$ dependence has not been improved
since 1993, and limits the performance of parallel implementations for such
algorithms. Moreover, previous algorithms and their analyses rely on update
steps and convergence arguments that are combinatorial in nature and do not
seem to arise naturally from an optimization viewpoint.
In this paper, we leverage new insights from optimization theory to construct
a novel algorithm that breaks the longstanding $\varepsilon^{4}$ barrier. Our
algorithm has a simple analysis and a clear motivation. Our work introduces a
number of novel techniques, such as the combined application of gradient
descent and mirror descent, and a truncated, smoothed version of the standard
multiplicative weight update, which may be of independent interest.

Firstorder methods play a central role in largescale machine learning. Even
though many variations exist, each suited to a particular problem, almost all
such methods fundamentally rely on two types of algorithmic steps: gradient
descent, which yields primal progress, and mirror descent, which yields dual
progress.
We observe that the performances of gradient and mirror descent are
complementary, so that faster algorithms can be designed by LINEARLY COUPLING
the two. We show how to reconstruct Nesterov's accelerated gradient methods
using linear coupling, which gives a cleaner interpretation than Nesterov's
original proofs. We also discuss the power of linear coupling by extending it
to many other settings that Nesterov's methods cannot apply to.

The amount of data available in the world is growing faster than our ability
to deal with it. However, if we take advantage of the internal
\emph{structure}, data may become much smaller for machine learning purposes.
In this paper we focus on one of the fundamental machine learning tasks,
empirical risk minimization (ERM), and provide faster algorithms with the help
from the clustering structure of the data.
We introduce a simple notion of raw clustering that can be efficiently
computed from the data, and propose two algorithms based on clustering
information. Our accelerated algorithm ClusterACDM is built on a novel Haar
transformation applied to the dual space of the ERM problem, and our
variancereduction based algorithm ClusterSVRG introduces a new gradient
estimator using clustering. Our algorithms outperform their classical
counterparts ACDM and SVRG respectively.

We consider the fundamental problem in nonconvex optimization of efficiently
reaching a stationary point. In contrast to the convex case, in the long
history of this basic problem, the only known theoretical results on
firstorder nonconvex optimization remain to be full gradient descent that
converges in $O(1/\varepsilon)$ iterations for smooth objectives, and
stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations
for objectives that are sum of smooth functions.
We provide the first improvement in this line of research. Our result is
based on the variance reduction trick recently introduced to convex
optimization, as well as a brand new analysis of variance reduction that is
suitable for nonconvex optimization. For objectives that are sum of smooth
functions, our firstorder minibatch stochastic method converges with an
$O(1/\varepsilon)$ rate, and is faster than full gradient descent by
$\Omega(n^{1/3})$.
We demonstrate the effectiveness of our methods on empirical risk
minimizations with nonconvex loss functions and training neural nets.

Accelerated coordinate descent is widely used in optimization due to its
cheap periteration cost and scalability to largescale problems. Up to a
primaldual transformation, it is also the same as accelerated stochastic
gradient descent that is one of the central methods used in machine learning.
In this paper, we improve the best known running time of accelerated
coordinate descent by a factor up to $\sqrt{n}$. Our improvement is based on a
clean, novel nonuniform sampling that selects each coordinate with a
probability proportional to the square root of its smoothness parameter. Our
proof technique also deviates from the classical estimation sequence technique
used in prior work. Our speedup applies to important problems such as
empirical risk minimization and solving linear systems, both in theory and in
practice.

Many classical algorithms are found until several years later to outlive the
confines in which they were conceived, and continue to be relevant in
unforeseen settings. In this paper, we show that SVRG is one such method: being
originally designed for strongly convex objectives, it is also very robust in
nonstrongly convex or sumofnonconvex settings.
More precisely, we provide new analysis to improve the stateoftheart
running times in both settings by either applying SVRG or its novel variant.
Since nonstrongly convex objectives include important examples such as Lasso
or logistic regression, and sumofnonconvex objectives include famous
examples such as stochastic PCA and is even believed to be related to training
deep neural nets, our results also imply better performances in these
applications.

The diverse world of machine learning applications has given rise to a
plethora of algorithms and optimization methods, finely tuned to the specific
regression or classification task at hand. We reduce the complexity of
algorithm design for machine learning by reductions: we develop reductions that
take a method developed for one setting and apply it to the entire spectrum of
smoothness and strongconvexity in applications.
Furthermore, unlike existing results, our new reductions are OPTIMAL and more
PRACTICAL. We show how these new reductions give rise to new and faster running
times on training linear classifiers for various families of loss functions,
and conclude with experiments showing their successes also in practice.