
Word embeddings are ubiquitous in NLP and information retrieval, but it is
unclear what they represent when the word is polysemous. Here it is shown that
multiple word senses reside in linear superposition within the word embedding
and simple sparse coding can recover vectors that approximately capture the
senses. The success of our approach, which applies to several embedding
methods, is mathematically explained using a variant of the random walk on
discourses model (Arora et al., 2016). A novel aspect of our technique is that
each extracted word sense is accompanied by one of about 2000 "discourse atoms"
that gives a succinct description of which other words cooccur with that word
sense. Discourse atoms can be of independent interest, and make the method
potentially more useful. Empirical tests are used to verify and support the
theory.

We revisit the question of reducing online learning to approximate
optimization of the offline problem. In this setting, we give two algorithms
with nearoptimal performance in the full information setting: they guarantee
optimal regret and require only polylogarithmically many calls to the
approximation oracle per iteration. Furthermore, these algorithms apply to the
more general improper learning problems. In the bandit setting, our algorithm
also significantly improves the best previously known oracle complexity while
maintaining the same regret.

We propose a reduction for nonconvex optimization that can (1) turn an
stationarypoint finding algorithm into an localminimum finding one, and (2)
replace the Hessianvector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a firstorder method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some best known
results.

We propose a new secondorder method for geodesically convex optimization on
the natural hyperbolic metric over positive definite matrices. We apply it to
solve the operator scaling problem in time polynomial in the input size and
logarithmic in the error. This is an exponential improvement over previous
algorithms which were analyzed in the usual Euclidean, "commutative" metric
(for which the above problem is not convex). Our method is general and
applicable to other settings.
As a consequence, we solve the equivalence problem for the leftright group
action underlying the operator scaling problem. This yields a deterministic
polynomialtime algorithm for a new class of Polynomial Identity Testing (PIT)
problems, which was the original motivation for studying operator scaling.

We show that the gradient descent algorithm provides an implicit
regularization effect in the learning of overparameterized matrix
factorization models and onehiddenlayer neural networks with quadratic
activations. Concretely, we show that given $\tilde{O}(dr^{2})$ random linear
measurements of a rank $r$ positive semidefinite matrix $X^{\star}$, we can
recover $X^{\star}$ by parameterizing it by $UU^\top$ with $U\in \mathbb
R^{d\times d}$ and minimizing the squared loss, even if $r \ll d$. We prove
that starting from a small initialization, gradient descent recovers
$X^{\star}$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results
solve the conjecture of Gunasekar et al.'17 under the restricted isometry
property. The technique can be applied to analyzing neural networks with
onehiddenlayer quadratic activations with some technical modifications.

Mixtures of Linear Regressions (MLR) is an important mixture model with many
applications. In this model, each observation is generated from one of the
several unknown linear regression components, where the identity of the
generated component is also unknown. Previous works either assume strong
assumptions on the data distribution or have high complexity. This paper
proposes a fixed parameter tractable algorithm for the problem under general
conditions, which achieves global convergence and the sample complexity scales
nearly linearly in the dimension. In particular, different from previous works
that require the data to be from the standard Gaussian, the algorithm allows
the data from Gaussians with different covariances. When the conditional number
of the covariances and the number of components are fixed, the algorithm has
nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal
computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the
data space. To the best of our knowledge, this approach provides the first such
recovery guarantee for this general setting.

Stochastic gradient descent (SGD) is widely used in machine learning.
Although being commonly viewed as a fast but not accurate version of gradient
descent (GD), it always finds better solutions than GD for modern neural
networks.
In order to understand this phenomenon, we take an alternative view that SGD
is working on the convolved (thus smoothed) version of the loss function. We
show that, even if the function $f$ has many bad local minima or saddle points,
as long as for every point $x$, the weighted average of the gradients of its
neighborhoods is one point convex with respect to the desired solution $x^*$,
SGD will get close to, and then stay around $x^*$ with constant probability.
More specifically, SGD will not get stuck at "sharp" local minima with small
diameters, as long as the neighborhoods of these regions contain enough
gradient information. The neighborhood size is controlled by step size and
gradient noise.
Our result identifies a set of functions that SGD provably works, which is
much larger than the set of convex functions. Empirically, we observe that the
loss surface of neural networks enjoys nice one point convexity properties
locally, therefore our theorem helps explain why SGD works so well for neural
networks.

Regret bounds in online learning compare the player's performance to $L^*$,
the optimal performance in hindsight with a fixed strategy. Typically such
bounds scale with the square root of the time horizon $T$. The more refined
concept of firstorder regret bound replaces this with a scaling $\sqrt{L^*}$,
which may be much smaller than $\sqrt{T}$. It is well known that minor variants
of standard algorithms satisfy firstorder regret bounds in the full
information and multiarmed bandit settings. In a COLT 2017 open problem,
Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that
existing techniques do not seem sufficient to obtain firstorder regret bounds
for the contextual bandit problem. In the present paper, we resolve this open
problem by presenting a new strategy based on augmenting the policy space.

We propose a rank$k$ variant of the classical FrankWolfe algorithm to solve
convex optimization over a tracenorm ball. Our algorithm replaces the top
singularvector computation ($1$SVD) in FrankWolfe with a top$k$
singularvector computation ($k$SVD), which can be done by repeatedly applying
$1$SVD $k$ times. Alternatively, our algorithm can be viewed as a rank$k$
restricted version of projected gradient descent. We show that our algorithm
has a linear convergence rate when the objective function is smooth and
strongly convex, and the optimal solution has rank at most $k$. This improves
the convergence rate and the total time complexity of the FrankWolfe method
and its variants.

The online problem of computing the top eigenvector is fundamental to machine
learning. In both adversarial and stochastic settings, previous results (such
as matrix multiplicative weight update, follow the regularized leader, follow
the compressed leader, block power method) either achieve optimal regret but
run slow, or run fast at the expense of loosing a $\sqrt{d}$ factor in total
regret where $d$ is the matrix dimension.
We propose a $\textit{followthecompressedleader (FTCL)}$ framework which
achieves optimal regret without sacrificing the running time. Our idea is to
"compress" the matrix strategy to dimension 3 in the adversarial setting, or
dimension 1 in the stochastic setting. These respectively resolve two open
questions regarding the design of optimal and efficient algorithms for the
online eigenvector problem.

We study the active learning problem of top$k$ ranking from multiwise
comparisons under the popular multinomial logit model. Our goal is to identify
the top$k$ items with high probability by adaptively querying sets for
comparisons and observing the noisy output of the most preferred item from each
comparison. To achieve this goal, we design a new active ranking algorithm
without using any information about the underlying items' preference scores. We
also establish a matching lower bound on the sample complexity even when the
set of preference scores is given to the algorithm. These two results together
show that the proposed algorithm is nearly instance optimal (similar to
instance optimal [FLN03], but up to polylog factors). Our work extends the
existing literature on rank aggregation in three directions. First, instead of
studying a static problem with fixed data, we investigate the top$k$ ranking
problem in an active learning setting. Second, we show our algorithm is nearly
instance optimal, which is a much stronger theoretical guarantee. Finally, we
extend the pairwise comparison to the multiwise comparison, which has not been
fully explored in ranking literature.

Nonnegative matrix factorization is a basic tool for decomposing data into
the feature and weight matrices under nonnegativity constraints, and in
practice is often solved in the alternating minimization framework. However, it
is unclear whether such algorithms can recover the groundtruth feature matrix
when the weights for different features are highly correlated, which is common
in applications. This paper proposes a simple and natural alternating gradient
descent based algorithm, and shows that with a mild initialization it provably
recovers the groundtruth in the presence of strong correlations. In most
interesting cases, the correlation can be in the same order as the highest
possible. Our analysis also reveals its several favorable features including
robustness to noise. We complement our theoretical results with empirical
studies on semisynthetic datasets, demonstrating its advantage over several
popular methods in recovering the groundtruth.

In recent years, stochastic gradient descent (SGD) based techniques has
become the standard tools for training neural networks. However, formal
theoretical understanding of why SGD can train neural networks in practice is
largely missing.
In this paper, we make progress on understanding this mystery by providing a
convergence analysis for SGD on a rich subset of twolayer feedforward networks
with ReLU activations. This subset is characterized by a special structure
called "identity mapping". We prove that, if input follows from Gaussian
distribution, with standard $O(1/\sqrt{d})$ initialization of the weights, SGD
converges to the global minimum in polynomial number of steps. Unlike normal
vanilla networks, the "identity mapping" makes our network asymmetric and thus
the global minimum is unique. To complement our theory, we are also able to
show experimentally that multilayer networks with this mapping have better
performance compared with normal vanilla networks.
Our convergence theorem differs from traditional nonconvex optimization
techniques. We show that SGD converges to optimal in "two phases": In phase I,
the gradient points to the wrong direction, however, a potential function $g$
gradually decreases. Then in phase II, SGD enters a nice one point convex
region and converges. We also show that the identity mapping is necessary for
convergence, as it moves the initial point to a better place for optimization.
Experiment verifies our claims.

We solve principal component regression (PCR), up to a multiplicative
accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{1})$
blackbox calls of ridge regression. Therefore, our algorithm does not require
any explicit construction of the top principal components, and is suitable for
largescale PCR instances. In contrast, previous result requires
$\tilde{O}(\gamma^{2})$ such blackbox calls.
We obtain this result by developing a general stable recurrence formula for
matrix Chebyshev polynomials, and a degreeoptimal polynomial approximation to
the matrix sign function. Our techniques may be of independent interests,
especially when designing iterative methods.

We study streaming principal component analysis (PCA), that is to find, in
$O(dk)$ space, the top $k$ eigenvectors of a $d\times d$ hidden matrix $\bf
\Sigma$ with online vectors drawn from covariance matrix $\bf \Sigma$.
We provide $\textit{global}$ convergence for Oja's algorithm which is
popularly used in practice but lacks theoretical understanding for $k>1$. We
also provide a modified variant $\mathsf{Oja}^{++}$ that runs $\textit{even
faster}$ than Oja's. Our results match the information theoretic lower bound in
terms of dependency on error, on eigengap, on rank $k$, and on dimension $d$,
up to polylog factors. In addition, our convergence rate can be made gapfree,
that is proportional to the approximation error and independent of the
eigengap.
In contrast, for general rank $k$, before our work (1) it was open to design
any algorithm with efficient global convergence rate; and (2) it was open to
design any algorithm with (even local) gapfree convergence rate in $O(dk)$
space.

We develop several efficient algorithms for the classical \emph{Matrix
Scaling} problem, which is used in many diverse areas, from preconditioning
linear systems to approximation of the permanent. On an input $n\times n$
matrix $A$, this problem asks to find diagonal (scaling) matrices $X$ and $Y$
(if they exist), so that $X A Y$ $\varepsilon$approximates a doubly
stochastic, or more generally a matrix with prescribed row and column sums.
We address the general scaling problem as well as some important special
cases. In particular, if $A$ has $m$ nonzero entries, and if there exist $X$
and $Y$ with polynomially large entries such that $X A Y$ is doubly stochastic,
then we can solve the problem in total complexity $\tilde{O}(m + n^{4/3})$.
This greatly improves on the best known previous results, which were either
$\tilde{O}(n^4)$ or $O(m n^{1/2}/\varepsilon)$.
Our algorithms are based on tailormade first and second order techniques,
combined with other recent advances in continuous optimization, which may be of
independent interest for solving similar problems.

We study $k$SVD that is to obtain the first $k$ singular vectors of a matrix
$A$. Recently, a few breakthroughs have been discovered on $k$SVD: Musco and
Musco [1] proved the first gapfree convergence result using the block Krylov
method, Shamir [2] discovered the first variancereduction stochastic method,
and Bhojanapalli et al. [3] provided the fastest $O(\mathsf{nnz}(A) +
\mathsf{poly}(1/\varepsilon))$time algorithm using alternating minimization.
In this paper, we put forward a new and simple LazySVD framework to improve
the above breakthroughs. This framework leads to a faster gapfree method
outperforming [1], and the first accelerated and stochastic method
outperforming [2]. In the $O(\mathsf{nnz}(A) + \mathsf{poly}(1/\varepsilon))$
runningtime regime, LazySVD outperforms [3] in certain parameter regimes
without even using alternating minimization.

Many applications require recovering a ground truth lowrank matrix from
noisy observations of the entries, which in practice is typically formulated as
a weighted lowrank approximation problem and solved by nonconvex optimization
heuristics such as alternating minimization. In this paper, we provide provable
recovery guarantee of weighted lowrank via a simple alternating minimization
algorithm. In particular, for a natural class of matrices and weights and
without any assumption on the noise, we bound the spectral norm of the
difference between the recovered matrix and the ground truth, by the spectral
norm of the weighted noise plus an additive error that decreases exponentially
with the number of rounds of alternating minimization, from either
initialization by SVD or, more importantly, random initialization. These
provide the first theoretical results for weighted lowrank via alternating
minimization with nonbinary deterministic weights, significantly generalizing
those for matrix completion, the special case with binary weights, since our
assumptions are similar or weaker than those made in existing works.
Furthermore, this is achieved by a very simple algorithm that improves the
vanilla alternating minimization with a simple clipping step.
The key technical challenge is that under nonbinary deterministic weights,
na\"ive alternating steps will destroy the incoherence and spectral properties
of the intermediate solutions, which are needed for making progress towards the
ground truth. We show that the properties only need to hold in an average sense
and can be achieved by the clipping step.
We further provide an alternating algorithm that uses a whitening step that
keeps the properties via SDP and Rademacher rounding and thus requires weaker
assumptions. This technique can potentially be applied in some other
applications and is of independent interest.

We study $k$GenEV, the problem of finding the top $k$ generalized
eigenvectors, and $k$CCA, the problem of finding the top $k$ vectors in
canonicalcorrelation analysis. We propose algorithms $\mathtt{LazyEV}$ and
$\mathtt{LazyCCA}$ to solve the two problems with running times linearly
dependent on the input size and on $k$.
Furthermore, our algorithms are DOUBLYACCELERATED: our running times depend
only on the square root of the matrix condition number, and on the square root
of the eigengap. This is the first such result for both $k$GenEV or $k$CCA.
We also provide the first gapfree results, which provide running times that
depend on $1/\sqrt{\varepsilon}$ rather than the eigengap.

Nonnegative matrix factorization is a popular tool for decomposing data into
feature and weight matrices under nonnegativity constraints. It enjoys
practical success but is poorly understood theoretically. This paper proposes
an algorithm that alternates between decoding the weights and updating the
features, and shows that assuming a generative model of the data, it provably
recovers the groundtruth under fairly mild conditions. In particular, its only
essential requirement on features is linear independence. Furthermore, the
algorithm uses ReLU to exploit the nonnegativity for decoding the weights, and
thus can tolerate adversarial noise that can potentially be as large as the
signal, and can tolerate unbiased noise much larger than the signal. The
analysis relies on a carefully designed coupling between two potential
functions, which we believe is of independent interest.

Semantic word embeddings represent the meaning of a word via a vector, and
are created by diverse methods. Many use nonlinear operations on cooccurrence
statistics, and have handtuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the
loglinear topic model of~\citet{mnih2007three}. The methodological novelty is
to use the prior to compute closed form expressions for word statistics. This
provides a theoretical justification for nonlinear models like PMI, word2vec,
and GloVe, as well as some hyperparameter choices. It also helps explain why
lowdimensional semantic embeddings contain linear algebraic structure that
allows solution of word analogies, as shown by~\citet{mikolov2013efficient} and
many subsequent papers.
Experimental support is provided for the generative model assumptions, the
most important of which is that latent word vectors are fairly uniformly
dispersed in space.

The well known maximumentropy principle due to Jaynes, which states that
given mean parameters, the maximum entropy distribution matching them is in an
exponential family, has been very popular in machine learning due to its
"Occam's razor" interpretation. Unfortunately, calculating the potentials in
the maximumentropy distribution is intractable \cite{bresler2014hardness}. We
provide computationally efficient versions of this principle when the mean
parameters are pairwise moments: we design distributions that approximately
match given pairwise moments, while having entropy which is comparable to the
maximum entropy distribution matching those moments.
We additionally provide surprising applications of the approximate maximum
entropy principle to designing provable variational methods for partition
function calculations for Ising models without any assumptions on the
potentials of the model. More precisely, we show that in every temperature, we
can get approximation guarantees for the logpartition function comparable to
those in the lowtemperature limit, which is the setting of optimization of
quadratic forms over the hypercube. \cite{alon2006approximating}

We consider the problem of online convex optimization against an arbitrary
adversary with bandit feedback, known as bandit convex optimization. We give
the first $\tilde{O}(\sqrt{T})$regret algorithm for this setting based on a
novel application of the ellipsoid method to online learning. This bound is
known to be tight up to logarithmic factors. Our analysis introduces new tools
in discrete convex geometry.

Analysis of sample survey data often requires adjustments to account for
missing data in the outcome variables of principal interest. Standard
adjustment methods based on item imputation or on propensity weighting factors
rely heavily on the availability of auxiliary variables for both responding and
nonresponding units. Application of these adjustment methods can be especially
challenging in cases for which the auxiliary variables are numerous and are
themselves subject to substantial incompletedata problems. This paper shows
how classification and regression trees and forests can overcome some of the
computational difficulties. An indepth simulation study based on
incompletedata patterns encountered in the U.S. Consumer Expenditure Survey is
used to compare the methods with two standard methods for estimating a
population mean in terms of bias, mean squared error, computational speed and
number of variables that can be analyzed.

A central problem in ranking is to design a ranking measure for evaluation of
ranking functions. In this paper we study, from a theoretical perspective, the
widely used Normalized Discounted Cumulative Gain (NDCG)type ranking measures.
Although there are extensive empirical studies of NDCG, little is known about
its theoretical properties. We first show that, whatever the ranking function
is, the standard NDCG which adopts a logarithmic discount, converges to 1 as
the number of items to rank goes to infinity. On the first sight, this result
is very surprising. It seems to imply that NDCG cannot differentiate good and
bad ranking functions, contradicting to the empirical success of NDCG in many
applications. In order to have a deeper understanding of ranking measures in
general, we propose a notion referred to as consistent distinguishability. This
notion captures the intuition that a ranking measure should have such a
property: For every pair of substantially different ranking functions, the
ranking measure can decide which one is better in a consistent manner on almost
all datasets. We show that NDCG with logarithmic discount has consistent
distinguishability although it converges to the same limit for all ranking
functions. We next characterize the set of all feasible discount functions for
NDCG according to the concept of consistent distinguishability. Specifically we
show that whether NDCG has consistent distinguishability depends on how fast
the discount decays, and 1/r is a critical point. We then turn to the cutoff
version of NDCG, i.e., NDCG@k. We analyze the distinguishability of NDCG@k for
various choices of k and the discount functions. Experimental results on real
Web search datasets agree well with the theory.