
Word embeddings are ubiquitous in NLP and information retrieval, but it is
unclear what they represent when the word is polysemous. Here it is shown that
multiple word senses reside in linear superposition within the word embedding
and simple sparse coding can recover vectors that approximately capture the
senses. The success of our approach, which applies to several embedding
methods, is mathematically explained using a variant of the random walk on
discourses model (Arora et al., 2016). A novel aspect of our technique is that
each extracted word sense is accompanied by one of about 2000 "discourse atoms"
that gives a succinct description of which other words cooccur with that word
sense. Discourse atoms can be of independent interest, and make the method
potentially more useful. Empirical tests are used to verify and support the
theory.

This work studies the strong duality of nonconvex matrix factorization
problems: we show that under certain dual conditions, these problems and its
dual have the same optimum. This has been well understood for convex
optimization, but little was known for nonconvex problems. We propose a novel
analytical framework and show that under certain dual conditions, the optimal
solution of the matrix factorization program is the same as its bidual and
thus the global optimality of the nonconvex program can be achieved by solving
its bidual which is convex. These dual conditions are satisfied by a wide
class of matrix factorization problems, although matrix factorization problems
are hard to solve in full generality. This analytical framework may be of
independent interest to nonconvex optimization more broadly.
We apply our framework to two prototypical matrix factorization problems:
matrix completion and robust Principal Component Analysis (PCA). These are
examples of efficiently recovering a hidden matrix given limited reliable
observations of it. Our framework shows that exact recoverability and strong
duality hold with nearlyoptimal sample complexity guarantees for matrix
completion and robust PCA.

Mixtures of Linear Regressions (MLR) is an important mixture model with many
applications. In this model, each observation is generated from one of the
several unknown linear regression components, where the identity of the
generated component is also unknown. Previous works either assume strong
assumptions on the data distribution or have high complexity. This paper
proposes a fixed parameter tractable algorithm for the problem under general
conditions, which achieves global convergence and the sample complexity scales
nearly linearly in the dimension. In particular, different from previous works
that require the data to be from the standard Gaussian, the algorithm allows
the data from Gaussians with different covariances. When the conditional number
of the covariances and the number of components are fixed, the algorithm has
nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal
computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the
data space. To the best of our knowledge, this approach provides the first such
recovery guarantee for this general setting.

We show that training of generative adversarial network (GAN) may not have
good generalization properties; e.g., training may appear successful but the
trained distribution may be far from target distribution in standard metrics.
However, generalization does occur for a weaker metric called neural net
distance. It is also shown that an approximate pure equilibrium exists in the
discriminator/generator game for a special class of generators with natural
training objectives when generator capacity and training set sizes are
moderate.
This existence of equilibrium inspires MIX+GAN protocol, which can be
combined with any existing GAN training, and empirically shown to improve some
of them.

Nonnegative matrix factorization is a basic tool for decomposing data into
the feature and weight matrices under nonnegativity constraints, and in
practice is often solved in the alternating minimization framework. However, it
is unclear whether such algorithms can recover the groundtruth feature matrix
when the weights for different features are highly correlated, which is common
in applications. This paper proposes a simple and natural alternating gradient
descent based algorithm, and shows that with a mild initialization it provably
recovers the groundtruth in the presence of strong correlations. In most
interesting cases, the correlation can be in the same order as the highest
possible. Our analysis also reveals its several favorable features including
robustness to noise. We complement our theoretical results with empirical
studies on semisynthetic datasets, demonstrating its advantage over several
popular methods in recovering the groundtruth.

Several research groups have shown how to correlate fMRI responses to the
meanings of presented stimuli. This paper presents new methods for doing so
when only a natural language annotation is available as the description of the
stimulus. We study fMRI data gathered from subjects watching an episode of BBCs
Sherlock [1], and learn bidirectional mappings between fMRI responses and
natural language representations. We show how to leverage data from multiple
subjects watching the same movie to improve the accuracy of the mappings,
allowing us to succeed at a scene classification task with 72% accuracy (random
guessing would give 4%) and at a scene ranking task with average rank in the
top 4% (random guessing would give 50%). The key ingredients are (a) the use of
the Shared Response Model (SRM) and its variant SRMICA [2, 3] to aggregate
fMRI data from multiple subjects, both of which are shown to be superior to
standard PCA in producing lowdimensional representations for the tasks in this
paper; (b) a sentence embedding technique adapted from the natural language
processing (NLP) literature [4] that produces semantic vector representation of
the annotations; (c) using previous timestep information in the featurization
of the predictor data.

Neural networks are a powerful class of functions that can be trained with
simple gradient descent to achieve stateoftheart performance on a variety of
applications. Despite their practical success, there is a paucity of results
that provide theoretical guarantees on why they are so effective. Lying in the
center of the problem is the difficulty of analyzing the nonconvex loss
function with potentially numerous local minima and saddle points. Can neural
networks corresponding to the stationary points of the loss function learn the
true target function? If yes, what are the key factors contributing to such
nice optimization properties?
In this paper, we answer these questions by analyzing onehiddenlayer neural
networks with ReLU activation, and show that despite the nonconvexity, neural
networks with diverse units have no spurious local minima. We bypass the
nonconvexity issue by directly analyzing the first order optimality condition,
and show that the loss can be made arbitrarily small if the minimum singular
value of the "extended feature matrix" is large enough. We make novel use of
techniques from kernel methods and geometric discrepancy, and identify a new
relation linking the smallest singular value to the spectrum of a kernel
function associated with the activation function and to the diversity of the
units. Our results also suggest a novel regularization function to promote unit
diversity for potentially better generalization.

A typical viral marketing model identifies influential users in a social
network to maximize a single product adoption assuming unlimited user
attention, campaign budgets, and time. In reality, multiple products need
campaigns, users have limited attention, convincing users incurs costs, and
advertisers have limited budgets and expect the adoptions to be maximized soon.
Facing these user, monetary, and timing constraints, we formulate the problem
as a submodular maximization task in a continuoustime diffusion model under
the intersection of a matroid and multiple knapsack constraints. We propose a
randomized algorithm estimating the user influence in a network
($\mathcal{V}$ nodes, $\mathcal{E}$ edges) to an accuracy of $\epsilon$
with $n=\mathcal{O}(1/\epsilon^2)$ randomizations and
$\tilde{\mathcal{O}}(n\mathcal{E}+n\mathcal{V})$ computations. By
exploiting the influence estimation algorithm as a subroutine, we develop an
adaptive threshold greedy algorithm achieving an approximation factor $k_a/(2+2
k)$ of the optimal when $k_a$ out of the $k$ knapsack constraints are active.
Extensive experiments on networks of millions of nodes demonstrate that the
proposed algorithms achieve the stateoftheart in terms of effectiveness and
scalability.

Motivated by the fact that distances between data points in many realworld
clustering instances are often based on heuristic measures, Bilu and
Linial~\cite{BL} proposed analyzing objective based clustering problems under
the assumption that the optimum clustering to the objective is preserved under
small multiplicative perturbations to distances between points. The hope is
that by exploiting the structure in such instances, one can overcome worst case
hardness results.
In this paper, we provide several results within this framework. For
centerbased objectives, we present an algorithm that can optimally cluster
instances resilient to perturbations of factor $(1 + \sqrt{2})$, solving an
open problem of Awasthi et al.~\cite{ABS10}. For $k$median, a centerbased
objective of special interest, we additionally give algorithms for a more
relaxed assumption in which we allow the optimal solution to change in a small
$\epsilon$ fraction of the points after perturbation. We give the first bounds
known for $k$median under this more realistic and more general assumption. We
also provide positive results for minsum clustering which is typically a
harder objective than centerbased objectives from approximability standpoint.
Our algorithms are based on new linkage criteria that may be of independent
interest.
Additionally, we give sublineartime algorithms, showing algorithms that can
return an implicit clustering from only access to a small random sample.

Many applications require recovering a ground truth lowrank matrix from
noisy observations of the entries, which in practice is typically formulated as
a weighted lowrank approximation problem and solved by nonconvex optimization
heuristics such as alternating minimization. In this paper, we provide provable
recovery guarantee of weighted lowrank via a simple alternating minimization
algorithm. In particular, for a natural class of matrices and weights and
without any assumption on the noise, we bound the spectral norm of the
difference between the recovered matrix and the ground truth, by the spectral
norm of the weighted noise plus an additive error that decreases exponentially
with the number of rounds of alternating minimization, from either
initialization by SVD or, more importantly, random initialization. These
provide the first theoretical results for weighted lowrank via alternating
minimization with nonbinary deterministic weights, significantly generalizing
those for matrix completion, the special case with binary weights, since our
assumptions are similar or weaker than those made in existing works.
Furthermore, this is achieved by a very simple algorithm that improves the
vanilla alternating minimization with a simple clipping step.
The key technical challenge is that under nonbinary deterministic weights,
na\"ive alternating steps will destroy the incoherence and spectral properties
of the intermediate solutions, which are needed for making progress towards the
ground truth. We show that the properties only need to hold in an average sense
and can be achieved by the clipping step.
We further provide an alternating algorithm that uses a whitening step that
keeps the properties via SDP and Rademacher rounding and thus requires weaker
assumptions. This technique can potentially be applied in some other
applications and is of independent interest.

Nonnegative matrix factorization is a popular tool for decomposing data into
feature and weight matrices under nonnegativity constraints. It enjoys
practical success but is poorly understood theoretically. This paper proposes
an algorithm that alternates between decoding the weights and updating the
features, and shows that assuming a generative model of the data, it provably
recovers the groundtruth under fairly mild conditions. In particular, its only
essential requirement on features is linear independence. Furthermore, the
algorithm uses ReLU to exploit the nonnegativity for decoding the weights, and
thus can tolerate adversarial noise that can potentially be as large as the
signal, and can tolerate unbiased noise much larger than the signal. The
analysis relies on a carefully designed coupling between two potential
functions, which we believe is of independent interest.

Semantic word embeddings represent the meaning of a word via a vector, and
are created by diverse methods. Many use nonlinear operations on cooccurrence
statistics, and have handtuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the
loglinear topic model of~\citet{mnih2007three}. The methodological novelty is
to use the prior to compute closed form expressions for word statistics. This
provides a theoretical justification for nonlinear models like PMI, word2vec,
and GloVe, as well as some hyperparameter choices. It also helps explain why
lowdimensional semantic embeddings contain linear algebraic structure that
allows solution of word analogies, as shown by~\citet{mikolov2013efficient} and
many subsequent papers.
Experimental support is provided for the generative model assumptions, the
most important of which is that latent word vectors are fairly uniformly
dispersed in space.

Kernel Principal Component Analysis (KPCA) is a key machine learning
algorithm for extracting nonlinear features from data. In the presence of a
large volume of high dimensional data collected in a distributed fashion, it
becomes very costly to communicate all of this data to a single data center and
then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a
distributed and communication efficient fashion while maintaining provable and
strong guarantees in solution quality?
In this paper, we give an affirmative answer to the question by developing a
communication efficient algorithm to perform kernel PCA in the distributed
setting. The algorithm is a clever combination of subspace embedding and
adaptive sampling techniques, and we show that the algorithm can take as input
an arbitrary configuration of distributed datasets, and compute a set of global
kernel principal components with relative error guarantees independent of the
dimension of the feature space or the total number of data points. In
particular, computing $k$ principal components with relative error $\epsilon$
over $s$ workers has communication cost $\tilde{O}(s \rho k/\epsilon+s
k^2/\epsilon^3)$ words, where $\rho$ is the average number of nonzero entries
in each data point. Furthermore, we experimented the algorithm with largescale
real world datasets and showed that the algorithm produces a high quality
kernel PCA solution while using significantly less communication than
alternative approaches.

Nonlinear component analysis such as kernel Principle Component Analysis
(KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in
machine learning, statistics and data analysis, but they can not scale up to
big datasets. Recent attempts have employed random feature approximations to
convert the problem to the primal form for linear computational complexity.
However, to obtain high quality solutions, the number of random features should
be the same order of magnitude as the number of data points, making such
approach not directly applicable to the regime with millions of data points.
We propose a simple, computationally efficient, and memory friendly algorithm
based on the "doubly stochastic gradients" to scale up a range of kernel
nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the
\emph{nonconvex} nature of these problems, our method enjoys theoretical
guarantees that it converges at the rate $\tilde{O}(1/t)$ to the global
optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our
algorithm does not require explicit orthogonalization, which is infeasible on
big datasets. We demonstrate the effectiveness and scalability of our algorithm
on large scale synthetic and real world datasets.

Generative models for deep learning are promising both to improve
understanding of the model, and yield training methods requiring fewer labeled
samples.
Recent works use generative model approaches to produce the deep net's input
given the value of a hidden layer several levels above. However, there is no
accompanying "proof of correctness" for the generative model, showing that the
feedforward deep net is the correct inference method for recovering the hidden
layer given the input. Furthermore, these models are complicated.
The current paper takes a more theoretical tack. It presents a very simple
generative model for RELU deep nets, with the following characteristics: (i)
The generative model is just the reverse of the feedforward net: if the forward
transformation at a layer is $A$ then the reverse transformation is $A^T$.
(This can be seen as an explanation of the old weight tying idea for denoising
autoencoders.) (ii) Its correctness can be proven under a clean theoretical
assumption: the edge weights in reallife deep nets behave like random numbers.
Under this assumption which is experimentally tested on reallife nets like
AlexNet it is formally proved that feed forward net is a correct inference
method for recovering the hidden layer.
The generative model suggests a simple modification for training: use the
generative model to produce synthetic data with labels and include it in the
training set. Experiments are shown to support this theory of randomlike deep
nets; and that it helps the training.

The general perception is that kernel methods are not scalable, and neural
nets are the methods of choice for nonlinear learning problems. Or have we
simply not tried hard enough for kernel methods? Here we propose an approach
that scales up kernel methods using a novel concept called "doubly stochastic
functional gradients". Our approach relies on the fact that many kernel methods
can be expressed as convex optimization problems, and we solve the problems by
making two unbiased stochastic approximations to the functional gradient, one
using random training points and another using random functions associated with
the kernel, and then descending using this noisy functional gradient. We show
that a function produced by this procedure after $t$ iterations converges to
the optimal function in the reproducing kernel Hilbert space in rate $O(1/t)$,
and achieves a generalization performance of $O(1/\sqrt{t})$. This doubly
stochasticity also allows us to avoid keeping the support vectors and to
implement the algorithm in a small memory footprint, which is linear in number
of iterations and independent of data dimension. Our approach can readily scale
kernel methods up to the regimes which are dominated by neural nets. We show
that our method can achieve competitive performance to neural nets in datasets
such as 8 million handwritten digits from MNIST, 2.3 million energy materials
from MolecularSpace, and 1 million photos from ImageNet.

Learning sparse combinations is a frequent theme in machine learning. In this
paper, we study its associated optimization problem in the distributed setting
where the elements to be combined are not centrally located but spread over a
network. We address the key challenges of balancing communication costs and
optimization errors. To this end, we propose a distributed FrankWolfe (dFW)
algorithm. We obtain theoretical guarantees on the optimization error
$\epsilon$ and communication cost that do not depend on the total number of
combining elements. We further show that the communication cost of dFW is
optimal by deriving a lowerbound on the communication cost required to
construct an $\epsilon$approximate solution. We validate our theoretical
analysis with empirical studies on synthetic and realworld data, which
demonstrate that dFW outperforms both baselines and competing methods. We also
study the performance of dFW when the conditions of our analysis are relaxed,
and show that dFW is fairly robust.

We study the distributed computing setting in which there are multiple
servers, each holding a set of points, who wish to compute functions on the
union of their point sets. A key task in this setting is Principal Component
Analysis (PCA), in which the servers would like to compute a low dimensional
subspace capturing as much of the variance of the union of their point sets as
possible. Given a procedure for approximate PCA, one can use it to
approximately solve $\ell_2$error fitting problems such as $k$means
clustering and subspace clustering. The essential properties of an approximate
distributed PCA algorithm are its communication cost and computational
efficiency for a given desired accuracy in downstream applications. We give new
algorithms and analyses for distributed PCA which lead to improved
communication and computational costs for $k$means clustering and related
problems. Our empirical study on real world data shows a speedup of orders of
magnitude, preserving communication with only a negligible degradation in
solution quality. Some of these techniques we develop, such as a general
transformation from a constant success probability subspace embedding to a high
success probability subspace embedding with a dimension and sparsity
independent of the success probability, may be of independent interest.

One of the most widely used techniques for data clustering is agglomerative
clustering. Such algorithms have been long used across many different fields
ranging from computational biology to social sciences to computer vision in
part because their output is easy to interpret. Unfortunately, it is well
known, however, that many of the classic agglomerative clustering algorithms
are not robust to noise. In this paper we propose and analyze a new robust
algorithm for bottomup agglomerative clustering. We show that our algorithm
can be used to cluster accurately in cases where the data satisfies a number of
natural properties and where the traditional agglomerative algorithms fail. We
also show how to adapt our algorithm to the inductive setting where our given
data is only a small random sample of the entire data set. Experimental
evaluations on synthetic and real world data sets show that our algorithm
achieves better performance than other hierarchical algorithms in the presence
of noise.

The typical algorithmic problem in viral marketing aims to identify a set of
influential users in a social network, who, when convinced to adopt a product,
shall influence other users in the network and trigger a large cascade of
adoptions. However, the host (the owner of an online social platform) often
faces more constraints than a single product, endless user attentions,
unlimited budget and unbounded time; in reality, multiple products need to be
advertised, each user can tolerate only a small number of recommendations,
influencing user has a cost and advertisers have only limited budgets, and the
adoptions need to be maximized within a short time window.
Given theses myriads of user, monetary, and timing constraints, it is
extremely challenging for the host to design principled and efficient viral
market algorithms with provable guarantees. In this paper, we provide a novel
solution by formulating the problem as a submodular maximization in a
continuoustime diffusion model under an intersection of a matroid and multiple
knapsack constraints. We also propose an adaptive threshold greedy algorithm
which can be faster than the traditional greedy algorithm with lazy evaluation,
and scalable to networks with million of nodes. Furthermore, our mathematical
formulation allows us to prove that the algorithm can achieve an approximation
factor of $k_a/(2+2 k)$ when $k_a$ out of the $k$ knapsack constraints are
active, which also improves over previous guarantees from combinatorial
optimization literature. In the case when influencing each user has uniform
cost, the approximation becomes even better to a factor of $1/3$. Extensive
synthetic and real world experiments demonstrate that our budgeted influence
maximization algorithm achieves thestateoftheart in terms of both
effectiveness and scalability, often beating the next best by significant
margins.

This paper provides new algorithms for distributed clustering for two popular
centerbased objectives, kmedian and kmeans. These algorithms have provable
guarantees and improve communication complexity over existing approaches.
Following a classic approach in clustering by \cite{har2004coresets}, we reduce
the problem of finding a clustering with low cost to the problem of finding a
coreset of small size. We provide a distributed method for constructing a
global coreset which improves over the previous methods by reducing the
communication complexity, and which works over general communication
topologies. Experimental results on large scale data sets show that this
approach outperforms other coresetbased distributed clustering algorithms.