• Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathematically explained using a variant of the random walk on discourses model (Arora et al., 2016). A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense. Discourse atoms can be of independent interest, and make the method potentially more useful. Empirical tests are used to verify and support the theory.
  • This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and its dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and show that under certain dual conditions, the optimal solution of the matrix factorization program is the same as its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization problems are hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly. We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis (PCA). These are examples of efficiently recovering a hidden matrix given limited reliable observations of it. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity guarantees for matrix completion and robust PCA.
  • Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of the several unknown linear regression components, where the identity of the generated component is also unknown. Previous works either assume strong assumptions on the data distribution or have high complexity. This paper proposes a fixed parameter tractable algorithm for the problem under general conditions, which achieves global convergence and the sample complexity scales nearly linearly in the dimension. In particular, different from previous works that require the data to be from the standard Gaussian, the algorithm allows the data from Gaussians with different covariances. When the conditional number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting.
  • We show that training of generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a special class of generators with natural training objectives when generator capacity and training set sizes are moderate. This existence of equilibrium inspires MIX+GAN protocol, which can be combined with any existing GAN training, and empirically shown to improve some of them.
  • Non-negative matrix factorization is a basic tool for decomposing data into the feature and weight matrices under non-negativity constraints, and in practice is often solved in the alternating minimization framework. However, it is unclear whether such algorithms can recover the ground-truth feature matrix when the weights for different features are highly correlated, which is common in applications. This paper proposes a simple and natural alternating gradient descent based algorithm, and shows that with a mild initialization it provably recovers the ground-truth in the presence of strong correlations. In most interesting cases, the correlation can be in the same order as the highest possible. Our analysis also reveals its several favorable features including robustness to noise. We complement our theoretical results with empirical studies on semi-synthetic datasets, demonstrating its advantage over several popular methods in recovering the ground-truth.
  • Several research groups have shown how to correlate fMRI responses to the meanings of presented stimuli. This paper presents new methods for doing so when only a natural language annotation is available as the description of the stimulus. We study fMRI data gathered from subjects watching an episode of BBCs Sherlock [1], and learn bidirectional mappings between fMRI responses and natural language representations. We show how to leverage data from multiple subjects watching the same movie to improve the accuracy of the mappings, allowing us to succeed at a scene classification task with 72% accuracy (random guessing would give 4%) and at a scene ranking task with average rank in the top 4% (random guessing would give 50%). The key ingredients are (a) the use of the Shared Response Model (SRM) and its variant SRM-ICA [2, 3] to aggregate fMRI data from multiple subjects, both of which are shown to be superior to standard PCA in producing low-dimensional representations for the tasks in this paper; (b) a sentence embedding technique adapted from the natural language processing (NLP) literature [4] that produces semantic vector representation of the annotations; (c) using previous timestep information in the featurization of the predictor data.
  • Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.
  • A typical viral marketing model identifies influential users in a social network to maximize a single product adoption assuming unlimited user attention, campaign budgets, and time. In reality, multiple products need campaigns, users have limited attention, convincing users incurs costs, and advertisers have limited budgets and expect the adoptions to be maximized soon. Facing these user, monetary, and timing constraints, we formulate the problem as a submodular maximization task in a continuous-time diffusion model under the intersection of a matroid and multiple knapsack constraints. We propose a randomized algorithm estimating the user influence in a network ($|\mathcal{V}|$ nodes, $|\mathcal{E}|$ edges) to an accuracy of $\epsilon$ with $n=\mathcal{O}(1/\epsilon^2)$ randomizations and $\tilde{\mathcal{O}}(n|\mathcal{E}|+n|\mathcal{V}|)$ computations. By exploiting the influence estimation algorithm as a subroutine, we develop an adaptive threshold greedy algorithm achieving an approximation factor $k_a/(2+2 k)$ of the optimal when $k_a$ out of the $k$ knapsack constraints are active. Extensive experiments on networks of millions of nodes demonstrate that the proposed algorithms achieve the state-of-the-art in terms of effectiveness and scalability.
  • Motivated by the fact that distances between data points in many real-world clustering instances are often based on heuristic measures, Bilu and Linial~\cite{BL} proposed analyzing objective based clustering problems under the assumption that the optimum clustering to the objective is preserved under small multiplicative perturbations to distances between points. The hope is that by exploiting the structure in such instances, one can overcome worst case hardness results. In this paper, we provide several results within this framework. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to perturbations of factor $(1 + \sqrt{2})$, solving an open problem of Awasthi et al.~\cite{ABS10}. For $k$-median, a center-based objective of special interest, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $\epsilon$ fraction of the points after perturbation. We give the first bounds known for $k$-median under this more realistic and more general assumption. We also provide positive results for min-sum clustering which is typically a harder objective than center-based objectives from approximability standpoint. Our algorithms are based on new linkage criteria that may be of independent interest. Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
  • Many applications require recovering a ground truth low-rank matrix from noisy observations of the entries, which in practice is typically formulated as a weighted low-rank approximation problem and solved by non-convex optimization heuristics such as alternating minimization. In this paper, we provide provable recovery guarantee of weighted low-rank via a simple alternating minimization algorithm. In particular, for a natural class of matrices and weights and without any assumption on the noise, we bound the spectral norm of the difference between the recovered matrix and the ground truth, by the spectral norm of the weighted noise plus an additive error that decreases exponentially with the number of rounds of alternating minimization, from either initialization by SVD or, more importantly, random initialization. These provide the first theoretical results for weighted low-rank via alternating minimization with non-binary deterministic weights, significantly generalizing those for matrix completion, the special case with binary weights, since our assumptions are similar or weaker than those made in existing works. Furthermore, this is achieved by a very simple algorithm that improves the vanilla alternating minimization with a simple clipping step. The key technical challenge is that under non-binary deterministic weights, na\"ive alternating steps will destroy the incoherence and spectral properties of the intermediate solutions, which are needed for making progress towards the ground truth. We show that the properties only need to hold in an average sense and can be achieved by the clipping step. We further provide an alternating algorithm that uses a whitening step that keeps the properties via SDP and Rademacher rounding and thus requires weaker assumptions. This technique can potentially be applied in some other applications and is of independent interest.
  • Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground-truth under fairly mild conditions. In particular, its only essential requirement on features is linear independence. Furthermore, the algorithm uses ReLU to exploit the non-negativity for decoding the weights, and thus can tolerate adversarial noise that can potentially be as large as the signal, and can tolerate unbiased noise much larger than the signal. The analysis relies on a carefully designed coupling between two potential functions, which we believe is of independent interest.
  • Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of~\citet{mnih2007three}. The methodological novelty is to use the prior to compute closed form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by~\citet{mikolov2013efficient} and many subsequent papers. Experimental support is provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.
  • Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing $k$ principal components with relative error $\epsilon$ over $s$ workers has communication cost $\tilde{O}(s \rho k/\epsilon+s k^2/\epsilon^3)$ words, where $\rho$ is the average number of nonzero entries in each data point. Furthermore, we experimented the algorithm with large-scale real world datasets and showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches.
  • Nonlinear component analysis such as kernel Principle Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they can not scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be the same order of magnitude as the number of data points, making such approach not directly applicable to the regime with millions of data points. We propose a simple, computationally efficient, and memory friendly algorithm based on the "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the \emph{non-convex} nature of these problems, our method enjoys theoretical guarantees that it converges at the rate $\tilde{O}(1/t)$ to the global optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.
  • Generative models for deep learning are promising both to improve understanding of the model, and yield training methods requiring fewer labeled samples. Recent works use generative model approaches to produce the deep net's input given the value of a hidden layer several levels above. However, there is no accompanying "proof of correctness" for the generative model, showing that the feedforward deep net is the correct inference method for recovering the hidden layer given the input. Furthermore, these models are complicated. The current paper takes a more theoretical tack. It presents a very simple generative model for RELU deep nets, with the following characteristics: (i) The generative model is just the reverse of the feedforward net: if the forward transformation at a layer is $A$ then the reverse transformation is $A^T$. (This can be seen as an explanation of the old weight tying idea for denoising autoencoders.) (ii) Its correctness can be proven under a clean theoretical assumption: the edge weights in real-life deep nets behave like random numbers. Under this assumption ---which is experimentally tested on real-life nets like AlexNet--- it is formally proved that feed forward net is a correct inference method for recovering the hidden layer. The generative model suggests a simple modification for training: use the generative model to produce synthetic data with labels and include it in the training set. Experiments are shown to support this theory of random-like deep nets; and that it helps the training.
  • The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space in rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This doubly stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in number of iterations and independent of data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.
  • Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error $\epsilon$ and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower-bound on the communication cost required to construct an $\epsilon$-approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.
  • We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve $\ell_2$-error fitting problems such as $k$-means clustering and subspace clustering. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.
  • One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have been long used across many different fields ranging from computational biology to social sciences to computer vision in part because their output is easy to interpret. Unfortunately, it is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set. Experimental evaluations on synthetic and real world data sets show that our algorithm achieves better performance than other hierarchical algorithms in the presence of noise.
  • The typical algorithmic problem in viral marketing aims to identify a set of influential users in a social network, who, when convinced to adopt a product, shall influence other users in the network and trigger a large cascade of adoptions. However, the host (the owner of an online social platform) often faces more constraints than a single product, endless user attentions, unlimited budget and unbounded time; in reality, multiple products need to be advertised, each user can tolerate only a small number of recommendations, influencing user has a cost and advertisers have only limited budgets, and the adoptions need to be maximized within a short time window. Given theses myriads of user, monetary, and timing constraints, it is extremely challenging for the host to design principled and efficient viral market algorithms with provable guarantees. In this paper, we provide a novel solution by formulating the problem as a submodular maximization in a continuous-time diffusion model under an intersection of a matroid and multiple knapsack constraints. We also propose an adaptive threshold greedy algorithm which can be faster than the traditional greedy algorithm with lazy evaluation, and scalable to networks with million of nodes. Furthermore, our mathematical formulation allows us to prove that the algorithm can achieve an approximation factor of $k_a/(2+2 k)$ when $k_a$ out of the $k$ knapsack constraints are active, which also improves over previous guarantees from combinatorial optimization literature. In the case when influencing each user has uniform cost, the approximation becomes even better to a factor of $1/3$. Extensive synthetic and real world experiments demonstrate that our budgeted influence maximization algorithm achieves the-state-of-the-art in terms of both effectiveness and scalability, often beating the next best by significant margins.
  • This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by \cite{har2004coresets}, we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.