• ### Tensor Methods for Nonlinear Matrix Completion(1804.10266)

April 26, 2018 cs.LG, stat.ML
In the low rank matrix completion (LRMC) problem, the low rank assumption means that the columns (or rows) of the matrix to be completed are points on a low-dimensional linear algebraic variety. This paper extends this thinking to cases where the columns are points on a low-dimensional nonlinear algebraic variety, a problem we call Low Algebraic Dimension Matrix Completion (LADMC). Matrices whose columns belong to a union of subspaces (UoS) are an important special case. We propose a LADMC algorithm that leverages existing LRMC methods on a tensorized representation of the data. For example, a second-order tensorization representation is formed by taking the outer product of each column with itself, and we consider higher order tensorizations as well. This approach will succeed in many cases where traditional LRMC is guaranteed to fail because the data are low-rank in the tensorized representation but not in the original representation. We also provide a formal mathematical justification for the success of our method. In particular, we show bounds of the rank of these data in the tensorized representation, and we prove sampling requirements to guarantee uniqueness of the solution. Interestingly, the sampling requirements of our LADMC algorithm nearly match the information theoretic lower bounds for matrix completion under a UoS model. We also provide experimental results showing that the new approach significantly outperforms existing state-of-the-art methods for matrix completion in many situations.
• ### Algebraic Variety Models for High-Rank Matrix Completion(1703.09631)

March 28, 2017 stat.ML
We consider a generalization of low-rank matrix completion to the case where the data belongs to an algebraic variety, i.e. each data point is a solution to a system of polynomial equations. In this case the original matrix is possibly high-rank, but it becomes low-rank after mapping each column to a higher dimensional space of monomial features. Many well-studied extensions of linear models, including affine subspaces and their union, can be described by a variety model. In addition, varieties can be used to model a richer class of nonlinear quadratic and higher degree curves and surfaces. We study the sampling requirements for matrix completion under a variety model with a focus on a union of affine subspaces. We also propose an efficient matrix completion algorithm that minimizes a convex or non-convex surrogate of the rank of the matrix of monomial features. Our algorithm uses the well-known "kernel trick" to avoid working directly with the high-dimensional monomial matrix. We show the proposed algorithm is able to recover synthetically generated data up to the predicted sampling complexity bounds. The proposed algorithm also outperforms standard low rank matrix completion and subspace clustering techniques in experiments with real data.
• ### A Characterization of Deterministic Sampling Patterns for Low-Rank Matrix Completion(1503.02596)

Oct. 11, 2016 math.AG, cs.LG, stat.ML
Low-rank matrix completion (LRMC) problems arise in a wide variety of applications. Previous theory mainly provides conditions for completion under missing-at-random samplings. This paper studies deterministic conditions for completion. An incomplete $d \times N$ matrix is finitely rank-$r$ completable if there are at most finitely many rank-$r$ matrices that agree with all its observed entries. Finite completability is the tipping point in LRMC, as a few additional samples of a finitely completable matrix guarantee its unique completability. The main contribution of this paper is a deterministic sampling condition for finite completability. We use this to also derive deterministic sampling conditions for unique completability that can be efficiently verified. We also show that under uniform random sampling schemes, these conditions are satisfied with high probability if $O(\max\{r,\log d\})$ entries per column are observed. These findings have several implications on LRMC regarding lower bounds, sample and computational complexity, the role of coherence, adaptive settings and the validation of any completion algorithm. We complement our theoretical results with experiments that support our findings and motivate future analysis of uncharted sampling regimes.
• ### Deterministic Conditions for Subspace Identifiability from Incomplete Sampling(1410.0633)

May 24, 2015 math.CO, cs.LG, stat.ML
Consider a generic $r$-dimensional subspace of $\mathbb{R}^d$, $r<d$, and suppose that we are only given projections of this subspace onto small subsets of the canonical coordinates. The paper establishes necessary and sufficient deterministic conditions on the subsets for subspace identifiability.
• ### Sparse Estimation with Strongly Correlated Variables using Ordered Weighted L1 Regularization(1409.4005)

Sept. 14, 2014 stat.ML
This paper studies ordered weighted L1 (OWL) norm regularization for sparse estimation problems with strongly correlated variables. We prove sufficient conditions for clustering based on the correlation/colinearity of variables using the OWL norm, of which the so-called OSCAR is a particular case. Our results extend previous ones for OSCAR in several ways: for the squared error loss, our conditions hold for the more general OWL norm and under weaker assumptions; we also establish clustering conditions for the absolute error loss, which is, as far as we know, a novel result. Furthermore, we characterize the statistical performance of OWL norm regularization for generative models in which certain clusters of regression variables are strongly (even perfectly) correlated, but variables in different clusters are uncorrelated. We show that if the true p-dimensional signal generating the data involves only s of the clusters, then O(s log p) samples suffice to accurately estimate the signal, regardless of the number of coefficients within the clusters. The estimation of s-sparse signals with completely independent variables requires just as many measurements. In other words, using the OWL we pay no price (in terms of the number of measurements) for the presence of strongly correlated variables.
• ### Near-Optimal Adaptive Compressed Sensing(1306.6239)

April 29, 2014 cs.IT, math.IT, stat.ML
This paper proposes a simple adaptive sensing and group testing algorithm for sparse signal recovery. The algorithm, termed Compressive Adaptive Sense and Search (CASS), is shown to be near-optimal in that it succeeds at the lowest possible signal-to-noise-ratio (SNR) levels, improving on previous work in adaptive compressed sensing. Like traditional compressed sensing based on random non-adaptive design matrices, the CASS algorithm requires only k log n measurements to recover a k-sparse signal of dimension n. However, CASS succeeds at SNR levels that are a factor log n less than required by standard compressed sensing. From the point of view of constructing and implementing the sensing operation as well as computing the reconstruction, the proposed algorithm is substantially less computationally intensive than standard compressed sensing. CASS is also demonstrated to perform considerably better in practice through simulation. To the best of our knowledge, this is the first demonstration of an adaptive compressed sensing algorithm with near-optimal theoretical guarantees and excellent practical performance. This paper also shows that methods like compressed sensing, group testing, and pooling have an advantage beyond simply reducing the number of measurements or tests -- adaptive versions of such methods can also improve detection and estimation performance when compared to non-adaptive direct (uncompressed) sensing.
• ### Active Learning for Undirected Graphical Model Selection(1404.3418)

April 13, 2014 cs.IT, math.IT, math.ST, stat.TH, stat.ML
This paper studies graphical model selection, i.e., the problem of estimating a graph of statistical relationships among a collection of random variables. Conventional graphical model selection algorithms are passive, i.e., they require all the measurements to have been collected before processing begins. We propose an active learning algorithm that uses junction tree representations to adapt future measurements based on the information gathered from prior measurements. We prove that, under certain conditions, our active learning algorithm requires fewer scalar measurements than any passive algorithm to reliably estimate a graph. A range of numerical results validate our theory and demonstrates the benefits of active learning.
• ### The Geometry of Generalized Binary Search(0910.4397)

June 25, 2013 cs.IT, math.IT, math.ST, stat.TH, stat.ML
This paper investigates the problem of determining a binary-valued function through a sequence of strategically selected queries. The focus is an algorithm called Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued function through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. This paper develops novel incoherence and geometric conditions under which GBS achieves the information-theoretically optimal query complexity; i.e., given a collection of N hypotheses, GBS terminates with the correct function after no more than a constant times log N queries. Furthermore, a noise-tolerant version of GBS is developed that also achieves the optimal query complexity. These results are applied to learning halfspaces, a problem arising routinely in image processing and machine learning.
• ### The Sample Complexity of Search over Multiple Populations(1209.1380)

May 1, 2013 cs.IT, math.IT, stat.ML
This paper studies the sample complexity of searching over multiple populations. We consider a large number of populations, each corresponding to either distribution P0 or P1. The goal of the search problem studied here is to find one population corresponding to distribution P1 with as few samples as possible. The main contribution is to quantify the number of samples needed to correctly find one such population. We consider two general approaches: non-adaptive sampling methods, which sample each population a predetermined number of times until a population following P1 is found, and adaptive sampling methods, which employ sequential sampling schemes for each population. We first derive a lower bound on the number of samples required by any sampling scheme. We then consider an adaptive procedure consisting of a series of sequential probability ratio tests, and show it comes within a constant factor of the lower bound. We give explicit expressions for this constant when samples of the populations follow Gaussian and Bernoulli distributions. An alternative adaptive scheme is discussed which does not require full knowledge of P1, and comes within a constant factor of the optimal scheme. For comparison, a lower bound on the sampling requirements of any non-adaptive scheme is presented.
• ### Query Complexity of Derivative-Free Optimization(1209.2434)

Sept. 11, 2012 cs.LG, stat.ML
This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.
• ### Near-Optimal Compressive Binary Search(1203.1804)

May 8, 2012 cs.IT, math.IT
We propose a simple modification to the recently proposed compressive binary search. The modification removes an unnecessary and suboptimal factor of log log n from the SNR requirement, making the procedure optimal (up to a small constant). Simulations show that the new procedure performs significantly better in practice as well. We also contrast this problem with the more well known problem of noisy binary search.
• ### Active Ranking using Pairwise Comparisons(1109.3701)

Dec. 10, 2011 cs.IT, math.IT, cs.LG, stat.ML
This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $n log_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $R^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $d log n$ adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.
• ### Convex Approaches to Model Wavelet Sparsity Patterns(1104.4385)

April 22, 2011 cs.CV, stat.ML
Statistical dependencies among wavelet coefficients are commonly represented by graphical models such as hidden Markov trees(HMTs). However, in linear inverse problems such as deconvolution, tomography, and compressed sensing, the presence of a sensing or observation matrix produces a linear mixing of the simple Markovian dependency structure. This leads to reconstruction problems that are non-convex optimizations. Past work has dealt with this issue by resorting to greedy or suboptimal iterative reconstruction methods. In this paper, we propose new modeling approaches based on group-sparsity penalties that leads to convex optimizations that can be solved exactly and efficiently. We show that the methods we develop perform significantly better in deconvolution and compressed sensing applications, while being as computationally efficient as standard coefficient-wise approaches such as lasso.
• ### Detecting Weak but Hierarchically-Structured Patterns in Networks(1003.0205)

Feb. 28, 2010 cs.IT, math.IT, math.ST, stat.TH, cs.LG
The ability to detect weak distributed activation patterns in networks is critical to several applications, such as identifying the onset of anomalous activity or incipient congestion in the Internet, or faint traces of a biochemical spread by a sensor network. This is a challenging problem since weak distributed patterns can be invisible in per node statistics as well as a global network-wide aggregate. Most prior work considers situations in which the activation/non-activation of each node is statistically independent, but this is unrealistic in many problems. In this paper, we consider structured patterns arising from statistical dependencies in the activation process. Our contributions are three-fold. First, we propose a sparsifying transform that succinctly represents structured activation patterns that conform to a hierarchical dependency graph. Second, we establish that the proposed transform facilitates detection of very weak activation patterns that cannot be detected with existing methods. Third, we show that the structure of the hierarchical dependency graph governing the activation process, and hence the network transform, can be learnt from very few (logarithmic in network size) independent snapshots of network activity.
• ### Multiscale likelihood analysis and complexity penalized estimation(math/0406424)

June 22, 2004 math.ST, stat.TH
We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L^2 function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework that includes models for continuous, count and categorical data types.