• ### Indirect Gaussian Graph Learning beyond Gaussianity(1610.02590)

Jan. 13, 2019 stat.ME, stat.ML
This paper studies how to capture dependency graph structures from real data which may not be multivariate Gaussian. Starting from marginal loss functions not necessarily derived from probability distributions, we utilize an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. An iterative Gaussian graph learning algorithm is proposed with ease in implementation. Statistical analysis shows that the estimators achieve satisfactory accuracy with the error measured in terms of a proper Bregman divergence. Real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.
• ### Iterative proportional scaling revisited: a modern optimization perspective(1610.02588)

July 2, 2018 stat.CO, stat.ML
This paper revisits the classic iterative proportional scaling (IPS) from a modern optimization perspective. In contrast to the criticisms made in the literature, we show that based on a coordinate descent characterization, IPS can be slightly modified to deliver coefficient estimates, and from a majorization-minimization standpoint, IPS can be extended to handle log-affine models with features not necessarily binary-valued or nonnegative. Furthermore, some state-of-the-art optimization techniques such as block-wise computation, randomization and momentum-based acceleration can be employed to provide more scalable IPS algorithms, as well as some regularized variants of IPS for concurrent feature selection.
• ### Robust Reduced Rank Regression(1509.03938)

July 15, 2017 math.ST, stat.TH, stat.ME, stat.AP
In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly-used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modeling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parametrization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge, and the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on nonasymptotic robust analysis, which demonstrates that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending $\psi$-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined by simulation studies and real data examples.
• ### Group Regularized Estimation under Structural Hierarchy(1411.4691)

Nov. 8, 2016 stat.CO, math.ST, stat.TH, stat.ML
Variable selection for models including interactions between explanatory variables often needs to obey certain hierarchical constraints. The weak or strong structural hierarchy requires that the existence of an interaction term implies at least one or both associated main effects to be present in the model. Lately, this problem has attracted a lot of attention, but existing computational algorithms converge slow even with a moderate number of predictors. Moreover, in contrast to the rich literature on ordinary variable selection, there is a lack of statistical theory to show reasonably low error rates of hierarchical variable selection. This work investigates a new class of estimators that make use of multiple group penalties to capture structural parsimony. We give the minimax lower bounds for strong and weak hierarchical variable selection and show that the proposed estimators enjoy sharp rate oracle inequalities. A general-purpose algorithm is developed with guaranteed convergence and global optimality. Simulations and real data experiments demonstrate the efficiency and efficacy of the proposed approach.
• ### Selective Factor Extraction in High Dimensions(1403.6212)

Oct. 25, 2016 stat.ME, stat.ML
This paper studies simultaneous feature selection and extraction in supervised and unsupervised learning. We propose and investigate selective reduced rank regression for constructing optimal explanatory factors from a parsimonious subset of input features. The proposed estimators enjoy sharp oracle inequalities, and with a predictive information criterion for model selection, they adapt to unknown sparsity by controlling both rank and row support of the coefficient matrix. A class of algorithms is developed that can accommodate various convex and nonconvex sparsity-inducing penalties, and can be used for rank-constrained variable screening in high-dimensional multivariate data. The paper also showcases applications in macroeconomics and computer vision to demonstrate how low-dimensional data structures can be effectively captured by joint variable selection and projection.
• ### On the Finite-Sample Analysis of $\Theta$-estimators(1512.03987)

Oct. 8, 2016 math.ST, stat.TH
In large-scale modern data analysis, first-order optimization methods are usually favored to obtain sparse estimators in high dimensions. This paper performs theoretical analysis of a class of iterative thresholding based estimators defined in this way. Oracle inequalities are built to show the nearly minimax rate optimality of such estimators under a new type of regularity conditions. Moreover, the sequence of iterates is found to be able to approach the statistical truth within the best statistical accuracy geometrically fast. Our results also reveal different benefits brought by convex and nonconvex types of shrinkage.
• ### Feature Selection with Annealing for Computer Vision and Big Data Learning(1310.2880)

March 17, 2016 cs.CV, math.ST, stat.TH, cs.LG, stat.ML
Many computer vision and medical imaging problems are faced with learning from large-scale datasets, with millions of observations and features. In this paper we propose a novel efficient learning scheme that tightens a sparsity constraint by gradually removing variables based on a criterion and a schedule. The attractive fact that the problem size keeps dropping throughout the iterations makes it particularly suitable for big data learning. Our approach applies generically to the optimization of any differentiable loss function, and finds applications in regression, classification and ranking. The resultant algorithms build variable screening into estimation and are extremely simple to implement. We provide theoretical guarantees of convergence and selection consistency. In addition, one dimensional piecewise linear response functions are used to account for nonlinearity and a second order prior is imposed on these functions to avoid overfitting. Experiments on real and synthetic data show that the proposed method compares very well with other state of the art methods in regression, classification and ranking while being computationally very efficient and scalable.
• ### Sparse Generalized Principal Component Analysis for Large-scale Applications beyond Gaussianity(1512.03883)

Jan. 28, 2016 stat.CO, stat.ML
Principal Component Analysis (PCA) is a dimension reduction technique. It produces inconsistent estimators when the dimensionality is moderate to high, which is often the problem in modern large-scale applications where algorithm scalability and model interpretability are difficult to achieve, not to mention the prevalence of missing values. While existing sparse PCA methods alleviate inconsistency, they are constrained to the Gaussian assumption of classical PCA and fail to address algorithm scalability issues. We generalize sparse PCA to the broad exponential family distributions under high-dimensional setup, with built-in treatment for missing values. Meanwhile we propose a family of iterative sparse generalized PCA (SG-PCA) algorithms such that despite the non-convexity and non-smoothness of the optimization task, the loss function decreases in every iteration. In terms of ease and intuitive parameter tuning, our sparsity-inducing regularization is far superior to the popular Lasso. Furthermore, to promote overall scalability, accelerated gradient is integrated for fast convergence, while a progressive screening technique gradually squeezes out nuisance dimensions of a large-scale problem for feasible optimization. High-dimensional simulation and real data experiments demonstrate the efficiency and efficacy of SG-PCA.
• ### Robust Orthogonal Complement Principal Component Analysis(1410.1173)

Jan. 28, 2016 stat.CO, stat.ME
Recently, the robustification of principal component analysis has attracted lots of attention from statisticians, engineers and computer scientists. In this work we study the type of outliers that are not necessarily apparent in the original observation space but can seriously affect the principal subspace estimation. Based on a mathematical formulation of such transformed outliers, a novel robust orthogonal complement principal component analysis (ROC-PCA) is proposed. The framework combines the popular sparsity-enforcing and low rank regularization techniques to deal with row-wise outliers as well as element-wise outliers. A non-asymptotic oracle inequality guarantees the accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle the computational challenges, an efficient algorithm is developed on the basis of Stiefel manifold optimization and iterative thresholding. Furthermore, a batch variant is proposed to significantly reduce the cost in ultra high dimensions. The paper also points out a pitfall of a common practice of SVD reduction in robust PCA. Experiments show the effectiveness and efficiency of ROC-PCA in both synthetic and real data.
• ### Joint Association Graph Screening and Decomposition for Large-scale Linear Dynamical Systems(1411.4598)

Nov. 17, 2014 stat.CO, stat.ML
This paper studies large-scale dynamical networks where the current state of the system is a linear transformation of the previous state, contaminated by a multivariate Gaussian noise. Examples include stock markets, human brains and gene regulatory networks. We introduce a transition matrix to describe the evolution, which can be translated to a directed Granger transition graph, and use the concentration matrix of the Gaussian noise to capture the second-order relations between nodes, which can be translated to an undirected conditional dependence graph. We propose regularizing the two graphs jointly in topology identification and dynamics estimation. Based on the notion of joint association graph (JAG), we develop a joint graphical screening and estimation (JGSE) framework for efficient network learning in big data. In particular, our method can pre-determine and remove unnecessary edges based on the joint graphical structure, referred to as JAG screening, and can decompose a large network into smaller subnetworks in a robust manner, referred to as JAG decomposition. JAG screening and decomposition can reduce the problem size and search space for fine estimation at a later stage. Experiments on both synthetic data and real-world applications show the effectiveness of the proposed framework in large-scale network topology identification and dynamics estimation.
• ### Learning Topology and Dynamics of Large Recurrent Neural Networks(1410.1174)

Oct. 5, 2014 stat.CO, stat.ML
Large-scale recurrent networks have drawn increasing attention recently because of their capabilities in modeling a large variety of real-world phenomena and physical mechanisms. This paper studies how to identify all authentic connections and estimate system parameters of a recurrent network, given a sequence of node observations. This task becomes extremely challenging in modern network applications, because the available observations are usually very noisy and limited, and the associated dynamical system is strongly nonlinear. By formulating the problem as multivariate sparse sigmoidal regression, we develop simple-to-implement network learning algorithms, with rigorous convergence guarantee in theory, for a variety of sparsity-promoting penalty forms. A quantile variant of progressive recurrent network screening is proposed for efficient computation and allows for direct cardinality control of network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense, and integrate such stability constraints into sparse network learning. Experiments show excellent performance of the proposed algorithms in network topology identification and forecasting.
• ### The Group Square-Root Lasso: Theoretical Properties and Fast Algorithms(1302.0261)

July 31, 2013 stat.CO, math.ST, stat.TH
We introduce and study the Group Square-Root Lasso (GSRL) method for estimation in high dimensional sparse regression models with group structure. The new estimator minimizes the square root of the residual sum of squares plus a penalty term proportional to the sum of the Euclidean norms of groups of the regression parameter vector. The net advantage of the method over the existing Group Lasso (GL)-type procedures consists in the form of the proportionality factor used in the penalty term, which for GSRL is independent of the variance of the error terms. This is of crucial importance in models with more parameters than the sample size, when estimating the variance of the noise becomes as difficult as the original problem. We show that the GSRL estimator adapts to the unknown sparsity of the regression vector, and has the same optimal estimation and prediction accuracy as the GL estimators, under the same minimal conditions on the model. This extends the results recently established for the Square-Root Lasso, for sparse regression without group structure. Moreover, as a new type of result for Square-Root Lasso methods, with or without groups, we study correct pattern recovery, and show that it can be achieved under conditions similar to those needed by the Lasso or Group-Lasso-type methods, but with a simplified tuning strategy. We implement our method via a new algorithm, with proved convergence properties, which, unlike existing methods, scales well with the dimension of the problem. Our simulation studies support strongly our theoretical findings.
• ### Group Iterative Spectrum Thresholding for Super-Resolution Sparse Spectral Selection(1207.6684)

July 29, 2013 stat.ML
Recently, sparsity-based algorithms are proposed for super-resolution spectrum estimation. However, to achieve adequately high resolution in real-world signal analysis, the dictionary atoms have to be close to each other in frequency, thereby resulting in a coherent design. The popular convex compressed sensing methods break down in presence of high coherence and large noise. We propose a new regularization approach to handle model collinearity and obtain parsimonious frequency selection simultaneously. It takes advantage of the pairing structure of sine and cosine atoms in the frequency dictionary. A probabilistic spectrum screening is also developed for fast computation in high dimensions. A data-resampling version of high-dimensional Bayesian Information Criterion is used to determine the regularization parameters. Experiments show the efficacy and efficiency of the proposed algorithms in challenging situations with small sample size, high frequency resolution, and low signal-to-noise ratio.
• ### Joint variable and rank selection for parsimonious estimation of high-dimensional matrices(1110.3556)

Feb. 13, 2013 math.ST, stat.TH, stat.ME, stat.ML
We propose dimension reduction methods for sparse, high-dimensional multivariate response regression models. Both the number of responses and that of the predictors may exceed the sample size. Sometimes viewed as complementary, predictor selection and rank reduction are the most popular strategies for obtaining lower-dimensional approximations of the parameter matrix in such models. We show in this article that important gains in prediction accuracy can be obtained by considering them jointly. We motivate a new class of sparse multivariate regression models, in which the coefficient matrix has low rank and zero rows or can be well approximated by such a matrix. Next, we introduce estimators that are based on penalized least squares, with novel penalties that impose simultaneous row and rank restrictions on the coefficient matrix. We prove that these estimators indeed adapt to the unknown matrix sparsity and have fast rates of convergence. We support our theoretical results with an extensive simulation study and two data analyses.
• ### Reduced Rank Vector Generalized Linear Models for Feature Extraction(1007.3098)

May 9, 2012 stat.ML
Supervised linear feature extraction can be achieved by fitting a reduced rank multivariate model. This paper studies rank penalized and rank constrained vector generalized linear models. From the perspective of thresholding rules, we build a framework for fitting singular value penalized models and use it for feature extraction. Through solving the rank constraint form of the problem, we propose progressive feature space reduction for fast computation in high dimensions with little performance loss. A novel projective cross-validation is proposed for parameter tuning in such nonconvex setups. Real data applications are given to show the power of the methodology in supervised dimension reduction and feature extraction.
• ### Approximating Higher-Order Distances Using Random Projections(1203.3492)

March 15, 2012 cs.LG, stat.ML
We provide a simple method and relevant theoretical analysis for efficiently estimating higher-order lp distances. While the analysis mainly focuses on l4, our methodology extends naturally to p = 6,8,10..., (i.e., when p is even). Distance-based methods are popular in machine learning. In large-scale applications, storing, computing, and retrieving the distances can be both space and time prohibitive. Efficient algorithms exist for estimating lp distances if 0 < p <= 2. The task for p > 2 is known to be difficult. Our work partially fills this gap.
• ### An Iterative Algorithm for Fitting Nonconvex Penalized Generalized Linear Models with Grouped Predictors(0911.5460)

Nov. 10, 2011 stat.ME, stat.ML
High-dimensional data pose challenges in statistical learning and modeling. Sometimes the predictors can be naturally grouped where pursuing the between-group sparsity is desired. Collinearity may occur in real-world high-dimensional applications where the popular $l_1$ technique suffers from both selection inconsistency and prediction inaccuracy. Moreover, the problems of interest often go beyond Gaussian models. To meet these challenges, nonconvex penalized generalized linear models with grouped predictors are investigated and a simple-to-implement algorithm is proposed for computation. A rigorous theoretical result guarantees its convergence and provides tight preliminary scaling. This framework allows for grouped predictors and nonconvex penalties, including the discrete $l_0$ and the `$l_0+l_2$' type penalties. Penalty design and parameter tuning for nonconvex penalties are examined. Applications of super-resolution spectrum estimation in signal processing and cancer classification with joint gene selection in bioinformatics show the performance improvement by nonconvex penalized estimation.
• ### Outlier Detection Using Nonconvex Penalized Regression(1006.2592)

Oct. 17, 2011 stat.CO, stat.ME, cs.LG
This paper studies the outlier detection problem from the point of view of penalized regressions. Our regression model adds one mean shift parameter for each of the $n$ data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual $L_1$ penalty yields a convex criterion, but we find that it fails to deliver a robust estimator. The $L_1$ penalty corresponds to soft thresholding. We introduce a thresholding (denoted by $\Theta$) based iterative procedure for outlier detection ($\Theta$-IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We find that $\Theta$-IPOD is much faster than iteratively reweighted least squares for large data because each iteration costs at most $O(np)$ (and sometimes much less) avoiding an $O(np^2)$ least squares estimate. We describe the connection between $\Theta$-IPOD and $M$-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on BIC. The tuned $\Theta$-IPOD shows outstanding performance in identifying outliers in various situations in comparison to other existing approaches. This methodology extends to high-dimensional modeling with $p\gg n$, if both the coefficient vector and the outlier pattern are sparse.
• ### Optimal selection of reduced rank estimators of high-dimensional matrices(1004.2995)

Oct. 17, 2011 math.ST, stat.TH, stat.ME
We introduce a new criterion, the Rank Selection Criterion (RSC), for selecting the optimal reduced rank estimator of the coefficient matrix in multivariate response regression models. The corresponding RSC estimator minimizes the Frobenius norm of the fit plus a regularization term proportional to the number of parameters in the reduced rank model. The rank of the RSC estimator provides a consistent estimator of the rank of the coefficient matrix; in general, the rank of our estimator is a consistent estimate of the effective rank, which we define to be the number of singular values of the target matrix that are appropriately large. The consistency results are valid not only in the classic asymptotic regime, when $n$, the number of responses, and $p$, the number of predictors, stay bounded, and $m$, the number of observations, grows, but also when either, or both, $n$ and $p$ grow, possibly much faster than $m$. We establish minimax optimal bounds on the mean squared errors of our estimators. Our finite sample performance bounds for the RSC estimator show that it achieves the optimal balance between the approximation error and the penalty term. Furthermore, our procedure has very low computational complexity, linear in the number of candidate models, making it particularly appealing for large scale problems. We contrast our estimator with the nuclear norm penalized least squares (NNP) estimator, which has an inherently higher computational complexity than RSC, for multivariate regression models. We show that NNP has estimation properties similar to those of RSC, albeit under stronger conditions. However, it is not as parsimonious as RSC. We offer a simple correction of the NNP estimator which leads to consistent rank estimation.
• ### Thresholding-based Iterative Selection Procedures for Model Selection and Shrinkage(0812.5061)

Nov. 29, 2009 math.ST, stat.TH
This paper discusses a class of thresholding-based iterative selection procedures (TISP) for model selection and shrinkage. People have long before noticed the weakness of the convex $l_1$-constraint (or the soft-thresholding) in wavelets and have designed many different forms of nonconvex penalties to increase model sparsity and accuracy. But for a nonorthogonal regression matrix, there is great difficulty in both investigating the performance in theory and solving the problem in computation. TISP provides a simple and efficient way to tackle this so that we successfully borrow the rich results in the orthogonal design to solve the nonconvex penalized regression for a general design matrix. Our starting point is, however, thresholding rules rather than penalty functions. Indeed, there is a universal connection between them. But a drawback of the latter is its non-unique form, and our approach greatly facilitates the computation and the analysis. In fact, we are able to build the convergence theorem and explore theoretical properties of the selection and estimation via TISP nonasymptotically. More importantly, a novel Hybrid-TISP is proposed based on hard-thresholding and ridge-thresholding. It provides a fusion between the $l_0$-penalty and the $l_2$-penalty, and adaptively achieves the right balance between shrinkage and selection in statistical modeling. In practice, Hybrid-TISP shows superior performance in test-error and is parsimonious.