• ### On Lasso refitting strategies(1707.05232)

Nov. 12, 2018 math.ST, stat.TH, stat.AP
A well-known drawback of $\ell_1$-penalized estimators is the systematic shrinkage of the large coefficients towards zero. A simple remedy is to treat Lasso as a model-selection procedure and to perform a second refitting step on the selected support. In this work we formalize the notion of refitting and provide oracle bounds for arbitrary refitting procedures of the Lasso solution. One of the most widely used refitting techniques, which is based on Least-Squares, may cause a problem of interpretability, since the signs of the refitted estimator might be flipped with respect to the original estimator. This problem arises from the fact that the Least-Squares refitting considers only the support of the Lasso solution, ignoring any information about signs or amplitudes. To address this, we define a sign-consistent refitting as an arbitrary refitting procedure that preserves the signs of the first-step Lasso solution, and we provide oracle inequalities for such estimators. We then consider two special refitting strategies: Bregman Lasso and Boosted Lasso. Bregman Lasso has the useful property of converging to the Sign-Least-Squares refitting (Least-Squares with sign constraints), which provides greater interpretability. We additionally study the Bregman Lasso refitting in the case of orthogonal design, providing simple intuition behind the proposed method. Boosted Lasso, in contrast, uses information about the magnitudes of the first Lasso step and allows one to derive better oracle rates for prediction. Finally, we conduct an extensive numerical study to show the advantages of one approach over others in different synthetic and semi-real scenarios.
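The shrinkage and the refitting step described in this abstract can be illustrated in the simplest possible setting. The sketch below is not taken from the paper: it assumes an orthonormal design ($X^\top X = I$) and the objective $\tfrac12\|y - X\beta\|^2 + \lambda\|\beta\|_1$, in which case the Lasso reduces to soft-thresholding of $z = X^\top y$ and the Least-Squares refit on the selected support simply restores the unshrunk coefficients. The function names and the numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(z, lam):
    """Lasso solution when X^T X = I, given z = X^T y.

    Assumed objective: 0.5 * ||y - X b||^2 + lam * ||b||_1.
    """
    return [soft_threshold(zj, lam) for zj in z]


def least_squares_refit(z, lasso_coefs):
    """Least-squares refit restricted to the Lasso support.

    With X^T X = I this restores the unshrunk value z_j on every
    selected coordinate and leaves the others at zero.
    """
    return [zj if bj != 0.0 else 0.0 for zj, bj in zip(z, lasso_coefs)]


z = [3.0, -2.5, 0.4, -0.2]  # hypothetical values of X^T y
lam = 1.0                   # hypothetical tuning parameter

lasso = lasso_orthonormal(z, lam)       # large coefficients are shrunk by lam
refit = least_squares_refit(z, lasso)   # shrinkage removed on the support
print(lasso, refit)
```

In this special case the refitted coefficients always keep the signs of the Lasso solution, so the sketch shows only the shrinkage effect. The sign flips discussed in the abstract can occur only when the design is not orthogonal.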
• ### On the benefits of output sparsity for multi-label classification(1703.04697)

March 14, 2017 math.ST, stat.TH, cs.LG, stat.ML
The multi-label classification framework, where each observation can be associated with a set of labels, has attracted a tremendous amount of attention in recent years. Modern multi-label problems are typically large-scale in terms of the number of observations, features and labels, and the number of labels can even be comparable to the number of observations. In this context, different remedies have been proposed to overcome the curse of dimensionality. In this work, we aim at exploiting the output sparsity by introducing a new loss, called the sparse weighted Hamming loss. This proposed loss can be seen as a weighted version of classical ones, where active and inactive labels are weighted separately. Leveraging the influence of sparsity in the loss function, we provide improved generalization bounds for the empirical risk minimizer, a desirable property for large-scale problems. For this new loss, we derive rates of convergence linear in the underlying output sparsity rather than linear in the number of labels. In practice, minimizing the associated risk can be performed efficiently by using convex surrogates and modern convex optimization algorithms. We provide experiments on various real-world datasets demonstrating the pertinence of our approach when compared to non-weighted techniques.
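The idea of weighting active and inactive labels separately can be sketched as follows. The abstract does not state the weights used by the sparse weighted Hamming loss, so `w_active` and `w_inactive` below are placeholder parameters, and the function is only a generic Hamming-type loss of that shape, not the paper's definition.

```python
def weighted_hamming_loss(y_true, y_pred, w_active, w_inactive):
    """Hamming-type loss with separate weights for the two kinds of error.

    y_true, y_pred: sequences of 0/1 label indicators of equal length.
    w_active:   weight of an error on an active label (true 1, predicted 0).
    w_inactive: weight of an error on an inactive label (true 0, predicted 1).
    Both weights are hypothetical placeholders.
    """
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return w_active * missed + w_inactive * false_alarms


# Sparse output: 2 active labels out of 8.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]

# Classical Hamming loss: every error weighted by 1 / (number of labels).
plain = weighted_hamming_loss(y_true, y_pred, 1 / 8, 1 / 8)

# Illustrative weighting: errors on active labels normalized by the number
# of active labels, errors on inactive labels by the number of labels.
weighted = weighted_hamming_loss(y_true, y_pred, 1 / 2, 1 / 8)
print(plain, weighted)
```

With sparse outputs, a missed active label counts for much more under the second weighting than under the classical loss, which is the kind of asymmetry the abstract describes.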
• ### On the Prediction Performance of the Lasso(1402.1700)

Nov. 8, 2016 math.ST, stat.TH, stat.ML
Although the Lasso has been extensively studied, the relationship between its prediction performance and the correlations of the covariates is not fully understood. In this paper, we give new insights into this relationship in the context of multiple linear regression. We show, in particular, that the incorporation of a simple correlation measure into the tuning parameter can lead to a nearly optimal prediction performance of the Lasso even for highly correlated covariates. However, we also reveal that for moderately correlated covariates, the prediction performance of the Lasso can be mediocre irrespective of the choice of the tuning parameter. We finally show that our results also lead to near-optimal rates for the least-squares estimator with total variation penalty.
• ### How Correlations Influence Lasso Prediction(1204.1605)

July 9, 2012 math.ST, stat.TH
We study how correlations in the design matrix influence Lasso prediction. First, we argue that the higher the correlations are, the smaller the optimal tuning parameter is. This implies in particular that the standard tuning parameters, which do not depend on the design matrix, are not favorable. Furthermore, we argue that Lasso prediction works well for any degree of correlation if suitable tuning parameters are chosen. We study these two subjects theoretically as well as with simulations.
• ### The Smooth-Lasso and other $\ell_1+\ell_2$-penalized methods(1003.4885)

Oct. 7, 2011 math.ST, stat.TH
We consider a linear regression problem in a high dimensional setting where the number of covariates $p$ can be much larger than the sample size $n$. In such a situation, one often assumes sparsity of the regression vector, i.e., the regression vector contains many zero components. We propose a Lasso-type estimator $\hat{\beta}^{Quad}$ (where '$Quad$' stands for quadratic) which is based on two penalty terms. The first one is the $\ell_1$ norm of the regression coefficients used to exploit the sparsity of the regression as done by the Lasso estimator, whereas the second is a quadratic penalty term introduced to capture some additional information on the setting of the problem. We detail two special cases: the Elastic-Net $\hat{\beta}^{EN}$, which deals with sparse problems where correlations between variables may exist; and the Smooth-Lasso $\hat{\beta}^{SL}$, which responds to sparse problems where successive regression coefficients are known to vary slowly (in some situations, this can also be interpreted in terms of correlations between successive variables). From a theoretical point of view, we establish variable selection consistency results and show that $\hat{\beta}^{Quad}$ achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the 'true' regression vector. These results are provided under a weaker assumption on the Gram matrix than the one used by the Lasso. In some situations this guarantees a significant improvement over the Lasso. Furthermore, a simulation study is conducted and shows that the S-Lasso $\hat{\beta}^{SL}$ performs better than known methods such as the Lasso, the Elastic-Net $\hat{\beta}^{EN}$, and the Fused-Lasso with respect to estimation accuracy. This is especially the case when the regression vector is 'smooth', i.e., when the variations between successive coefficients of the unknown parameter of the regression are small. The study also reveals that the theoretical calibration of the tuning parameters and the one based on 10-fold cross-validation yield two S-Lasso solutions with similar performance.
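To make the two special cases concrete, the sketch below evaluates an $\ell_1$ plus quadratic penalty in both forms. The abstract gives no formulas, so the expressions are assumptions: the squared $\ell_2$ norm for the Elastic-Net, and the sum of squared differences between successive coefficients for the Smooth-Lasso. The names `lam` and `mu` for the two tuning parameters are likewise hypothetical.

```python
def l1_norm(beta):
    """Sum of absolute values of the coefficients."""
    return sum(abs(b) for b in beta)


def elastic_net_penalty(beta, lam, mu):
    """lam * ||beta||_1 + mu * ||beta||_2^2 (assumed Elastic-Net form)."""
    return lam * l1_norm(beta) + mu * sum(b * b for b in beta)


def smooth_lasso_penalty(beta, lam, mu):
    """lam * ||beta||_1 + mu * sum_j (beta_j - beta_{j-1})^2.

    Assumed Smooth-Lasso form: the quadratic term penalizes jumps
    between successive coefficients.
    """
    diffs = [beta[j] - beta[j - 1] for j in range(1, len(beta))]
    return lam * l1_norm(beta) + mu * sum(d * d for d in diffs)


# Two hypothetical vectors with the same l1 norm and the same l2 norm.
smooth = [1.0, 1.0, 1.0, 0.0]    # successive coefficients vary slowly
rough = [1.0, -1.0, 1.0, 0.0]    # successive coefficients oscillate

lam, mu = 1.0, 0.5
en_smooth = elastic_net_penalty(smooth, lam, mu)
en_rough = elastic_net_penalty(rough, lam, mu)
sl_smooth = smooth_lasso_penalty(smooth, lam, mu)
sl_rough = smooth_lasso_penalty(rough, lam, mu)
print(en_smooth, en_rough, sl_smooth, sl_rough)
```

Under these assumed forms, the Elastic-Net penalty cannot distinguish the two vectors, while the Smooth-Lasso penalty is smaller for the slowly varying one.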
• ### Sparse Conformal Predictors(0902.1970)

Feb. 11, 2009 math.ST, stat.TH, stat.ML
Conformal predictors, introduced by Vovk et al. (2005), serve to build prediction intervals by exploiting a notion of conformity of the new data point with previously observed data. In the present paper, we propose a novel method for constructing prediction intervals for the response variable in multivariate linear models. The main emphasis is on sparse linear models, where only a few of the covariates have a significant influence on the response variable even if their number is very large. Our approach is based on combining the principle of conformal prediction with the $\ell_1$-penalized least squares estimator (LASSO). The resulting confidence set depends on a parameter $\epsilon>0$ and has a coverage probability larger than or equal to $1-\epsilon$. The numerical experiments reported in the paper show that the length of the confidence set is small. Furthermore, as a by-product of the proposed approach, we provide a data-driven procedure for choosing the LASSO penalty. The selection power of the method is illustrated on simulated data.
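The role of the parameter $\epsilon$ can be illustrated with the split (inductive) variant of conformal prediction. This is a deliberately simpler, swapped-in technique and not the LASSO-based construction of the paper. The sketch assumes that a point predictor has already been fitted on separate data and that its residuals on a held-out calibration set are available; all names and numbers are hypothetical.

```python
import math


def split_conformal_interval(calib_residuals, prediction, epsilon):
    """Prediction interval from split conformal prediction.

    calib_residuals: residuals y_i - f(x_i) on a held-out calibration set.
    prediction:      point prediction f(x_new) for the new observation.
    epsilon:         miscoverage level; target coverage is 1 - epsilon.

    The half-width q is the k-th smallest absolute residual, with
    k = ceil((n + 1) * (1 - epsilon)).
    """
    n = len(calib_residuals)
    k = math.ceil((n + 1) * (1 - epsilon))
    if k > n:
        # Too few calibration points for this epsilon: no finite interval.
        return (-math.inf, math.inf)
    q = sorted(abs(r) for r in calib_residuals)[k - 1]
    return (prediction - q, prediction + q)


# Hypothetical calibration residuals and point prediction.
residuals = [0.125, -0.25, 0.5, -0.75, 1.0, -1.5, 2.0]
low, high = split_conformal_interval(residuals, prediction=2.0, epsilon=0.25)
print(low, high)
```

When the calibration points and the new observation are exchangeable, an interval built this way covers the new response with probability at least $1-\epsilon$, which mirrors the coverage statement in the abstract.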
• ### Regularization with the Smooth-Lasso procedure(0803.0668)

Oct. 15, 2008 math.ST, stat.TH
We consider the linear regression problem. We propose the S-Lasso procedure to estimate the unknown regression parameters. This estimator enjoys sparsity of the representation while taking into account correlation between successive covariates (or predictors). The study covers the case when $p\gg n$, i.e., the number of covariates is much larger than the number of observations. From a theoretical point of view, for fixed $p$, we establish asymptotic normality and variable selection consistency results for our procedure. When $p\geq n$, we provide variable selection consistency results and show that the S-Lasso achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the oracle vector. It appears that the S-Lasso has nice variable selection properties compared to its challengers. Furthermore, we provide an estimator of the effective degrees of freedom of the S-Lasso estimator. A simulation study shows that the S-Lasso performs better than the Lasso as far as variable selection is concerned, especially when high correlations between successive covariates exist. This procedure also appears to be a good challenger to the Elastic-Net (Zou and Hastie, 2005).