• ### On Lasso refitting strategies(1707.05232)

Nov. 12, 2018 math.ST, stat.TH, stat.AP
A well-know drawback of l_1-penalized estimators is the systematic shrinkage of the large coefficients towards zero. A simple remedy is to treat Lasso as a model-selection procedure and to perform a second refitting step on the selected support. In this work we formalize the notion of refitting and provide oracle bounds for arbitrary refitting procedures of the Lasso solution. One of the most widely used refitting techniques which is based on Least-Squares may bring a problem of interpretability, since the signs of the refitted estimator might be flipped with respect to the original estimator. This problem arises from the fact that the Least-Squares refitting considers only the support of the Lasso solution, avoiding any information about signs or amplitudes. To this end we define a sign consistent refitting as an arbitrary refitting procedure, preserving the signs of the first step Lasso solution and provide Oracle inequalities for such estimators. Finally, we consider special refitting strategies: Bregman Lasso and Boosted Lasso. Bregman Lasso has a fruitful property to converge to the Sign-Least-Squares refitting (Least-Squares with sign constraints), which provides with greater interpretability. We additionally study the Bregman Lasso refitting in the case of orthogonal design, providing with simple intuition behind the proposed method. Boosted Lasso, in contrast, considers information about magnitudes of the first Lasso step and allows to develop better oracle rates for prediction. Finally, we conduct an extensive numerical study to show advantages of one approach over others in different synthetic and semi-real scenarios.
• ### Dual Extrapolation for Faster Lasso Solvers(1802.07481)

Feb. 22, 2018 stat.ML
Convex sparsity-inducing regularizations are ubiquitous in high-dimension machine learning, but their non-differentiability requires the use of iterative solvers. To accelerate such solvers, state-of-the-art approaches consist in reducing the size of the optimization problem at hand. In the context of regression, this can be achieved either by discarding irrelevant features (screening techniques) or by prioritizing features likely to be included in the support of the solution (working set techniques). Duality comes into play at several steps in these techniques. Here, we propose an extrapolation technique starting from a sequence of iterates in the dual that leads to the construction of an improved dual point. This enables a tighter control of optimality as used in stopping criterion, as well as better screening performance of Gap Safe rules. Finally, we propose a working set strategy based on an aggressive use of Gap Safe rules and our new dual point construction, which improves state-of-the-art time performance on Lasso problems.
• ### Gap Safe screening rules for sparsity enforcing penalties(1611.05780)

Dec. 27, 2017 stat.CO, math.OC, cs.LG, stat.ML
In high dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique called screening rules propose to ignore some variables in the optimization leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our technique allows to discard safely more variables than previously considered safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computation) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, Lasso, Sparse-Group Lasso, multi-task Lasso, binary and multinomial logistic regression, etc., we report significant speed-ups compared to previously proposed safe rules on all tested data sets.
• ### Generalized Concomitant Multi-Task Lasso for sparse multimodal regression(1705.09778)

Oct. 18, 2017 stat.AP, math.OC, stat.ML
In high dimension, it is customary to consider Lasso-type estimators to enforce sparsity. For standard Lasso theory to hold, the regularization parameter should be proportional to the noise level, yet the latter is generally unknown in practice. A possible remedy is to consider estimators, such as the Concomitant/Scaled Lasso, which jointly optimize over the regression coefficients as well as over the noise level, making the choice of the regularization independent of the noise level. However, when data from different sources are pooled to increase sample size, or when dealing with multimodal datasets, noise levels typically differ and new dedicated estimators are needed. In this work we provide new statistical and computational solutions to deal with such heteroscedastic regression models, with an emphasis on functional brain imaging with combined magneto- and electroencephalographic (M/EEG) signals. Adopting the formulation of Concomitant Lasso-type estimators, we propose a jointly convex formulation to estimate both the regression coefficients and the (square root of the) noise covariance. When our framework is instantiated to de-correlated noise, it leads to an efficient algorithm whose computational cost is not higher than for the Lasso and Concomitant Lasso, while addressing more complex noise structures. Numerical experiments demonstrate that our estimator yields improved prediction and support identification while correctly estimating the noise (square root) covariance. Results on multimodal neuroimaging problems with M/EEG data are also reported.
• ### A sharp oracle inequality for Graph-Slope(1706.06977)

June 21, 2017 math.ST, stat.TH
Following recent success on the analysis of the Slope estimator, we provide a sharp oracle inequality in term of prediction error for Graph-Slope, a generalization of Slope to signals observed over a graph. In addition to improving upon best results obtained so far for the Total Variation denoiser (also referred to as Graph-Lasso or Generalized Lasso), we propose an efficient algorithm to compute Graph-Slope. The proposed algorithm is obtained by applying the forward-backward method to the dual formulation of the Graph-Slope optimization problem. We also provide experiments showing the interest of the method.
• ### Optimal Two-Step Prediction in Regression(1410.5014)

June 6, 2017 math.ST, stat.TH, stat.ME
High-dimensional prediction typically comprises two steps: variable selection and subsequent least-squares refitting on the selected variables. However, the standard variable selection procedures, such as the lasso, hinge on tuning parameters that need to be calibrated. Cross-validation, the most popular calibration scheme, is computationally costly and lacks finite sample guarantees. In this paper, we introduce an alternative scheme, easy to implement and both computationally and theoretically efficient.
• ### From safe screening rules to working sets for faster Lasso-type solvers(1703.07285)

May 1, 2017 stat.CO, math.OC, cs.LG, stat.ML
Convex sparsity-promoting regularizations are ubiquitous in modern statistical learning. By construction, they yield solutions with few non-zero coefficients, which correspond to saturated constraints in the dual optimization formulation. Working set (WS) strategies are generic optimization techniques that consist in solving simpler problems that only consider a subset of constraints, whose indices form the WS. Working set methods therefore involve two nested iterations: the outer loop corresponds to the definition of the WS and the inner loop calls a solver for the subproblems. For the Lasso estimator a WS is a set of features, while for a Group Lasso it refers to a set of groups. In practice, WS are generally small in this context so the associated feature Gram matrix can fit in memory. Here we show that the Gauss-Southwell rule (a greedy strategy for block coordinate descent techniques) leads to fast solvers in this case. Combined with a working set strategy based on an aggressive use of so-called Gap Safe screening rules, we propose a solver achieving state-of-the-art performance on sparse learning problems. Results are presented on Lasso and multi-task Lasso estimators.
• ### On the benefits of output sparsity for multi-label classification(1703.04697)

March 14, 2017 math.ST, stat.TH, cs.LG, stat.ML
The multi-label classification framework, where each observation can be associated with a set of labels, has generated a tremendous amount of attention over recent years. The modern multi-label problems are typically large-scale in terms of number of observations, features and labels, and the amount of labels can even be comparable with the amount of observations. In this context, different remedies have been proposed to overcome the curse of dimensionality. In this work, we aim at exploiting the output sparsity by introducing a new loss, called the sparse weighted Hamming loss. This proposed loss can be seen as a weighted version of classical ones, where active and inactive labels are weighted separately. Leveraging the influence of sparsity in the loss function, we provide improved generalization bounds for the empirical risk minimizer, a suitable property for large-scale problems. For this new loss, we derive rates of convergence linear in the underlying output-sparsity rather than linear in the number of labels. In practice, minimizing the associated risk can be performed efficiently by using convex surrogates and modern convex optimization algorithms. We provide experiments on various real-world datasets demonstrating the pertinence of our approach when compared to non-weighted techniques.
• ### Characterizing the maximum parameter of the total-variation denoising through the pseudo-inverse of the divergence(1612.03080)

Dec. 8, 2016 stat.ML
We focus on the maximum regularization parameter for anisotropic total-variation denoising. It corresponds to the minimum value of the regularization parameter above which the solution remains constant. While this value is well know for the Lasso, such a critical value has not been investigated in details for the total-variation. Though, it is of importance when tuning the regularization parameter as it allows fixing an upper-bound on the grid for which the optimal parameter is sought. We establish a closed form expression for the one-dimensional case, as well as an upper-bound for the two-dimensional case, that appears reasonably tight in practice. This problem is directly linked to the computation of the pseudo-inverse of the divergence, which can be quickly obtained by performing convolutions in the Fourier domain.
• ### Adapting to unknown noise level in sparse deconvolution(1606.04760)

Oct. 19, 2016 cs.IT, math.IT, math.ST, stat.TH, math.OC
In this paper, we study sparse spike deconvolution over the space of complex-valued measures when the input measure is a finite sum of Dirac masses. We introduce a modified version of the Beurling Lasso (BLasso), a semi-definite program that we refer to as the Concomitant Beurling Lasso (CBLasso). This new procedure estimates the target measure and the unknown noise level simultaneously. Contrary to previous estimators in the literature, theory holds for a tuning parameter that depends only on the sample size, so that it can be used for unknown noise level problems. Consistent noise level estimation is standardly proved. As for Radon measure estimation, theoretical guarantees match the previous state-of-the-art results in Super-Resolution regarding minimax prediction and localization. The proofs are based on a bound on the noise level given by a new tail estimate of the supremum of a stationary non-Gaussian process through the Rice method.
• ### Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression(1606.02702)

June 8, 2016 math.OC, cs.LG, stat.ML
In high dimensional settings, sparse structures are crucial for efficiency, both in term of memory, computation and performance. It is customary to consider $\ell_1$ penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled-Lasso, Square-root Lasso, Concomitant Lasso estimation for instance, and could be of interest for confidence sets or uncertainty quantification. In this work, after illustrating numerical difficulties for the Smoothed Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expansive than the one for the Lasso. We leverage on standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm, combined with safe screening rules to achieve speed efficiency, by eliminating early irrelevant features.
• ### Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions(1606.02421)

June 8, 2016 cs.AI, cs.DC, cs.LG, stat.ML, cs.SY
In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. This general problem finds applications in ranking, distance metric learning and graph inference, among others. We propose new gossip algorithms based on dual averaging which aims at solving such problems both in synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.
• ### GAP Safe Screening Rules for Sparse-Group-Lasso(1602.06225)

Feb. 19, 2016 stat.CO, math.OC, cs.LG, stat.ML
In high dimensional settings, sparse structures are crucial for efficiency, either in term of memory, computation or performance. In some contexts, it is natural to handle more refined structures than pure sparsity, such as for instance group sparsity. Sparse-Group Lasso has recently been introduced in the context of linear regression to enforce sparsity both at the feature level and at the group level. We adapt to the case of Sparse-Group Lasso recent safe screening rules that discard early in the solver irrelevant features/groups. Such rules have led to important speed-ups for a wide range of iterative methods. Thanks to dual gap computations, we provide new safe screening rules for Sparse-Group Lasso and show significant gains in term of computing time for a coordinate descent implementation.
• ### Mind the duality gap: safer rules for the Lasso(1505.03410)

Dec. 3, 2015 stat.CO, math.OC, cs.LG, stat.ML
Screening rules allow to early discard irrelevant variables from the optimization in Lasso problems, or its derivatives, making solvers faster. In this paper, we propose new versions of the so-called $\textit{safe rules}$ for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screening out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.
• ### GAP Safe screening rules for sparse multi-task and multi-class models(1506.03736)

Nov. 18, 2015 stat.CO, math.OC, cs.LG, stat.ML
High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly the rules are said to be \emph{safe}. In this paper we derive new safe rules for generalized linear models regularized with $\ell_1$ and $\ell_1/\ell_2$ norms. The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This allows to discard safely more variables, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed ups on all tested datasets with respect to previous safe rules.
• ### Extending Gossip Algorithms to Distributed Estimation of U-Statistics(1511.05464)

Nov. 17, 2015 cs.DC, stat.CO, cs.LG, stat.ML, cs.SY
Efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of $U$-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet, such data functionals are essential to describe global properties of a statistical population, with important examples including Area Under the Curve, empirical variance, Gini mean difference and within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the $U$-statistic of interest. We establish convergence rate bounds of $O(1/t)$ and $O(\log t / t)$ for the synchronous and asynchronous cases respectively, where $t$ is the number of iterations, with explicit data and network dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence the proposed algorithms surpasses the previously introduced approach.
• ### On debiasing restoration algorithms: applications to total-variation and nonlocal-means(1503.01587)

March 5, 2015 math.ST, stat.TH
Bias in image restoration algorithms can hamper further analysis, typically when the intensities have a physical meaning of interest, e.g., in medical imaging. We propose to suppress a part of the bias -- the method bias -- while leaving unchanged the other unavoidable part -- the model bias. Our debiasing technique can be used for any locally affine estimator including \^a1 regularization, anisotropic total-variation and some nonlocal filters.
• ### Poisson noise reduction with non-local PCA(1206.0338)

April 28, 2014 cs.CV, stat.CO, cs.LG
Photon-limited imaging arises when the number of photons collected by a sensor array is small relative to the number of detector elements. Photon limitations are an important concern for many applications such as spectral imaging, night vision, nuclear medicine, and astronomy. Typically a Poisson distribution is used to model these observations, and the inherent heteroscedasticity of the data combined with standard noise removal methods yields significant artifacts. This paper introduces a novel denoising algorithm for photon-limited images which combines elements of dictionary learning and sparse patch-based representations of images. The method employs both an adaptation of Principal Component Analysis (PCA) for Poisson noise and recently developed sparsity-regularized convex optimization algorithms for photon-limited images. A comprehensive empirical evaluation of the proposed method helps characterize the performance of this approach relative to other state-of-the-art denoising methods. The results reveal that, despite its conceptual simplicity, Poisson PCA-based denoising appears to be highly competitive in very low light regimes.
• ### A two-stage denoising filter: the preprocessed Yaroslavsky filter(1208.6516)

Aug. 31, 2012 cs.CV, math.ST, stat.TH
This paper describes a simple image noise removal method which combines a preprocessing step with the Yaroslavsky filter for strong numerical, visual, and theoretical performance on a broad class of images. The framework developed is a two-stage approach. In the first stage the image is filtered with a classical denoising method (e.g., wavelet or curvelet thresholding). In the second stage a modification of the Yaroslavsky filter is performed on the original noisy image, where the weights of the filters are governed by pixel similarities in the denoised image from the first stage. Similar prefiltering ideas have proved effective previously in the literature, and this paper provides theoretical guarantees and important insight into why prefiltering can be effective. Empirically, this simple approach achieves very good performance for cartoon images, and can be computed much more quickly than current patch-based denoising algorithms.
• ### Oracle inequalities and minimax rates for non-local means and related adaptive kernel-based methods(1112.4434)

April 26, 2012 cs.IT, math.IT, cs.CV, math.ST, stat.TH
This paper describes a novel theoretical characterization of the performance of non-local means (NLM) for noise removal. NLM has proven effective in a variety of empirical studies, but little is understood fundamentally about how it performs relative to classical methods based on wavelets or how various parameters (e.g., patch size) should be chosen. For cartoon images and images which may contain thin features and regular textures, the error decay rates of NLM are derived and compared with those of linear filtering, oracle estimators, variable-bandwidth kernel methods, Yaroslavsky's filter and wavelet thresholding estimators. The trade-off between global and local search for matching patches is examined, and the bias reduction associated with the local polynomial regression version of NLM is analyzed. The theoretical results are validated via simulations for 2D images corrupted by additive white Gaussian noise.