
A wellknow drawback of l_1penalized estimators is the systematic shrinkage
of the large coefficients towards zero. A simple remedy is to treat Lasso as a
modelselection procedure and to perform a second refitting step on the
selected support. In this work we formalize the notion of refitting and provide
oracle bounds for arbitrary refitting procedures of the Lasso solution. One of
the most widely used refitting techniques which is based on LeastSquares may
bring a problem of interpretability, since the signs of the refitted estimator
might be flipped with respect to the original estimator. This problem arises
from the fact that the LeastSquares refitting considers only the support of
the Lasso solution, avoiding any information about signs or amplitudes. To this
end we define a sign consistent refitting as an arbitrary refitting procedure,
preserving the signs of the first step Lasso solution and provide Oracle
inequalities for such estimators. Finally, we consider special refitting
strategies: Bregman Lasso and Boosted Lasso. Bregman Lasso has a fruitful
property to converge to the SignLeastSquares refitting (LeastSquares with
sign constraints), which provides with greater interpretability. We
additionally study the Bregman Lasso refitting in the case of orthogonal
design, providing with simple intuition behind the proposed method. Boosted
Lasso, in contrast, considers information about magnitudes of the first Lasso
step and allows to develop better oracle rates for prediction. Finally, we
conduct an extensive numerical study to show advantages of one approach over
others in different synthetic and semireal scenarios.

Convex sparsityinducing regularizations are ubiquitous in highdimension
machine learning, but their nondifferentiability requires the use of iterative
solvers. To accelerate such solvers, stateoftheart approaches consist in
reducing the size of the optimization problem at hand. In the context of
regression, this can be achieved either by discarding irrelevant features
(screening techniques) or by prioritizing features likely to be included in the
support of the solution (working set techniques). Duality comes into play at
several steps in these techniques. Here, we propose an extrapolation technique
starting from a sequence of iterates in the dual that leads to the construction
of an improved dual point. This enables a tighter control of optimality as used
in stopping criterion, as well as better screening performance of Gap Safe
rules. Finally, we propose a working set strategy based on an aggressive use of
Gap Safe rules and our new dual point construction, which improves
stateoftheart time performance on Lasso problems.

In high dimensional regression settings, sparsity enforcing penalties have
proved useful to regularize the datafitting term. A recently introduced
technique called screening rules propose to ignore some variables in the
optimization leveraging the expected sparsity of the solutions and consequently
leading to faster solvers. When the procedure is guaranteed not to discard
variables wrongly the rules are said to be safe. In this work, we propose a
unifying framework for generalized linear models regularized with standard
sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our
technique allows to discard safely more variables than previously considered
safe rules, particularly for low regularization parameters. Our proposed Gap
Safe rules (so called because they rely on duality gap computation) can cope
with any iterative solver but are particularly well suited to (block)
coordinate descent methods. Applied to many standard learning tasks, Lasso,
SparseGroup Lasso, multitask Lasso, binary and multinomial logistic
regression, etc., we report significant speedups compared to previously
proposed safe rules on all tested data sets.

In high dimension, it is customary to consider Lassotype estimators to
enforce sparsity. For standard Lasso theory to hold, the regularization
parameter should be proportional to the noise level, yet the latter is
generally unknown in practice. A possible remedy is to consider estimators,
such as the Concomitant/Scaled Lasso, which jointly optimize over the
regression coefficients as well as over the noise level, making the choice of
the regularization independent of the noise level. However, when data from
different sources are pooled to increase sample size, or when dealing with
multimodal datasets, noise levels typically differ and new dedicated estimators
are needed. In this work we provide new statistical and computational solutions
to deal with such heteroscedastic regression models, with an emphasis on
functional brain imaging with combined magneto and electroencephalographic
(M/EEG) signals. Adopting the formulation of Concomitant Lassotype estimators,
we propose a jointly convex formulation to estimate both the regression
coefficients and the (square root of the) noise covariance. When our framework
is instantiated to decorrelated noise, it leads to an efficient algorithm
whose computational cost is not higher than for the Lasso and Concomitant
Lasso, while addressing more complex noise structures. Numerical experiments
demonstrate that our estimator yields improved prediction and support
identification while correctly estimating the noise (square root) covariance.
Results on multimodal neuroimaging problems with M/EEG data are also reported.

Following recent success on the analysis of the Slope estimator, we provide a
sharp oracle inequality in term of prediction error for GraphSlope, a
generalization of Slope to signals observed over a graph. In addition to
improving upon best results obtained so far for the Total Variation denoiser
(also referred to as GraphLasso or Generalized Lasso), we propose an efficient
algorithm to compute GraphSlope. The proposed algorithm is obtained by
applying the forwardbackward method to the dual formulation of the GraphSlope
optimization problem. We also provide experiments showing the interest of the
method.

Highdimensional prediction typically comprises two steps: variable selection
and subsequent leastsquares refitting on the selected variables. However, the
standard variable selection procedures, such as the lasso, hinge on tuning
parameters that need to be calibrated. Crossvalidation, the most popular
calibration scheme, is computationally costly and lacks finite sample
guarantees. In this paper, we introduce an alternative scheme, easy to
implement and both computationally and theoretically efficient.

Convex sparsitypromoting regularizations are ubiquitous in modern
statistical learning. By construction, they yield solutions with few nonzero
coefficients, which correspond to saturated constraints in the dual
optimization formulation. Working set (WS) strategies are generic optimization
techniques that consist in solving simpler problems that only consider a subset
of constraints, whose indices form the WS. Working set methods therefore
involve two nested iterations: the outer loop corresponds to the definition of
the WS and the inner loop calls a solver for the subproblems. For the Lasso
estimator a WS is a set of features, while for a Group Lasso it refers to a set
of groups. In practice, WS are generally small in this context so the
associated feature Gram matrix can fit in memory. Here we show that the
GaussSouthwell rule (a greedy strategy for block coordinate descent
techniques) leads to fast solvers in this case. Combined with a working set
strategy based on an aggressive use of socalled Gap Safe screening rules, we
propose a solver achieving stateoftheart performance on sparse learning
problems. Results are presented on Lasso and multitask Lasso estimators.

The multilabel classification framework, where each observation can be
associated with a set of labels, has generated a tremendous amount of attention
over recent years. The modern multilabel problems are typically largescale in
terms of number of observations, features and labels, and the amount of labels
can even be comparable with the amount of observations. In this context,
different remedies have been proposed to overcome the curse of dimensionality.
In this work, we aim at exploiting the output sparsity by introducing a new
loss, called the sparse weighted Hamming loss. This proposed loss can be seen
as a weighted version of classical ones, where active and inactive labels are
weighted separately. Leveraging the influence of sparsity in the loss function,
we provide improved generalization bounds for the empirical risk minimizer, a
suitable property for largescale problems. For this new loss, we derive rates
of convergence linear in the underlying outputsparsity rather than linear in
the number of labels. In practice, minimizing the associated risk can be
performed efficiently by using convex surrogates and modern convex optimization
algorithms. We provide experiments on various realworld datasets demonstrating
the pertinence of our approach when compared to nonweighted techniques.

We focus on the maximum regularization parameter for anisotropic
totalvariation denoising. It corresponds to the minimum value of the
regularization parameter above which the solution remains constant. While this
value is well know for the Lasso, such a critical value has not been
investigated in details for the totalvariation. Though, it is of importance
when tuning the regularization parameter as it allows fixing an upperbound on
the grid for which the optimal parameter is sought. We establish a closed form
expression for the onedimensional case, as well as an upperbound for the
twodimensional case, that appears reasonably tight in practice. This problem
is directly linked to the computation of the pseudoinverse of the divergence,
which can be quickly obtained by performing convolutions in the Fourier domain.

In this paper, we study sparse spike deconvolution over the space of
complexvalued measures when the input measure is a finite sum of Dirac masses.
We introduce a modified version of the Beurling Lasso (BLasso), a semidefinite
program that we refer to as the Concomitant Beurling Lasso (CBLasso). This new
procedure estimates the target measure and the unknown noise level
simultaneously. Contrary to previous estimators in the literature, theory holds
for a tuning parameter that depends only on the sample size, so that it can be
used for unknown noise level problems. Consistent noise level estimation is
standardly proved. As for Radon measure estimation, theoretical guarantees
match the previous stateoftheart results in SuperResolution regarding
minimax prediction and localization. The proofs are based on a bound on the
noise level given by a new tail estimate of the supremum of a stationary
nonGaussian process through the Rice method.

In high dimensional settings, sparse structures are crucial for efficiency,
both in term of memory, computation and performance. It is customary to
consider $\ell_1$ penalty to enforce sparsity in such scenarios. Sparsity
enforcing methods, the Lasso being a canonical example, are popular candidates
to address high dimension. For efficiency, they rely on tuning a parameter
trading data fitting versus sparsity. For the Lasso theory to hold this tuning
parameter should be proportional to the noise level, yet the latter is often
unknown in practice. A possible remedy is to jointly optimize over the
regression parameter as well as over the noise level. This has been considered
under several names in the literature: ScaledLasso, Squareroot Lasso,
Concomitant Lasso estimation for instance, and could be of interest for
confidence sets or uncertainty quantification. In this work, after illustrating
numerical difficulties for the Smoothed Concomitant Lasso formulation, we
propose a modification we coined Smoothed Concomitant Lasso, aimed at
increasing numerical stability. We propose an efficient and accurate solver
leading to a computational cost no more expansive than the one for the Lasso.
We leverage on standard ingredients behind the success of fast Lasso solvers: a
coordinate descent algorithm, combined with safe screening rules to achieve
speed efficiency, by eliminating early irrelevant features.

In decentralized networks (of sensors, connected objects, etc.), there is an
important need for efficient algorithms to optimize a global cost function, for
instance to learn a global model from the local data collected by each
computing unit. In this paper, we address the problem of decentralized
minimization of pairwise functions of the data points, where these points are
distributed over the nodes of a graph defining the communication topology of
the network. This general problem finds applications in ranking, distance
metric learning and graph inference, among others. We propose new gossip
algorithms based on dual averaging which aims at solving such problems both in
synchronous and asynchronous settings. The proposed framework is flexible
enough to deal with constrained and regularized variants of the optimization
problem. Our theoretical analysis reveals that the proposed algorithms preserve
the convergence rate of centralized dual averaging up to an additive bias term.
We present numerical simulations on Area Under the ROC Curve (AUC) maximization
and metric learning problems which illustrate the practical interest of our
approach.

In high dimensional settings, sparse structures are crucial for efficiency,
either in term of memory, computation or performance. In some contexts, it is
natural to handle more refined structures than pure sparsity, such as for
instance group sparsity. SparseGroup Lasso has recently been introduced in the
context of linear regression to enforce sparsity both at the feature level and
at the group level. We adapt to the case of SparseGroup Lasso recent safe
screening rules that discard early in the solver irrelevant features/groups.
Such rules have led to important speedups for a wide range of iterative
methods. Thanks to dual gap computations, we provide new safe screening rules
for SparseGroup Lasso and show significant gains in term of computing time for
a coordinate descent implementation.

Screening rules allow to early discard irrelevant variables from the
optimization in Lasso problems, or its derivatives, making solvers faster. In
this paper, we propose new versions of the socalled $\textit{safe rules}$ for
the Lasso. Based on duality gap considerations, our new rules create safe test
regions whose diameters converge to zero, provided that one relies on a
converging solver. This property helps screening out more variables, for a
wider range of regularization parameter values. In addition to faster
convergence, we prove that we correctly identify the active sets (supports) of
the solutions in finite time. While our proposed strategy can cope with any
solver, its performance is demonstrated using a coordinate descent algorithm
particularly adapted to machine learning use cases. Significant computing time
reductions are obtained with respect to previous safe rules.

High dimensional regression benefits from sparsity promoting regularizations.
Screening rules leverage the known sparsity of the solution by ignoring some
variables in the optimization, hence speeding up solvers. When the procedure is
proven not to discard features wrongly the rules are said to be \emph{safe}. In
this paper we derive new safe rules for generalized linear models regularized
with $\ell_1$ and $\ell_1/\ell_2$ norms. The rules are based on duality gap
computations and spherical safe regions whose diameters converge to zero. This
allows to discard safely more variables, in particular for low regularization
parameters. The GAP Safe rule can cope with any iterative solver and we
illustrate its performance on coordinate descent for multitask Lasso, binary
and multinomial logistic regression, demonstrating significant speed ups on all
tested datasets with respect to previous safe rules.

Efficient and robust algorithms for decentralized estimation in networks are
essential to many distributed systems. Whereas distributed estimation of sample
mean statistics has been the subject of a good deal of attention, computation
of $U$statistics, relying on more expensive averaging over pairs of
observations, is a less investigated area. Yet, such data functionals are
essential to describe global properties of a statistical population, with
important examples including Area Under the Curve, empirical variance, Gini
mean difference and withincluster point scatter. This paper proposes new
synchronous and asynchronous randomized gossip algorithms which simultaneously
propagate data across the network and maintain local estimates of the
$U$statistic of interest. We establish convergence rate bounds of $O(1/t)$ and
$O(\log t / t)$ for the synchronous and asynchronous cases respectively, where
$t$ is the number of iterations, with explicit data and network dependent
terms. Beyond favorable comparisons in terms of rate analysis, numerical
experiments provide empirical evidence the proposed algorithms surpasses the
previously introduced approach.

Bias in image restoration algorithms can hamper further analysis, typically
when the intensities have a physical meaning of interest, e.g., in medical
imaging. We propose to suppress a part of the bias  the method bias  while
leaving unchanged the other unavoidable part  the model bias. Our debiasing
technique can be used for any locally affine estimator including \^a1
regularization, anisotropic totalvariation and some nonlocal filters.

Photonlimited imaging arises when the number of photons collected by a
sensor array is small relative to the number of detector elements. Photon
limitations are an important concern for many applications such as spectral
imaging, night vision, nuclear medicine, and astronomy. Typically a Poisson
distribution is used to model these observations, and the inherent
heteroscedasticity of the data combined with standard noise removal methods
yields significant artifacts. This paper introduces a novel denoising algorithm
for photonlimited images which combines elements of dictionary learning and
sparse patchbased representations of images. The method employs both an
adaptation of Principal Component Analysis (PCA) for Poisson noise and recently
developed sparsityregularized convex optimization algorithms for
photonlimited images. A comprehensive empirical evaluation of the proposed
method helps characterize the performance of this approach relative to other
stateoftheart denoising methods. The results reveal that, despite its
conceptual simplicity, Poisson PCAbased denoising appears to be highly
competitive in very low light regimes.

This paper describes a simple image noise removal method which combines a
preprocessing step with the Yaroslavsky filter for strong numerical, visual,
and theoretical performance on a broad class of images. The framework developed
is a twostage approach. In the first stage the image is filtered with a
classical denoising method (e.g., wavelet or curvelet thresholding). In the
second stage a modification of the Yaroslavsky filter is performed on the
original noisy image, where the weights of the filters are governed by pixel
similarities in the denoised image from the first stage. Similar prefiltering
ideas have proved effective previously in the literature, and this paper
provides theoretical guarantees and important insight into why prefiltering can
be effective. Empirically, this simple approach achieves very good performance
for cartoon images, and can be computed much more quickly than current
patchbased denoising algorithms.

This paper describes a novel theoretical characterization of the performance
of nonlocal means (NLM) for noise removal. NLM has proven effective in a
variety of empirical studies, but little is understood fundamentally about how
it performs relative to classical methods based on wavelets or how various
parameters (e.g., patch size) should be chosen. For cartoon images and images
which may contain thin features and regular textures, the error decay rates of
NLM are derived and compared with those of linear filtering, oracle estimators,
variablebandwidth kernel methods, Yaroslavsky's filter and wavelet
thresholding estimators. The tradeoff between global and local search for
matching patches is examined, and the bias reduction associated with the local
polynomial regression version of NLM is analyzed. The theoretical results are
validated via simulations for 2D images corrupted by additive white Gaussian
noise.