
We propose a conditional gradient framework for a composite convex
minimization template with broad applications. Our approach combines the
notions of smoothing and homotopy under the CGM framework, and provably
achieves the optimal $\mathcal{O}(1/\sqrt{k})$ convergence rate. We demonstrate
that the same rate holds if the linear subproblems are solved approximately
with additive or multiplicative error. Specific applications of the framework
include the nonsmooth minimization, semidefinite programming, and minimization
with linear inclusion constraints over a compact domain. We provide numerical
evidence to demonstrate the benefits of the new framework.

We propose new restarting strategies for the accelerated coordinate descent
method. Our main contribution is to show that for a well chosen sequence of
restarting times, the restarted method has a nearly geometric rate of
convergence. A major feature of the method is that it can take profit of the
local quadratic error bound of the objective function without knowing the
actual value of the error bound. We also show that under the more restrictive
assumption that the objective function is strongly convex, any fixed restart
period leads to a geometric rate of convergence. Finally, we illustrate the
properties of the algorithm on a regularized logistic regression problem and on
a Lasso problem.

We propose a new firstorder primaldual optimization framework for a convex
optimization template with broad applications. Our optimization algorithms
feature optimal convergence guarantees under a variety of common structure
assumptions on the problem template. Our analysis relies on a novel combination
of three classic ideas applied to the primaldual gap function: smoothing,
acceleration, and homotopy. The algorithms due to the new approach achieve the
best known convergence rate results, in particular when the template consists
of only nonsmooth functions. We also outline a restart strategy for the
acceleration to significantly enhance the practical performance. We demonstrate
relations with the augmented Lagrangian method and show how to exploit the
strongly convex objectives with rigorous convergence rate guarantees. We
provide numerical evidence with two examples and illustrate that the new
methods can outperform the stateoftheart, including ChambollePock, and the
alternating direction methodofmultipliers algorithms.

This paper introduces a coordinate descent version of the V\~uCondat
algorithm. By coordinate descent, we mean that only a subset of the coordinates
of the primal and dual iterates is updated at each iteration, the other
coordinates being maintained to their past value. Our method allows us to solve
optimization problems with a combination of differentiable functions,
constraints as well as nonseparable and nondifferentiable regularizers. We
show that the sequences generated by our algorithm converge to a saddle point
of the problem at stake, for a wider range of parameter values than previous
methods. In particular, the condition on the stepsizes depends on the
coordinatewise Lipschitz constant of the differentiable function's gradient,
which is a major feature allowing classical coordinate descent to perform so
well when it is applicable. We then prove a sublinear rate of convergence in
general and a linear rate of convergence if the objective enjoys strong
convexity properties. We illustrate the performances of the algorithm on a
totalvariation regularized least squares regression problem and on large scale
support vector machine problems.

In high dimensional regression settings, sparsity enforcing penalties have
proved useful to regularize the datafitting term. A recently introduced
technique called screening rules propose to ignore some variables in the
optimization leveraging the expected sparsity of the solutions and consequently
leading to faster solvers. When the procedure is guaranteed not to discard
variables wrongly the rules are said to be safe. In this work, we propose a
unifying framework for generalized linear models regularized with standard
sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our
technique allows to discard safely more variables than previously considered
safe rules, particularly for low regularization parameters. Our proposed Gap
Safe rules (so called because they rely on duality gap computation) can cope
with any iterative solver but are particularly well suited to (block)
coordinate descent methods. Applied to many standard learning tasks, Lasso,
SparseGroup Lasso, multitask Lasso, binary and multinomial logistic
regression, etc., we report significant speedups compared to previously
proposed safe rules on all tested data sets.

In high dimension, it is customary to consider Lassotype estimators to
enforce sparsity. For standard Lasso theory to hold, the regularization
parameter should be proportional to the noise level, yet the latter is
generally unknown in practice. A possible remedy is to consider estimators,
such as the Concomitant/Scaled Lasso, which jointly optimize over the
regression coefficients as well as over the noise level, making the choice of
the regularization independent of the noise level. However, when data from
different sources are pooled to increase sample size, or when dealing with
multimodal datasets, noise levels typically differ and new dedicated estimators
are needed. In this work we provide new statistical and computational solutions
to deal with such heteroscedastic regression models, with an emphasis on
functional brain imaging with combined magneto and electroencephalographic
(M/EEG) signals. Adopting the formulation of Concomitant Lassotype estimators,
we propose a jointly convex formulation to estimate both the regression
coefficients and the (square root of the) noise covariance. When our framework
is instantiated to decorrelated noise, it leads to an efficient algorithm
whose computational cost is not higher than for the Lasso and Concomitant
Lasso, while addressing more complex noise structures. Numerical experiments
demonstrate that our estimator yields improved prediction and support
identification while correctly estimating the noise (square root) covariance.
Results on multimodal neuroimaging problems with M/EEG data are also reported.

By analyzing accelerated proximal gradient methods under a local quadratic
growth condition, we show that restarting these algorithms at any frequency
gives a globally linearly convergent algorithm. This result was previously
known only for long enough frequencies. Then, as the rate of convergence
depends on the match between the frequency and the quadratic error bound, we
design a scheme to automatically adapt the frequency of restart from the
observed decrease of the norm of the gradient mapping. Our algorithm has a
better theoretical bound than previously proposed methods for the adaptation to
the quadratic error bound of the objective. We illustrate the efficiency of the
algorithm on a Lasso problem and on a regularized logistic regression problem.

We introduce and analyze an algorithm for the minimization of convex
functions that are the sum of differentiable terms and proximable terms
composed with linear operators. The method builds upon the recently developed
smoothed gap technique. In addition to a precise convergence rate result, valid
even in the presence of linear inclusion constraints, this new method allows an
explicit treatment of the gradient of differentiable functions and can be
enhanced with linesearch. We also study the consequences of restarting the
acceleration of the algorithm at a given frequency. These new features are not
classical for primaldual methods and allow us to solve difficult largescale
convex optimization problems. We numerically illustrate the superior
performance of the algorithm on basis pursuit, TVregularized least squares
regression and L1 regression problems against the stateoftheart.

We propose new restarting strategies for accelerated gradient and accelerated
coordinate descent methods. Our main contribution is to show that the restarted
method has a geometric rate of convergence for any restarting frequency, and so
it allows us to take profit of restarting even when we do not know the strong
convexity coefficient. The scheme can be combined with adaptive restarting,
leading to the first provable convergence for adaptive restarting schemes with
accelerated gradient methods. Finally, we illustrate the properties of the
algorithm on a regularized logistic regression problem and on a Lasso problem.

In high dimensional settings, sparse structures are crucial for efficiency,
both in term of memory, computation and performance. It is customary to
consider $\ell_1$ penalty to enforce sparsity in such scenarios. Sparsity
enforcing methods, the Lasso being a canonical example, are popular candidates
to address high dimension. For efficiency, they rely on tuning a parameter
trading data fitting versus sparsity. For the Lasso theory to hold this tuning
parameter should be proportional to the noise level, yet the latter is often
unknown in practice. A possible remedy is to jointly optimize over the
regression parameter as well as over the noise level. This has been considered
under several names in the literature: ScaledLasso, Squareroot Lasso,
Concomitant Lasso estimation for instance, and could be of interest for
confidence sets or uncertainty quantification. In this work, after illustrating
numerical difficulties for the Smoothed Concomitant Lasso formulation, we
propose a modification we coined Smoothed Concomitant Lasso, aimed at
increasing numerical stability. We propose an efficient and accurate solver
leading to a computational cost no more expansive than the one for the Lasso.
We leverage on standard ingredients behind the success of fast Lasso solvers: a
coordinate descent algorithm, combined with safe screening rules to achieve
speed efficiency, by eliminating early irrelevant features.

In high dimensional settings, sparse structures are crucial for efficiency,
either in term of memory, computation or performance. In some contexts, it is
natural to handle more refined structures than pure sparsity, such as for
instance group sparsity. SparseGroup Lasso has recently been introduced in the
context of linear regression to enforce sparsity both at the feature level and
at the group level. We adapt to the case of SparseGroup Lasso recent safe
screening rules that discard early in the solver irrelevant features/groups.
Such rules have led to important speedups for a wide range of iterative
methods. Thanks to dual gap computations, we provide new safe screening rules
for SparseGroup Lasso and show significant gains in term of computing time for
a coordinate descent implementation.

Screening rules allow to early discard irrelevant variables from the
optimization in Lasso problems, or its derivatives, making solvers faster. In
this paper, we propose new versions of the socalled $\textit{safe rules}$ for
the Lasso. Based on duality gap considerations, our new rules create safe test
regions whose diameters converge to zero, provided that one relies on a
converging solver. This property helps screening out more variables, for a
wider range of regularization parameter values. In addition to faster
convergence, we prove that we correctly identify the active sets (supports) of
the solutions in finite time. While our proposed strategy can cope with any
solver, its performance is demonstrated using a coordinate descent algorithm
particularly adapted to machine learning use cases. Significant computing time
reductions are obtained with respect to previous safe rules.

High dimensional regression benefits from sparsity promoting regularizations.
Screening rules leverage the known sparsity of the solution by ignoring some
variables in the optimization, hence speeding up solvers. When the procedure is
proven not to discard features wrongly the rules are said to be \emph{safe}. In
this paper we derive new safe rules for generalized linear models regularized
with $\ell_1$ and $\ell_1/\ell_2$ norms. The rules are based on duality gap
computations and spherical safe regions whose diameters converge to zero. This
allows to discard safely more variables, in particular for low regularization
parameters. The GAP Safe rule can cope with any iterative solver and we
illustrate its performance on coordinate descent for multitask Lasso, binary
and multinomial logistic regression, demonstrating significant speed ups on all
tested datasets with respect to previous safe rules.

We propose a new algorithm for minimizing regularized empirical loss:
Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each
iteration we update a random subset of the dual variables. However, unlike
existing methods such as stochastic dual coordinate ascent, SDNA is capable of
utilizing all curvature information contained in the examples, which leads to
striking improvements in both theory and practice  sometimes by orders of
magnitude. In the special case when an L2regularizer is used in the primal,
the dual problem is a concave quadratic maximization problem plus a separable
term. In this regime, SDNA in each step solves a proximal subproblem involving
a random principal submatrix of the Hessian of the quadratic function; whence
the name of the method. If, in addition, the loss functions are quadratic, our
method can be interpreted as a novel variant of the recently introduced
Iterative Hessian Sketch.

We propose an efficient distributed randomized coordinate descent method for
minimizing regularized nonstrongly convex loss functions. The method attains
the optimal $O(1/k^2)$ convergence rate, where $k$ is the iteration counter.
The core of the work is the theoretical study of stepsize parameters. We have
implemented the method on Archer  the largest supercomputer in the UK  and
show that the method is capable of solving a (synthetic) LASSO optimization
problem with 50 billion variables.

We propose a new stochastic coordinate descent method for minimizing the sum
of convex functions each of which depends on a small number of coordinates
only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal;
this is the first time such a method is proposed. In the special case when the
number of processors is equal to the number of coordinates, the method
converges at the rate $2\bar{\omega}\bar{L} R^2/(k+1)^2 $, where $k$ is the
iteration counter, $\bar{\omega}$ is an average degree of separability of the
loss function, $\bar{L}$ is the average of Lipschitz constants associated with
the coordinates and individual functions in the sum, and $R$ is the distance of
the initial point from the minimizer. We show that the method can be
implemented without the need to perform fulldimensional vector operations,
which is the major bottleneck of existing accelerated coordinate descent
methods. The fact that the method depends on the average degree of
separability, and not on the maximum degree of separability, can be attributed
to the use of new safe large stepsizes, leading to improved expected separable
overapproximation (ESO). These are of independent interest and can be utilized
in all existing parallel stochastic coordinate descent algorithms based on the
concept of ESO.

We design a randomised parallel version of Adaboost based on previous studies
on parallel coordinate descent. The algorithm uses the fact that the logarithm
of the exponential loss is a function with coordinatewise Lipschitz continuous
gradient, in order to define the step lengths. We provide the proof of
convergence for this randomised Adaboost algorithm and a theoretical
parallelisation speedup factor. We finally provide numerical examples on
learning problems of various sizes that show that the algorithm is competitive
with concurrent approaches, especially for large scale problems.

We study the performance of a family of randomized parallel coordinate
descent methods for minimizing the sum of a nonsmooth and separable convex
functions. The problem class includes as a special case L1regularized L1
regression and the minimization of the exponential loss ("AdaBoost problem").
We assume the input data defining the loss function is contained in a sparse
$m\times n$ matrix $A$ with at most $\omega$ nonzeros in each row. Our methods
need $O(n \beta/\tau)$ iterations to find an approximate solution with high
probability, where $\tau$ is the number of processors and $\beta = 1 +
(\omega1)(\tau1)/(n1)$ for the fastest variant. The notation hides
dependence on quantities such as the required accuracy and confidence levels
and the distance of the starting iterate from an optimal point. Since
$\beta/\tau$ is a decreasing function of $\tau$, the method needs fewer
iterations when more processors are used. Certain variants of our algorithms
perform on average only $O(\nnz(A)/n)$ arithmetic operations during a single
iteration per processor and, because $\beta$ decreases when $\omega$ does,
fewer iterations are needed for sparser problems.

The HOTS algorithm uses the hyperlink structure of the web to compute a
vector of scores with which one can rank web pages. The HOTS vector is the
vector of the exponentials of the dual variables of an optimal flow problem
(the "temperature" of each page). The flow represents an optimal distribution
of web surfers on the web graph in the sense of entropy maximization.
In this paper, we prove the convergence of Tomlin's HOTS algorithm. We first
study a simplified version of the algorithm, which is a fixed point scaling
algorithm designed to solve the matrix balancing problem for nonnegative
irreducible matrices. The proof of convergence is general (nonlinear
PerronFrobenius theory) and applies to a family of deformations of HOTS. Then,
we address the effective HOTS algorithm, designed by Tomlin for the ranking of
web pages. The model is a network entropy maximization problem generalizing
matrix balancing. We show that, under mild assumptions, the HOTS algorithm
converges with a linear convergence rate. The proof relies on a uniqueness
property of the fixed point and on the existence of a Lyapunov function.
We also show that the coordinate descent algorithm can be used to find the
ideal and effective HOTS vectors and we compare HOTS and coordinate descent on
fragments of the web graph. Our numerical experiments suggest that the
convergence rate of the HOTS algorithm may deteriorate when the size of the
input increases. We thus give a normalized version of HOTS with an
experimentally better convergence rate.

We give a new link spam detection and PageRank demotion algorithm called
MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of handpicked
trusted and spam pages. We define the MaxRank of a page as the frequency of
visit of this page by a random surfer minimizing an average cost per time unit.
On a given page, the random surfer selects a set of hyperlinks and clicks with
uniform probability on any of these hyperlinks. The cost function penalizes
spam pages and hyperlink removals. The goal is to determine a hyperlink
deletion policy that minimizes this score. The MaxRank is interpreted as a
modified PageRank vector, used to sort web pages instead of the usual PageRank
vector. The bias vector of this ergodic control problem, which is unique up to
an additive constant, is a measure of the "spamicity" of each page, used to
detect spam pages. We give a scalable algorithm for MaxRank computation that
allowed us to perform experimental results on the WEBSPAMUK2007 dataset. We
show that our algorithm outperforms both TrustRank and AntiTrustRank for spam
and nonspam page detection.

In the last years, Google's PageRank optimization problems have been
extensively studied. In that case, the ranking is given by the invariant
measure of a stochastic matrix. In this paper, we consider the more general
situation in which the ranking is determined by the Perron eigenvector of a
nonnegative, but not necessarily stochastic, matrix, in order to cover
Kleinberg's HITS algorithm. We also give some results for Tomlin's HOTS
algorithm. The problem consists then in finding an optimal outlink strategy
subject to design constraints and for a given search engine.
We study the relaxed versions of these problems, which means that we should
accept weighted hyperlinks. We provide an efficient algorithm for the
computation of the matrix of partial derivatives of the criterion, that uses
the low rank property of this matrix. We give a scalable algorithm that couples
gradient and power iterations and gives a local minimum of the Perron vector
optimization problem. We prove convergence by considering it as an approximate
gradient method.
We then show that optimal linkage stategies of HITS and HOTS optimization
problems verify a threshold property. We report numerical results on fragments
of the real web graph for these search engine optimization problems.

We study a general class of PageRank optimization problems which consist in
finding an optimal outlink strategy for a web site subject to design
constraints. We consider both a continuous problem, in which one can choose the
intensity of a link, and a discrete one, in which in each page, there are
obligatory links, facultative links and forbidden links. We show that the
continuous problem, as well as its discrete variant when there are no
constraints coupling different pages, can both be modeled by constrained Markov
decision processes with ergodic reward, in which the webmaster determines the
transition probabilities of websurfers. Although the number of actions turns
out to be exponential, we show that an associated polytope of transition
measures has a concise representation, from which we deduce that the continuous
problem is solvable in polynomial time, and that the same is true for the
discrete problem when there are no coupling constraints. We also provide
efficient algorithms, adapted to very large networks. Then, we investigate the
qualitative features of optimal outlink strategies, and identify in particular
assumptions under which there exists a "master" page to which all controlled
pages should point. We report numerical results on fragments of the real web
graph.