
In this note, we extend the algorithms Extra and subgradientpush to a new
algorithm ExtraPush for consensus optimization with convex differentiable
objective functions over a directed network. When the stationary distribution
of the network can be computed in advance}, we propose a simplified algorithm
called Normalized ExtraPush. Just like Extra, both ExtraPush and Normalized
ExtraPush can iterate with a fixed step size. But unlike Extra, they can take a
columnstochastic mixing matrix, which is not necessarily doubly stochastic.
Therefore, they remove the undirectednetwork restriction of Extra.
Subgradientpush, while also works for directed networks, is slower on the same
type of problem because it must use a sequence of diminishing step sizes.
We present preliminary analysis for ExtraPush under a bounded sequence
assumption. For Normalized ExtraPush, we show that it naturally produces a
bounded, linearly convergent sequence provided that the objective function is
strongly convex.
In our numerical experiments, ExtraPush and Normalized ExtraPush performed
similarly well. They are significantly faster than subgradientpush, even when
we handoptimize the step sizes for the latter.

Convolutional sparse representations are a form of sparse representation with
a structured, translation invariant dictionary. Most convolutional dictionary
learning algorithms to date operate in batch mode, requiring simultaneous
access to all training images during the learning process, which results in
very high memory usage and severely limits the training data that can be used.
Very recently, however, a number of authors have considered the design of
online convolutional dictionary learning algorithms that offer far better
scaling of memory and computational cost with training set size than batch
methods. This paper extends our prior work, improving a number of aspects of
our previous algorithm; proposing an entirely new one, with better performance,
and that supports the inclusion of a spatial mask for learning from incomplete
data; and providing a rigorous theoretical analysis of these methods.

In this paper, we analyze the convergence of the alternating direction method
of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth
objective function, $\phi(x_0,\ldots,x_p,y)$, subject to coupled linear
equality constraints. Our ADMM updates each of the primal variables
$x_0,\ldots,x_p,y$, followed by updating the dual variable. We separate the
variable $y$ from $x_i$'s as it has a special role in our analysis.
The developed convergence guarantee covers a variety of nonconvex functions
such as piecewise linear functions, $\ell_q$ quasinorm, Schatten$q$
quasinorm ($0<q<1$), minimax concave penalty (MCP), and smoothly clipped
absolute deviation (SCAD) penalty. It also allows nonconvex constraints such as
compact manifolds (e.g., spherical, Stiefel, and Grassman manifolds) and linear
complementarity constraints. Also, the $x_0$block can be almost any lower
semicontinuous function.
By applying our analysis, we show, for the first time, that several ADMM
algorithms applied to solve nonconvex models in statistical learning,
optimization on manifold, and matrix decomposition are guaranteed to converge.
Our results provide sufficient conditions for ADMM to converge on (convex or
nonconvex) monotropic programs with three or more blocks, as they are special
cases of our model.
ADMM has been regarded as a variant to the augmented Lagrangian method (ALM).
We present a simple example to illustrate how ADMM converges but ALM diverges
with bounded penalty parameter $\beta$. Indicated by this example and other
analysis in this paper, ADMM might be a better choice than ALM for some
nonconvex \emph{nonsmooth} problems, because ADMM is not only easier to
implement, it is also more likely to converge for the concerned scenarios.

It is known that operator splitting methods based on Forward Backward
Splitting (FBS), DouglasRachford Splitting (DRS), and DavisYin Splitting
(DYS) decompose a difficult optimization problems into simpler subproblems
under proper convexity and smoothness assumptions. In this paper, we identify
an envelope (an objective function) whose gradient descent iteration under a
variable metric coincides with DYS iteration. This result generalizes the
Moreau envelope for proximalpoint iteration and the envelopes for FBS and DRS
iterations identified by Patrinos, Stella, and Themelis.
Based on the new envelope and the StableCenter Manifold Theorem, we further
show that, when FBS or DRS iterations start from random points, they avoid all
strict saddle points with probability one. This results extends the similar
results by Lee et al. from gradient descent to splitting methods.

We design fast numerical methods for HamiltonJacobi equations in density
space (HJD), which arises in optimal transport and mean field games. We
overcome the curseofinfinitedimensionality nature of HJD by proposing a
generalized Hopf formula in density space. The formula transfers optimal
control problems in density space, which are constrained minimizations
supported on both spatial and time variables, to optimization problems over
only spatial variables. This transformation allows us to compute HJD
efficiently via multilevel approaches and coordinate descent methods.

$L_1$ regularization is used for finding sparse solutions to an
underdetermined linear system. As sparse signals are widely expected in remote
sensing, this type of regularization scheme and its extensions have been widely
employed in many remote sensing problems, such as image fusion, target
detection, image superresolution, and others and have led to promising
results. However, solving such sparse reconstruction problems is
computationally expensive and has limitations in its practical use. In this
paper, we proposed a novel efficient algorithm for solving the complexvalued
$L_1$ regularized least squares problem. Taking the highdimensional
tomographic synthetic aperture radar (TomoSAR) as a practical example, we
carried out extensive experiments, both with simulation data and real data, to
demonstrate that the proposed approach can retain the accuracy of second order
methods while dramatically speeding up the processing by one or two orders.
Although we have chosen TomoSAR as the example, the proposed method can be
generally applied to any spectral estimation problems.

This paper addresses consensus optimization problem in a multiagent network,
where all the agents collaboratively find a common minimizer to the sum of
their private functions. Our goal is to develop a decentralized algorithm in
which there is no center agent and each agent only communicates with its
neighbors.
Stateoftheart decentralized algorithms for consensus optimization with
convex objectives use fixed step sizes but involve communications among either
{\em all}, or a {\em random subset}, of the agents at each iteration. Another
approach is to employ a \emph{random walk incremental} strategy, which
activates a succession of nodes and their links, only one node and one link
each time; since the existing algorithms in this approach require diminishing
step sizes to converge to the optimal solution, its convergence is relatively
slow.
In this work, we propose a random walk algorithm that uses a fixed step size
and converges faster than the existing random walk incremental algorithms. It
is also communication efficient. We derive our algorithm by modifying ADMM and
analyze its convergence. We establish linear convergence for least squares
problems, along with a stateoftheart communication complexity. Numerical
experiments verify our analyses.

In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized
Block Coordinate Descent algorithm ($\texttt{A2BCD}$), the first asynchronous
Nesterovaccelerated algorithm that achieves optimal complexity. This parallel
algorithm solves the unconstrained convex minimization problem, using $p$
computing nodes which compute updates to shared solution vectors, in an
asynchronous fashion with no central coordination. Nodes in asynchronous
algorithms do not wait for updates from other nodes before starting a new
iteration, but simply compute updates using the most recent solution
information available. This allows them to complete iterations much faster than
traditional ones, especially at scale, by eliminating the costly
synchronization penalty of traditional algorithms.
We first prove that $\texttt{A2BCD}$ converges linearly to a solution with a
fast accelerated rate that matches the recently proposed $\texttt{NU_ACDM}$, so
long as the maximum delay is not too large. Somewhat surprisingly,
$\texttt{A2BCD}$ pays no complexity penalty for using outdated information. We
then prove lower complexity bounds for randomized coordinate descent methods,
which show that $\texttt{A2BCD}$ (and hence $\texttt{NU_ACDM}$) has optimal
complexity to within a constant factor. We confirm with numerical experiments
that $\texttt{A2BCD}$ outperforms $\texttt{NU_ACDM}$, which is the current
fastest coordinate descent algorithm, even at small scale. We also derive and
analyze a secondorder ordinary differential equation, which is the
continuoustime limit of our algorithm, and prove it converges linearly to a
solution with a similar accelerated rate.

Performance of distributed optimization and learning systems is bottlenecked
by "straggler" nodes and slow communication links, which significantly delay
computation. We propose a distributed optimization framework where the dataset
is "encoded" to have an overcomplete representation with builtin redundancy,
and the straggling nodes in the system are dynamically left out of the
computation at every iteration, whose loss is compensated by the embedded
redundancy. We show that oblivious application of several popular optimization
algorithms on encoded data, including gradient descent, LBFGS, proximal
gradient under data parallelism, and coordinate descent under model
parallelism, converge to either approximate or exact solutions of the original
problem when stragglers are treated as erasures. These convergence results are
deterministic, i.e., they establish sample path convergence for arbitrary
sequences of delay patterns or distributions on the nodes, and are independent
of the tail behavior of the delay distribution. We demonstrate that equiangular
tight frames have desirable properties as encoding matrices, and propose
efficient mechanisms for encoding largescale data. We implement the proposed
technique on Amazon EC2 clusters, and demonstrate its performance over several
learning problems, including matrix factorization, LASSO, ridge regression and
logistic regression, and compare the proposed method with uncoded,
asynchronous, and data replication strategies.

In this paper, we develop algorithms to overcome the curse of dimensionality
in possibly nonconvex statedependent HamiltonJacobi equations (HJ PDEs)
arising from optimal control and differential game problems. The subproblems
are independent and can be implemented in an embarrassingly parallel fashion.
This is an ideal setup for perfect scaling in parallel computing. The algorithm
is proposed to overcome the curse of dimensionality [1, 2] when solving HJ PDE.
The major contribution of the paper is to change an optimization problem over a
space of curves to an optimization problem of a single vector, which goes
beyond [23]. We extend [5, 6, 8], and conjecture a (Laxtype) minimization
principle when the Hamiltonian is convex, as well as a (Hopftype) maximization
principle when the Hamiltonian is nonconvex. The conjectured Hopftype
maximization principle is a generalization of the wellknown Hopf formula [11,
16, 30]. We validated formula under restricted assumptions, and bring our
readers to [57] which validates that our conjectures in a more general setting
after a previous version of our paper. We conjectured the weakest assumption is
a psuedoconvexity assumption similar to [46]. The optimization problems are of
the same dimension as that of the HJ PDE. We suggest a coordinate descent
method for the minimization procedure in the generalized Lax/Hopf formula, and
numerical differentiation is used to compute the derivatives. This method is
preferable since the evaluation of the function value itself requires some
computational effort, especially when we handle high dimensional optimization
problem. The use of multiple initial guesses and a certificate of correctness
are suggested to overcome possibly multiple local extrema since the
optimization process is no longer convex. Our method is expected to have
application in control theory and differential game problems, and elsewhere.

Consensus optimization has received considerable attention in recent years. A
number of decentralized algorithms have been proposed for {convex} consensus
optimization. However, to the behaviors or consensus \emph{nonconvex}
optimization, our understanding is more limited.
When we lose convexity, we cannot hope our algorithms always return global
solutions though they sometimes still do sometimes. Somewhat surprisingly, the
decentralized consensus algorithms, DGD and ProxDGD, retain most other
properties that are known in the convex setting. In particular, when
diminishing (or constant) step sizes are used, we can prove convergence to a
(or a neighborhood of) consensus stationary solution under some regular
assumptions. It is worth noting that ProxDGD can handle nonconvex nonsmooth
functions if their proximal operators can be computed. Such functions include
SCAD and $\ell_q$ quasinorms, $q\in[0,1)$. Similarly, ProxDGD can take the
constraint to a nonconvex set with an easy projection.
To establish these properties, we have to introduce a completely different
line of analysis, as well as modify existing proofs that were used the convex
setting.

Slow running or straggler tasks can significantly reduce computation speed in
distributed computation. Recently, codingtheoryinspired approaches have been
applied to mitigate the effect of straggling, through embedding redundancy in
certain linear computational steps of the optimization algorithm, thus
completing the computation without waiting for the stragglers. In this paper,
we propose an alternate approach where we embed the redundancy directly in the
data itself, and allow the computation to proceed completely oblivious to
encoding. We propose several encoding schemes, and demonstrate that popular
batch algorithms, such as gradient descent and LBFGS, applied in a
codingoblivious manner, deterministically achieve sample path linear
convergence to an approximate solution of the original problem, using an
arbitrarily varying subset of the nodes at each iteration. Moreover, this
approximation can be controlled by the amount of redundancy and the number of
nodes used in each iteration. We provide experimental results demonstrating the
advantage of the approach over uncoded and data replication strategies.

Deep neural networks (DNNs) have shown very promising results for various
image restoration (IR) tasks. However, the design of network architectures
remains a major challenging for achieving further improvements. While most
existing DNNbased methods solve the IR problems by directly mapping low
quality images to desirable highquality images, the observation models
characterizing the image degradation processes have been largely ignored. In
this paper, we first propose a denoisingbased IR algorithm, whose iterative
steps can be computed efficiently. Then, the iterative process is unfolded into
a deep neural network, which is composed of multiple denoisers modules
interleaved with backprojection (BP) modules that ensure the observation
consistencies. A convolutional neural network (CNN) based denoiser that can
exploit the multiscale redundancies of natural images is proposed. As such,
the proposed network not only exploits the powerful denoising ability of DNNs,
but also leverages the prior of the observation model. Through endtoend
training, both the denoisers and the BP modules can be jointly optimized.
Experimental results on several IR tasks, e.g., image denoising,
superresolution and deblurring show that the proposed method can lead to very
competitive and often stateoftheart results on several IR tasks, including
image denoising, deblurring and superresolution.

Despite the vast literature on DRS, there has been very little work analyzing
their behavior under pathologies. Most analyses assume a primal solution
exists, a dual solution exists, and strong duality holds. When these
assumptions are not met, i.e., under pathologies, the theory often breaks down
and the empirical performance may degrade significantly. In this paper, we
establish that DRS only requires strong duality to work, in the sense that
asymptotically iterates are approximately feasible and approximately optimal.

Many optimization algorithms converge to stationary points. When the
underlying problem is nonconvex, they may get trapped at local minimizers and
occasionally stagnate near saddle points. We propose the RunandInspect
Method, which adds an "inspect" phase to existing algorithms that helps escape
from nonglobal stationary points. The inspection samples a set of points in a
radius $R$ around the current point. When a sample point yields a sufficient
decrease in the objective, we move there and resume an existing algorithm. If
no sufficient decrease is found, the current point is called an approximate
$R$local minimizer. We show that an $R$local minimizer is globally optimal,
up to a specific error depending on $R$, if the objective function can be
implicitly decomposed into a smooth convex function plus a restricted function
that is possibly nonconvex, nonsmooth. For highdimensional problems, we
introduce blockwise inspections to overcome the curse of dimensionality while
still maintaining optimality bounds up to a factor equal to the number of
blocks. Our method performs well on a set of artificial and realistic nonconvex
problems by coupling with gradient descent, coordinate descent, EM, and
proxlinear algorithms.

In this paper, we present the proximalproximalgradient method (PPG), a
novel optimization method that is simple to implement and simple to
parallelize. PPG generalizes the proximalgradient method and ADMM and is
applicable to minimization problems written as a sum of many differentiable and
many nondifferentiable convex functions. The nondifferentiable functions can
be coupled. We furthermore present a related stochastic variation, which we
call stochastic PPG (SPPG). SPPG can be interpreted as a generalization of
Finito and MISO over to the sum of many coupled nondifferentiable convex
functions. We present many applications that can benefit from PPG and SPPG and
prove convergence for both methods. A key strength of PPG and SPPG is,
compared to existing methods, its ability to directly handle a large sum of
nondifferentiable nonseparable functions with a constant stepsize independent
of the number of functions. Such nondiminishing stepsizes allows them to be
fast.

The selection of beam orientations, which is a key step in radiation
treatment planning, is particularly challenging for noncoplanar radiotherapy
systems due to the large number of candidate beams. In this paper, we report
progress on the group sparsity approach to beam orientation optimization,
wherein beam angles are selected by solving a large scale fluence map
optimization problem with an additional group sparsity penalty term that
encourages most candidate beams to be inactive. The optimization problem is
solved using an accelerated proximal gradient method, the Fast Iterative
ShrinkageThresholding Algorithm (FISTA). We derive a closedform expression
for a relevant proximal operator which enables the application of FISTA. The
proposed algorithm is used to create noncoplanar treatment plans for four
cases (including head and neck, lung, and prostate cases), and the resulting
plans are compared with clinical plans. The dosimetric quality of the group
sparsity treatment plans is superior to that of the clinical plans. Moreover,
the runtime for the group sparsity approach is typically about 5 minutes.
Problems of this size could not be handled using the previous group sparsity
method for beam orientation optimization, which was slow to solve much smaller
coplanar cases. This work demonstrates for the first time that the group
sparsity approach, when combined with an accelerated proximal gradient method
such as FISTA, works effectively for noncoplanar cases with 500800 candidate
beams.

In this paper, we present a method for identifying infeasible, unbounded, and
pathological conic programs based on DouglasRachford splitting, or
equivalently ADMM. When an optimization program is infeasible, unbounded, or
pathological, the iterates of DouglasRachford splitting diverge. Somewhat
surprisingly, such divergent iterates still provide useful information, which
our method uses for identification. In addition, for strongly infeasible
problems the method produces a separating hyperplane and informs the user on
how to minimally modify the given problem to achieve strong feasibility. As a
firstorder method, the proposed algorithm relies on simple subroutines, and
therefore is simple to implement and has low periteration cost.

While a number of different algorithms have recently been proposed for
convolutional dictionary learning, this remains an expensive problem. The
single biggest impediment to learning from large training sets is the memory
requirements, which grow at least linearly with the size of the training set
since all existing methods are batch algorithms. The work reported here
addresses this limitation by extending online dictionary learning ideas to the
convolutional context.

In this paper, we consider the convergence of a very general
asynchronousparallel algorithm called ARock, that takes many wellknown
asynchronous algorithms as special cases (gradient descent, proximal gradient,
Douglas Rachford, ADMM, etc.). In asynchronousparallel algorithms, the
computing nodes simply use the most recent information that they have access
to, instead of waiting for a full update from all nodes in the system. This
means that nodes do not have to waste time waiting for information, which can
be a major bottleneck, especially in distributed systems. When the system has
$p$ nodes, asynchronous algorithms may complete $\Theta(\ln(p))$ more
iterations than synchronous algorithms in a given time period ("more iterations
per second").
Although asynchronous algorithms may compute more iterations per second,
there is error associated with using outdated information. How many more
iterations in total are needed to compensate for this error is still an open
question. The main results of this paper aim to answer this question. We prove,
loosely, that as the size of the problem becomes large, the number of
additional iterations that asynchronous algorithms need becomes negligible
compared to the total number ("same quality" of the iterations). Taking these
facts together, our results provide solid evidence of the potential of
asynchronous algorithms to vastly speed up certain distributed computations.

The need for scalable numerical solutions has motivated the development of
asynchronous parallel algorithms, where a set of nodes run in parallel with
little or no synchronization, thus computing with delayed information. This
paper studies the convergence of the asynchronous parallel algorithm ARock
under potentially unbounded delays.
ARock is a general asynchronous algorithm that has many applications. It
parallelizes fixedpoint iterations by letting a set of nodes randomly choose
solution coordinates and update them in an asynchronous parallel fashion. ARock
takes some recent asynchronous coordinate descent algorithms as special cases
and gives rise to new asynchronous operatorsplitting algorithms. Existing
analysis of ARock assumes the delays to be bounded, and uses this bound to set
a step size that is important to both convergence and efficiency. Other work,
though allowing unbounded delays, imposes strict conditions on the underlying
fixedpoint operator, resulting in limited applications.
In this paper, convergence is established under unbounded delays, which can
be either stochastic or deterministic. The proposed step sizes are more
practical and generally larger than those in the existing work. The step size
adapts to the delay distribution or the current delay being experienced in the
system. New Lyapunov functions, which are the key to analyzing asynchronous
algorithms, are generated to obtain our results. A set of applicable
optimization algorithms with largescale applications are given, including
machine learning and scientific computing algorithms.

Asynchronousparallel algorithms have the potential to vastly speed up
algorithms by eliminating costly synchronization. However, our understanding to
these algorithms is limited because the current convergence of asynchronous
(block) coordinate descent algorithms are based on somewhat unrealistic
assumptions. In particular, the age of the shared optimization variables being
used to update a block is assumed to be independent of the block being updated.
Also, it is assumed that the updates are applied to randomly chosen blocks. In
this paper, we argue that these assumptions either fail to hold or will imply
less efficient implementations. We then prove the convergence of
asynchronousparallel block coordinate descent under more realistic
assumptions, in particular, always without the independence assumption. The
analysis permits both the deterministic (essentially) cyclic and random rules
for block choices. Because a bound on the asynchronous delays may or may not be
available, we establish convergence for both bounded delays and unbounded
delays. The analysis also covers nonconvex, weakly convex, and strongly convex
functions. We construct Lyapunov functions that directly model both objective
progress and delays, so delays are not treated errors or noise. A
continuoustime ODE is provided to explain the construction at a high level.

We propose an asynchronous, decentralized algorithm for consensus
optimization. The algorithm runs over a network in which the agents communicate
with their neighbors and perform local computation. In the proposed algorithm,
each agent can compute and communicate independently at different times, for
different durations, with the information it has even if the latest information
from its neighbors is not yet available. Such an asynchronous algorithm reduces
the time that agents would otherwise waste idle because of communication delays
or because their neighbors are slower. It also eliminates the need for a global
clock for synchronization. Mathematically, the algorithm involves both primal
and dual variables, uses fixed stepsize parameters, and provably converges to
the exact solution under a bounded delay assumption and a random agent
assumption. When running synchronously, the algorithm performs just as well as
existing competitive synchronous algorithms such as PGEXTRA, which diverges
without synchronization. Numerical experiments confirm the theoretical findings
and illustrate the performance of the proposed algorithm.

Many problems reduce to the fixedpoint problem of solving $x=T(x)$. To this
problem, we apply the coordinateupdate algorithms, which update only one or a
few components of $x$ at each step. When each update is cheap, these algorithms
are faster than the full fixedpoint iteration (which updates all the
components).
In this paper, we focus on the coordinateupdate algorithms based on the
cyclic selection rules, where the ordering of coordinates in each cycle is
arbitrary. These algorithms are fast, but their convergence is unknown in the
fixedpoint setting.
When $T$ is a nonexpansive operator and has a fixed point, we show that the
sequence of coordinateupdate iterates converges to a fixed point under proper
step sizes. This result applies to the primaldual coordinateupdate
algorithms, which have applications to optimization problems with nonseparable
nonsmooth objectives, as well as global linear constraints.
Numerically, we apply coordinateupdate algorithms with the cyclic, shuffled
cyclic, and random selection rules to $\ell_1$ robust least squares, a CT image
reconstruction problem, as well as nonnegative matrix factorization. They
converge much faster than the standard fixedpoint iteration. Among the three
rules, cyclic and shuffled cyclic rules are overall faster than the random
rule.

This monograph presents a class of algorithms called coordinate descent
algorithms for mathematicians, statisticians, and engineers outside the field
of optimization. This particular class of algorithms has recently gained
popularity due to their effectiveness in solving largescale optimization
problems in machine learning, compressed sensing, image processing, and
computational statistics. Coordinate descent algorithms solve optimization
problems by successively minimizing along each coordinate or coordinate
hyperplane, which is ideal for parallelized and distributed computing. Avoiding
detailed technicalities and proofs, this monograph gives relevant theory and
examples for practitioners to effectively apply coordinate descent to modern
problems in data science and engineering.