• In this note, we extend the algorithms Extra and subgradient-push to a new algorithm ExtraPush for consensus optimization with convex differentiable objective functions over a directed network. When the stationary distribution of the network can be computed in advance}, we propose a simplified algorithm called Normalized ExtraPush. Just like Extra, both ExtraPush and Normalized ExtraPush can iterate with a fixed step size. But unlike Extra, they can take a column-stochastic mixing matrix, which is not necessarily doubly stochastic. Therefore, they remove the undirected-network restriction of Extra. Subgradient-push, while also works for directed networks, is slower on the same type of problem because it must use a sequence of diminishing step sizes. We present preliminary analysis for ExtraPush under a bounded sequence assumption. For Normalized ExtraPush, we show that it naturally produces a bounded, linearly convergent sequence provided that the objective function is strongly convex. In our numerical experiments, ExtraPush and Normalized ExtraPush performed similarly well. They are significantly faster than subgradient-push, even when we hand-optimize the step sizes for the latter.
  • Convolutional sparse representations are a form of sparse representation with a structured, translation invariant dictionary. Most convolutional dictionary learning algorithms to date operate in batch mode, requiring simultaneous access to all training images during the learning process, which results in very high memory usage and severely limits the training data that can be used. Very recently, however, a number of authors have considered the design of online convolutional dictionary learning algorithms that offer far better scaling of memory and computational cost with training set size than batch methods. This paper extends our prior work, improving a number of aspects of our previous algorithm; proposing an entirely new one, with better performance, and that supports the inclusion of a spatial mask for learning from incomplete data; and providing a rigorous theoretical analysis of these methods.
  • In this paper, we analyze the convergence of the alternating direction method of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth objective function, $\phi(x_0,\ldots,x_p,y)$, subject to coupled linear equality constraints. Our ADMM updates each of the primal variables $x_0,\ldots,x_p,y$, followed by updating the dual variable. We separate the variable $y$ from $x_i$'s as it has a special role in our analysis. The developed convergence guarantee covers a variety of nonconvex functions such as piecewise linear functions, $\ell_q$ quasi-norm, Schatten-$q$ quasi-norm ($0<q<1$), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) penalty. It also allows nonconvex constraints such as compact manifolds (e.g., spherical, Stiefel, and Grassman manifolds) and linear complementarity constraints. Also, the $x_0$-block can be almost any lower semi-continuous function. By applying our analysis, we show, for the first time, that several ADMM algorithms applied to solve nonconvex models in statistical learning, optimization on manifold, and matrix decomposition are guaranteed to converge. Our results provide sufficient conditions for ADMM to converge on (convex or nonconvex) monotropic programs with three or more blocks, as they are special cases of our model. ADMM has been regarded as a variant to the augmented Lagrangian method (ALM). We present a simple example to illustrate how ADMM converges but ALM diverges with bounded penalty parameter $\beta$. Indicated by this example and other analysis in this paper, ADMM might be a better choice than ALM for some nonconvex \emph{nonsmooth} problems, because ADMM is not only easier to implement, it is also more likely to converge for the concerned scenarios.
  • It is known that operator splitting methods based on Forward Backward Splitting (FBS), Douglas-Rachford Splitting (DRS), and Davis-Yin Splitting (DYS) decompose a difficult optimization problems into simpler subproblems under proper convexity and smoothness assumptions. In this paper, we identify an envelope (an objective function) whose gradient descent iteration under a variable metric coincides with DYS iteration. This result generalizes the Moreau envelope for proximal-point iteration and the envelopes for FBS and DRS iterations identified by Patrinos, Stella, and Themelis. Based on the new envelope and the Stable-Center Manifold Theorem, we further show that, when FBS or DRS iterations start from random points, they avoid all strict saddle points with probability one. This results extends the similar results by Lee et al. from gradient descent to splitting methods.
  • We design fast numerical methods for Hamilton-Jacobi equations in density space (HJD), which arises in optimal transport and mean field games. We overcome the curse-of-infinite-dimensionality nature of HJD by proposing a generalized Hopf formula in density space. The formula transfers optimal control problems in density space, which are constrained minimizations supported on both spatial and time variables, to optimization problems over only spatial variables. This transformation allows us to compute HJD efficiently via multi-level approaches and coordinate descent methods.
  • $L_1$ regularization is used for finding sparse solutions to an underdetermined linear system. As sparse signals are widely expected in remote sensing, this type of regularization scheme and its extensions have been widely employed in many remote sensing problems, such as image fusion, target detection, image super-resolution, and others and have led to promising results. However, solving such sparse reconstruction problems is computationally expensive and has limitations in its practical use. In this paper, we proposed a novel efficient algorithm for solving the complex-valued $L_1$ regularized least squares problem. Taking the high-dimensional tomographic synthetic aperture radar (TomoSAR) as a practical example, we carried out extensive experiments, both with simulation data and real data, to demonstrate that the proposed approach can retain the accuracy of second order methods while dramatically speeding up the processing by one or two orders. Although we have chosen TomoSAR as the example, the proposed method can be generally applied to any spectral estimation problems.
  • This paper addresses consensus optimization problem in a multi-agent network, where all the agents collaboratively find a common minimizer to the sum of their private functions. Our goal is to develop a decentralized algorithm in which there is no center agent and each agent only communicates with its neighbors. State-of-the-art decentralized algorithms for consensus optimization with convex objectives use fixed step sizes but involve communications among either {\em all}, or a {\em random subset}, of the agents at each iteration. Another approach is to employ a \emph{random walk incremental} strategy, which activates a succession of nodes and their links, only one node and one link each time; since the existing algorithms in this approach require diminishing step sizes to converge to the optimal solution, its convergence is relatively slow. In this work, we propose a random walk algorithm that uses a fixed step size and converges faster than the existing random walk incremental algorithms. It is also communication efficient. We derive our algorithm by modifying ADMM and analyze its convergence. We establish linear convergence for least squares problems, along with a state-of-the-art communication complexity. Numerical experiments verify our analyses.
  • In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm ($\texttt{A2BCD}$), the first asynchronous Nesterov-accelerated algorithm that achieves optimal complexity. This parallel algorithm solves the unconstrained convex minimization problem, using $p$ computing nodes which compute updates to shared solution vectors, in an asynchronous fashion with no central coordination. Nodes in asynchronous algorithms do not wait for updates from other nodes before starting a new iteration, but simply compute updates using the most recent solution information available. This allows them to complete iterations much faster than traditional ones, especially at scale, by eliminating the costly synchronization penalty of traditional algorithms. We first prove that $\texttt{A2BCD}$ converges linearly to a solution with a fast accelerated rate that matches the recently proposed $\texttt{NU_ACDM}$, so long as the maximum delay is not too large. Somewhat surprisingly, $\texttt{A2BCD}$ pays no complexity penalty for using outdated information. We then prove lower complexity bounds for randomized coordinate descent methods, which show that $\texttt{A2BCD}$ (and hence $\texttt{NU_ACDM}$) has optimal complexity to within a constant factor. We confirm with numerical experiments that $\texttt{A2BCD}$ outperforms $\texttt{NU_ACDM}$, which is the current fastest coordinate descent algorithm, even at small scale. We also derive and analyze a second-order ordinary differential equation, which is the continuous-time limit of our algorithm, and prove it converges linearly to a solution with a similar accelerated rate.
  • Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is "encoded" to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at every iteration, whose loss is compensated by the embedded redundancy. We show that oblivious application of several popular optimization algorithms on encoded data, including gradient descent, L-BFGS, proximal gradient under data parallelism, and coordinate descent under model parallelism, converge to either approximate or exact solutions of the original problem when stragglers are treated as erasures. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.
  • In this paper, we develop algorithms to overcome the curse of dimensionality in possibly non-convex state-dependent Hamilton-Jacobi equations (HJ PDEs) arising from optimal control and differential game problems. The subproblems are independent and can be implemented in an embarrassingly parallel fashion. This is an ideal setup for perfect scaling in parallel computing. The algorithm is proposed to overcome the curse of dimensionality [1, 2] when solving HJ PDE. The major contribution of the paper is to change an optimization problem over a space of curves to an optimization problem of a single vector, which goes beyond [23]. We extend [5, 6, 8], and conjecture a (Lax-type) minimization principle when the Hamiltonian is convex, as well as a (Hopf-type) maximization principle when the Hamiltonian is non-convex. The conjectured Hopf-type maximization principle is a generalization of the well-known Hopf formula [11, 16, 30]. We validated formula under restricted assumptions, and bring our readers to [57] which validates that our conjectures in a more general setting after a previous version of our paper. We conjectured the weakest assumption is a psuedoconvexity assumption similar to [46]. The optimization problems are of the same dimension as that of the HJ PDE. We suggest a coordinate descent method for the minimization procedure in the generalized Lax/Hopf formula, and numerical differentiation is used to compute the derivatives. This method is preferable since the evaluation of the function value itself requires some computational effort, especially when we handle high dimensional optimization problem. The use of multiple initial guesses and a certificate of correctness are suggested to overcome possibly multiple local extrema since the optimization process is no longer convex. Our method is expected to have application in control theory and differential game problems, and elsewhere.
  • Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for {convex} consensus optimization. However, to the behaviors or consensus \emph{nonconvex} optimization, our understanding is more limited. When we lose convexity, we cannot hope our algorithms always return global solutions though they sometimes still do sometimes. Somewhat surprisingly, the decentralized consensus algorithms, DGD and Prox-DGD, retain most other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a (or a neighborhood of) consensus stationary solution under some regular assumptions. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions if their proximal operators can be computed. Such functions include SCAD and $\ell_q$ quasi-norms, $q\in[0,1)$. Similarly, Prox-DGD can take the constraint to a nonconvex set with an easy projection. To establish these properties, we have to introduce a completely different line of analysis, as well as modify existing proofs that were used the convex setting.
  • Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.
  • Deep neural networks (DNNs) have shown very promising results for various image restoration (IR) tasks. However, the design of network architectures remains a major challenging for achieving further improvements. While most existing DNN-based methods solve the IR problems by directly mapping low quality images to desirable high-quality images, the observation models characterizing the image degradation processes have been largely ignored. In this paper, we first propose a denoising-based IR algorithm, whose iterative steps can be computed efficiently. Then, the iterative process is unfolded into a deep neural network, which is composed of multiple denoisers modules interleaved with back-projection (BP) modules that ensure the observation consistencies. A convolutional neural network (CNN) based denoiser that can exploit the multi-scale redundancies of natural images is proposed. As such, the proposed network not only exploits the powerful denoising ability of DNNs, but also leverages the prior of the observation model. Through end-to-end training, both the denoisers and the BP modules can be jointly optimized. Experimental results on several IR tasks, e.g., image denoising, super-resolution and deblurring show that the proposed method can lead to very competitive and often state-of-the-art results on several IR tasks, including image denoising, deblurring and super-resolution.
  • Despite the vast literature on DRS, there has been very little work analyzing their behavior under pathologies. Most analyses assume a primal solution exists, a dual solution exists, and strong duality holds. When these assumptions are not met, i.e., under pathologies, the theory often breaks down and the empirical performance may degrade significantly. In this paper, we establish that DRS only requires strong duality to work, in the sense that asymptotically iterates are approximately feasible and approximately optimal.
  • Many optimization algorithms converge to stationary points. When the underlying problem is nonconvex, they may get trapped at local minimizers and occasionally stagnate near saddle points. We propose the Run-and-Inspect Method, which adds an "inspect" phase to existing algorithms that helps escape from non-global stationary points. The inspection samples a set of points in a radius $R$ around the current point. When a sample point yields a sufficient decrease in the objective, we move there and resume an existing algorithm. If no sufficient decrease is found, the current point is called an approximate $R$-local minimizer. We show that an $R$-local minimizer is globally optimal, up to a specific error depending on $R$, if the objective function can be implicitly decomposed into a smooth convex function plus a restricted function that is possibly nonconvex, nonsmooth. For high-dimensional problems, we introduce blockwise inspections to overcome the curse of dimensionality while still maintaining optimality bounds up to a factor equal to the number of blocks. Our method performs well on a set of artificial and realistic nonconvex problems by coupling with gradient descent, coordinate descent, EM, and prox-linear algorithms.
  • In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and is applicable to minimization problems written as a sum of many differentiable and many non-differentiable convex functions. The non-differentiable functions can be coupled. We furthermore present a related stochastic variation, which we call stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO over to the sum of many coupled non-differentiable convex functions. We present many applications that can benefit from PPG and S-PPG and prove convergence for both methods. A key strength of PPG and S-PPG is, compared to existing methods, its ability to directly handle a large sum of non-differentiable non-separable functions with a constant stepsize independent of the number of functions. Such non-diminishing stepsizes allows them to be fast.
  • The selection of beam orientations, which is a key step in radiation treatment planning, is particularly challenging for non-coplanar radiotherapy systems due to the large number of candidate beams. In this paper, we report progress on the group sparsity approach to beam orientation optimization, wherein beam angles are selected by solving a large scale fluence map optimization problem with an additional group sparsity penalty term that encourages most candidate beams to be inactive. The optimization problem is solved using an accelerated proximal gradient method, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA). We derive a closed-form expression for a relevant proximal operator which enables the application of FISTA. The proposed algorithm is used to create non-coplanar treatment plans for four cases (including head and neck, lung, and prostate cases), and the resulting plans are compared with clinical plans. The dosimetric quality of the group sparsity treatment plans is superior to that of the clinical plans. Moreover, the runtime for the group sparsity approach is typically about 5 minutes. Problems of this size could not be handled using the previous group sparsity method for beam orientation optimization, which was slow to solve much smaller coplanar cases. This work demonstrates for the first time that the group sparsity approach, when combined with an accelerated proximal gradient method such as FISTA, works effectively for non-coplanar cases with 500-800 candidate beams.
  • In this paper, we present a method for identifying infeasible, unbounded, and pathological conic programs based on Douglas-Rachford splitting, or equivalently ADMM. When an optimization program is infeasible, unbounded, or pathological, the iterates of Douglas-Rachford splitting diverge. Somewhat surprisingly, such divergent iterates still provide useful information, which our method uses for identification. In addition, for strongly infeasible problems the method produces a separating hyperplane and informs the user on how to minimally modify the given problem to achieve strong feasibility. As a first-order method, the proposed algorithm relies on simple subroutines, and therefore is simple to implement and has low per-iteration cost.
  • While a number of different algorithms have recently been proposed for convolutional dictionary learning, this remains an expensive problem. The single biggest impediment to learning from large training sets is the memory requirements, which grow at least linearly with the size of the training set since all existing methods are batch algorithms. The work reported here addresses this limitation by extending online dictionary learning ideas to the convolutional context.
  • In this paper, we consider the convergence of a very general asynchronous-parallel algorithm called ARock, that takes many well-known asynchronous algorithms as special cases (gradient descent, proximal gradient, Douglas Rachford, ADMM, etc.). In asynchronous-parallel algorithms, the computing nodes simply use the most recent information that they have access to, instead of waiting for a full update from all nodes in the system. This means that nodes do not have to waste time waiting for information, which can be a major bottleneck, especially in distributed systems. When the system has $p$ nodes, asynchronous algorithms may complete $\Theta(\ln(p))$ more iterations than synchronous algorithms in a given time period ("more iterations per second"). Although asynchronous algorithms may compute more iterations per second, there is error associated with using outdated information. How many more iterations in total are needed to compensate for this error is still an open question. The main results of this paper aim to answer this question. We prove, loosely, that as the size of the problem becomes large, the number of additional iterations that asynchronous algorithms need becomes negligible compared to the total number ("same quality" of the iterations). Taking these facts together, our results provide solid evidence of the potential of asynchronous algorithms to vastly speed up certain distributed computations.
  • The need for scalable numerical solutions has motivated the development of asynchronous parallel algorithms, where a set of nodes run in parallel with little or no synchronization, thus computing with delayed information. This paper studies the convergence of the asynchronous parallel algorithm ARock under potentially unbounded delays. ARock is a general asynchronous algorithm that has many applications. It parallelizes fixed-point iterations by letting a set of nodes randomly choose solution coordinates and update them in an asynchronous parallel fashion. ARock takes some recent asynchronous coordinate descent algorithms as special cases and gives rise to new asynchronous operator-splitting algorithms. Existing analysis of ARock assumes the delays to be bounded, and uses this bound to set a step size that is important to both convergence and efficiency. Other work, though allowing unbounded delays, imposes strict conditions on the underlying fixed-point operator, resulting in limited applications. In this paper, convergence is established under unbounded delays, which can be either stochastic or deterministic. The proposed step sizes are more practical and generally larger than those in the existing work. The step size adapts to the delay distribution or the current delay being experienced in the system. New Lyapunov functions, which are the key to analyzing asynchronous algorithms, are generated to obtain our results. A set of applicable optimization algorithms with large-scale applications are given, including machine learning and scientific computing algorithms.
  • Asynchronous-parallel algorithms have the potential to vastly speed up algorithms by eliminating costly synchronization. However, our understanding to these algorithms is limited because the current convergence of asynchronous (block) coordinate descent algorithms are based on somewhat unrealistic assumptions. In particular, the age of the shared optimization variables being used to update a block is assumed to be independent of the block being updated. Also, it is assumed that the updates are applied to randomly chosen blocks. In this paper, we argue that these assumptions either fail to hold or will imply less efficient implementations. We then prove the convergence of asynchronous-parallel block coordinate descent under more realistic assumptions, in particular, always without the independence assumption. The analysis permits both the deterministic (essentially) cyclic and random rules for block choices. Because a bound on the asynchronous delays may or may not be available, we establish convergence for both bounded delays and unbounded delays. The analysis also covers nonconvex, weakly convex, and strongly convex functions. We construct Lyapunov functions that directly model both objective progress and delays, so delays are not treated errors or noise. A continuous-time ODE is provided to explain the construction at a high level.
  • We propose an asynchronous, decentralized algorithm for consensus optimization. The algorithm runs over a network in which the agents communicate with their neighbors and perform local computation. In the proposed algorithm, each agent can compute and communicate independently at different times, for different durations, with the information it has even if the latest information from its neighbors is not yet available. Such an asynchronous algorithm reduces the time that agents would otherwise waste idle because of communication delays or because their neighbors are slower. It also eliminates the need for a global clock for synchronization. Mathematically, the algorithm involves both primal and dual variables, uses fixed step-size parameters, and provably converges to the exact solution under a bounded delay assumption and a random agent assumption. When running synchronously, the algorithm performs just as well as existing competitive synchronous algorithms such as PG-EXTRA, which diverges without synchronization. Numerical experiments confirm the theoretical findings and illustrate the performance of the proposed algorithm.
  • Many problems reduce to the fixed-point problem of solving $x=T(x)$. To this problem, we apply the coordinate-update algorithms, which update only one or a few components of $x$ at each step. When each update is cheap, these algorithms are faster than the full fixed-point iteration (which updates all the components). In this paper, we focus on the coordinate-update algorithms based on the cyclic selection rules, where the ordering of coordinates in each cycle is arbitrary. These algorithms are fast, but their convergence is unknown in the fixed-point setting. When $T$ is a nonexpansive operator and has a fixed point, we show that the sequence of coordinate-update iterates converges to a fixed point under proper step sizes. This result applies to the primal-dual coordinate-update algorithms, which have applications to optimization problems with nonseparable nonsmooth objectives, as well as global linear constraints. Numerically, we apply coordinate-update algorithms with the cyclic, shuffled cyclic, and random selection rules to $\ell_1$ robust least squares, a CT image reconstruction problem, as well as nonnegative matrix factorization. They converge much faster than the standard fixed-point iteration. Among the three rules, cyclic and shuffled cyclic rules are overall faster than the random rule.
  • This monograph presents a class of algorithms called coordinate descent algorithms for mathematicians, statisticians, and engineers outside the field of optimization. This particular class of algorithms has recently gained popularity due to their effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing. Avoiding detailed technicalities and proofs, this monograph gives relevant theory and examples for practitioners to effectively apply coordinate descent to modern problems in data science and engineering.