• ### On stochastic and deterministic quasi-Newton methods for non-Strongly convex optimization: Asymptotic convergence and rate analysis(1710.05509)

Oct. 1, 2019 math.OC
Motivated by applications arising from large scale optimization and machine learning, we consider stochastic quasi-Newton (SQN) methods for solving unconstrained convex optimization problems. The convergence analysis of the SQN methods, both full and limited-memory variants, require the objective function to be strongly convex. However, this assumption is fairly restrictive and does not hold for applications such as minimizing the logistic regression loss function. To the best of our knowledge, no rate statements currently exist for SQN methods in the absence of such an assumption. Also, among the existing first-order methods for addressing stochastic optimization problems with merely convex objectives, those equipped with provable convergence rates employ averaging. However, this averaging technique has a detrimental impact on inducing sparsity. Motivated by these gaps, the main contributions of the paper are as follows: (i) Addressing large scale stochastic optimization problems, we develop an iteratively regularized stochastic limited-memory BFGS (IRS-LBFGS) algorithm, where the stepsize, regularization parameter, and the Hessian inverse approximation matrix are updated iteratively. We establish the convergence to an optimal solution of the original problem both in an almost-sure and mean senses. We derive the convergence rate in terms of the objective function's values and show that it is of the order $\mathcal{O}\left(k^{-\left(\frac{1}{3}-\epsilon\right)}\right)$, where $\epsilon$ is an arbitrary small positive scalar; (ii) In deterministic regime, we show that the regularized limited-memory BFGS algorithm displays a rate of the order $\mathcal{O}\left(\frac{1}{k^{1 -\epsilon'}}\right)$, where $\epsilon'$ is an arbitrary small positive scalar. We present our numerical experiments performed on a large scale text classification problem.
• ### Improved Convergence Rates for Distributed Resource Allocation(1706.05441)

Dec. 16, 2018 cs.MA, cs.DC, math.OC
In this paper, we develop a class of decentralized algorithms for solving a convex resource allocation problem in a network of $n$ agents, where the agent objectives are decoupled while the resource constraints are coupled. The agents communicate over a connected undirected graph, and they want to collaboratively determine a solution to the overall network problem, while each agent only communicates with its neighbors. We first study the connection between the decentralized resource allocation problem and the decentralized consensus optimization problem. Then, using a class of algorithms for solving consensus optimization problems, we propose a novel class of decentralized schemes for solving resource allocation problems in a distributed manner. Specifically, we first propose an algorithm for solving the resource allocation problem with an $o(1/k)$ convergence rate guarantee when the agents' objective functions are generally convex (could be nondifferentiable) and per agent local convex constraints are allowed; We then propose a gradient-based algorithm for solving the resource allocation problem when per agent local constraints are absent and show that such scheme can achieve geometric rate when the objective functions are strongly convex and have Lipschitz continuous gradients. We have also provided scalability/network dependency analysis. Based on these two algorithms, we have further proposed a gradient projection-based algorithm which can handle smooth objective and simple constraints more efficiently. Numerical experiments demonstrates the viability and performance of all the proposed algorithms.
• ### A Distributed Stochastic Gradient Tracking Method(1803.07741)

March 30, 2018 cs.MA, cs.DC, math.OC
In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method. We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant step size choice). More importantly, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size, which is a comparable performance to a centralized stochastic gradient algorithm. Numerical examples further demonstrate the effectiveness of the method.
• ### SUCAG: Stochastic Unbiased Curvature-aided Gradient Method for Distributed Optimization(1803.08198)

March 22, 2018 math.OC, stat.ML
We propose and analyze a new stochastic gradient method, which we call Stochastic Unbiased Curvature-aided Gra- dient (SUCAG), for finite sum optimization problems. SUCAG constitutes an unbiased total gradient tracking technique that uses Hessian information to accelerate convergence. We an- alyze our method under the general asynchronous model of computation, in which functions are selected infinitely often, but with delays that can grow sublinearly. For strongly convex problems, we establish linear convergence for the SUCAG method. When the initialization point is sufficiently close to the optimal solution, the established convergence rate is only dependent on the condition number of the problem, making it strictly faster than the known rate for the SAGA method. Furthermore, we describe a Markov-driven approach of implementing the SUCAG method in a distributed asynchronous multi-agent setting, via gossiping along a random walk on the communication graph. We show that our analysis applies as long as the undirected graph is connected and, notably, establishes an asymptotic linear convergence rate that is robust to the graph topology. Numerical results demonstrate the merit of our algorithm over existing methods.
• ### Distributed Computation of Wasserstein Barycenters over Networks(1803.02933)

March 8, 2018 cs.MA, cs.DC, math.OC, stat.ML
We propose a new class-optimal algorithm for the distributed computation of Wasserstein Barycenters over networks. Assuming that each node in a graph has a probability distribution, we prove that every node is able to reach the barycenter of all distributions held in the network by using local interactions compliant with the topology of the graph. We show the minimum number of communication rounds required for the proposed method to achieve arbitrary relative precision both in the optimality of the solution and the consensus among all agents for undirected fixed networks.
• ### Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization(1709.08765)

Jan. 15, 2018 cs.MA, cs.DC, math.OC
In decentralized optimization, nodes cooperate to minimize an overall objective function that is the sum (or average) of per-node private objective functions. Algorithms interleave local computations with communication among all or a subset of the nodes. Motivated by a variety of applications---distributed estimation in sensor networks, fitting models to massive data sets, and distributed control of multi-robot systems, to name a few---significant advances have been made towards the development of robust, practical algorithms with theoretical performance guarantees. This paper presents an overview of recent work in this area. In general, rates of convergence depend not only on the number of nodes involved and the desired level of accuracy, but also on the structure and nature of the network over which nodes communicate (e.g., whether links are directed or undirected, static or time-varying). We survey the state-of-the-art algorithms and their analyses tailored to these different scenarios, highlighting the role of the network topology.
• ### Discrete-Time Polar Opinion Dynamics with Susceptibility(1709.07121)

Sept. 21, 2017 cs.MA, math.DS, cs.SI, cs.SY
This paper considers a discrete-time opinion dynamics model in which each individual's susceptibility to being influenced by others is dependent on her current opinion. We assume that the social network has time-varying topology and that the opinions are scalars on a continuous interval. We first propose a general opinion dynamics model based on the DeGroot model, with a general function to describe the functional dependence of each individual's susceptibility on her own opinion, and show that this general model is analogous to the Friedkin-Johnsen model, which assumes a constant susceptibility for each individual. We then consider two specific functions in which the individual's susceptibility depends on the \emph{polarity} of her opinion, and provide motivating social examples. First, we consider stubborn positives, who have reduced susceptibility if their opinions are at one end of the interval and increased susceptibility if their opinions are at the opposite end. A court jury is used as a motivating example. Second, we consider stubborn neutrals, who have reduced susceptibility when their opinions are in the middle of the spectrum, and our motivating examples are social networks discussing established social norms or institutionalized behavior. For each specific susceptibility model, we establish the initial and graph topology conditions in which consensus is reached, and develop necessary and sufficient conditions on the initial conditions for the final consensus value to be at either extreme of the opinion interval. Simulations are provided to show the effects of the susceptibility function when compared to the DeGroot model.
• ### Multi-Competitive Viruses over Static and Time--Varying Networks(1702.07754)

May 15, 2017 math.DS, math.OC
Epidemic processes are used commonly for modeling and analysis of biological networks, computer networks, and human contact networks. The idea of competing viruses has been explored recently, motivated by the spread of different ideas along different social networks. Previous studies of competitive viruses have focused only on two viruses and on static graph structures. In this paper, we consider multiple competing viruses over static and dynamic graph structures, and investigate the eradication and propagation of diseases in these systems. Stability analysis for the class of models we consider is performed and an antidote control technique is proposed.
• ### Fast Convergence Rates for Distributed Non-Bayesian Learning(1508.05161)

April 11, 2017 math.OC
We consider the problem of distributed learning, where a network of agents collectively aim to agree on a hypothesis that best explains a set of distributed observations of conditionally independent random processes. We propose a distributed algorithm and establish consistency, as well as a non-asymptotic, explicit and geometric convergence rate for the concentration of the beliefs around the set of optimal hypotheses. Additionally, if the agents interact over static networks, we provide an improved learning protocol with better scalability with respect to the number of nodes in the network.
• ### Distributed Learning for Cooperative Inference(1704.02718)

April 10, 2017 math.PR, cs.MA, math.OC, cs.LG, stat.ML
We study the problem of cooperative inference where a group of agents interact over a network and seek to estimate a joint parameter that best explains a set of observations. Agents do not know the network topology or the observations of other agents. We explore a variational interpretation of the Bayesian posterior density, and its relation to the stochastic mirror descent algorithm, to propose a new distributed learning algorithm. We show that, under appropriate assumptions, the beliefs generated by the proposed algorithm concentrate around the true parameter exponentially fast. We provide explicit non-asymptotic bounds for the convergence rate. Moreover, we develop explicit and computationally efficient algorithms for observation models belonging to exponential families.
• ### Distributed Gaussian Learning over Time-varying Directed Graphs(1612.01600)

Dec. 7, 2016 cs.MA, math.OC, cs.LG, stat.ML, cs.SY
We present a distributed (non-Bayesian) learning algorithm for the problem of parameter estimation with Gaussian noise. The algorithm is expressed as explicit updates on the parameters of the Gaussian beliefs (i.e. means and precision). We show a convergence rate of $O(1/k)$ with the constant term depending on the number of agents and the topology of the network. Moreover, we show almost sure convergence to the optimal solution of the estimation problem for the general case of time-varying directed graphs.
• ### A Tutorial on Distributed (Non-Bayesian) Learning: Problem, Algorithms and Results(1609.07537)

Sept. 23, 2016 cs.MA, cs.SI, math.OC, cs.LG, stat.ML
We overview some results on distributed learning with focus on a family of recently proposed algorithms known as non-Bayesian social learning. We consider different approaches to the distributed learning problem and its algorithmic solutions for the case of finitely many hypotheses. The original centralized problem is discussed at first, and then followed by a generalization to the distributed setting. The results on convergence and convergence rate are presented for both asymptotic and finite time regimes. Various extensions are discussed such as those dealing with directed time-varying networks, Nesterov's acceleration technique and a continuum sets of hypothesis.
• ### Geometrically Convergent Distributed Optimization with Uncoordinated Step-Sizes(1609.05877)

Sept. 19, 2016 math.OC, stat.ML, cs.SY
A recent algorithmic family for distributed optimization, DIGing's, have been shown to have geometric convergence over time-varying undirected/directed graphs. Nevertheless, an identical step-size for all agents is needed. In this paper, we study the convergence rates of the Adapt-Then-Combine (ATC) variation of the DIGing algorithm under uncoordinated step-sizes. We show that the ATC variation of DIGing algorithm converges geometrically fast even if the step-sizes are different among the agents. In addition, our analysis implies that the ATC structure can accelerate convergence compared to the distributed gradient descent (DGD) structure which has been used in the original DIGing algorithm.
• ### Epidemic Processes over Time-Varying Networks(1609.05128)

Sept. 16, 2016 math.DS, math.OC
The spread of viruses in biological networks, computer networks, and human contact networks can have devastating effects; developing and analyzing mathematical models of these systems can be insightful and lead to societal benefits. Prior research has focused mainly on network models with static graph structures, however the systems being modeled typically have dynamic graph structures. Therefore to better understand and analyze virus spread, further study is required. In this paper, we consider virus spread models over networks with dynamic graph structures, and investigate the behavior of diseases in these systems. A stability analysis of epidemic processes over time-varying networks is performed, examining conditions for the disease free equilibrium, in both the deterministic and stochastic cases. We present simulation results, propose a number of corollaries based on these simulations, and discuss quarantine control via simulation.
• ### On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression(1606.03000)

June 9, 2016 cs.IT, math.IT, cs.LG
The problem of least squares regression of a $d$-dimensional unknown parameter is considered. A stochastic gradient descent based algorithm with weighted iterate-averaging that uses a single pass over the data is studied and its convergence rate is analyzed. We first consider a bounded constraint set of the unknown parameter. Under some standard regularity assumptions, we provide an explicit $O(1/k)$ upper bound on the convergence rate, depending on the variance (due to the additive noise in the measurements) and the size of the constraint set. We show that the variance term dominates the error and decreases with rate $1/k$, while the term which is related to the size of the constraint set decreases with rate $\log k/k^2$. We then compare the asymptotic ratio $\rho$ between the convergence rate of the proposed scheme and the empirical risk minimizer (ERM) as the number of iterations approaches infinity. We show that $\rho\leq 4$ under some mild conditions for all $d\geq 1$. We further improve the upper bound by showing that $\rho\leq 4/3$ for the case of $d=1$ and unbounded parameter set. Simulation results demonstrate strong performance of the algorithm as compared to existing methods, and coincide with $\rho\leq 4/3$ even for large $d$ in practice.
• ### Distributed Algorithms for Aggregative Games on Graphs(1605.00267)

June 8, 2016 math.OC
We consider a class of Nash games, termed as aggregative games, being played over a networked system. In an aggregative game, a player's objective is a function of the aggregate of all the players' decisions. Every player maintains an estimate of this aggregate, and the players exchange this information with their local neighbors over a connected network. We study distributed synchronous and asynchronous algorithms for information exchange and equilibrium computation over such a network. Under standard conditions, we establish the almost-sure convergence of the obtained sequences to the equilibrium point. We also consider extensions of our schemes to aggregative games where the players' objectives are coupled through a more general form of aggregate function. Finally, we present numerical results that demonstrate the performance of the proposed schemes.
• ### Coordinate Dual Averaging for Decentralized Online Optimization with Nonseparable Global Objectives(1508.07933)

May 20, 2016 math.OC, cs.LG, cs.SY
We consider a decentralized online convex optimization problem in a network of agents, where each agent controls only a coordinate (or a part) of the global decision vector. For such a problem, we propose two decentralized variants (ODA-C and ODA-PS) of Nesterov's primal-dual algorithm with dual averaging. In ODA-C, to mitigate the disagreements on the primal-vector updates, the agents implement a generalization of the local information-exchange dynamics recently proposed by Li and Marden over a static undirected graph. In ODA-PS, the agents implement the broadcast-based push-sum dynamics over a time-varying sequence of uniformly connected digraphs. We show that the regret bounds in both cases have sublinear growth of $O(\sqrt{T})$, with the time horizon $T$, when the stepsize is of the form $1/\sqrt{t}$ and the objective functions are Lipschitz-continuous convex functions with Lipschitz gradients. We also implement the proposed algorithms on a sensor network to complement our theoretical analysis.
• ### Distributed Learning with Infinitely Many Hypotheses(1605.02105)

May 6, 2016 math.OC, cs.LG, stat.ML
We consider a distributed learning setup where a network of agents sequentially access realizations of a set of random variables with unknown distributions. The network objective is to find a parametrized distribution that best describes their joint observations in the sense of the Kullback-Leibler divergence. Apart from recent efforts in the literature, we analyze the case of countably many hypotheses and the case of a continuum of hypotheses. We provide non-asymptotic bounds for the concentration rate of the agents' beliefs around the correct hypothesis in terms of the number of agents, the network parameters, and the learning abilities of the agents. Additionally, we provide a novel motivation for a general set of distributed Non-Bayesian update rules as instances of the distributed stochastic mirror descent algorithm.
• ### On the Analysis of a Continuous-Time Bi-Virus Model(1603.04098)

March 18, 2016 math.DS, math.OC
Motivated by the spread of opinions on different social networks, we study a distributed continuous-time bi-virus model for a system of groups of individuals. An in-depth stability analysis is performed for more general models than have been previously considered, for the healthy and epidemic states. In addition, we investigate sensitivity properties of some nontrivial equilibria and obtain an impossibility result for distributed feedback control.
• ### Stochastic quasi-Newton methods for non-strongly convex problems: convergence and rate analysis(1603.04547)

March 15, 2016 math.OC
Motivated by applications in optimization and machine learning, we consider stochastic quasi-Newton (SQN) methods for solving stochastic optimization problems. In the literature, the convergence analysis of these algorithms relies on strong convexity of the objective function. To our knowledge, no theoretical analysis is provided for the rate statements in the absence of this assumption. Motivated by this gap, we allow the objective function to be merely convex and we develop a cyclic regularized SQN method where the gradient mapping and the Hessian approximation matrix are both regularized at each iteration and are updated in a cyclic manner. We show that, under suitable assumptions on the stepsize and regularization parameters, the objective function value converges to the optimal objective function of the original problem in both almost sure and the expected senses. For each case, a class of feasible sequences that guarantees the convergence is provided. Moreover, the rate of convergence in terms of the objective function value is derived. Our empirical analysis on a binary classification problem shows that the proposed scheme performs well compared to both classic regularization SQN schemes and stochastic approximation method.
• ### Data Clustering and Graph Partitioning via Simulated Mixing(1603.04918)

March 15, 2016 cs.LG, stat.ML
Spectral clustering approaches have led to well-accepted algorithms for finding accurate clusters in a given dataset. However, their application to large-scale datasets has been hindered by computational complexity of eigenvalue decompositions. Several algorithms have been proposed in the recent past to accelerate spectral clustering, however they compromise on the accuracy of the spectral clustering to achieve faster speed. In this paper, we propose a novel spectral clustering algorithm based on a mixing process on a graph. Unlike the existing spectral clustering algorithms, our algorithm does not require computing eigenvectors. Specifically, it finds the equivalent of a linear combination of eigenvectors of the normalized similarity matrix weighted with corresponding eigenvalues. This linear combination is then used to partition the dataset into meaningful clusters. Simulations on real datasets show that partitioning datasets based on such linear combinations of eigenvectors achieves better accuracy than standard spectral clustering methods as the number of clusters increase. Our algorithm can easily be implemented in a distributed setting.
• ### On Smoothing, Regularization and Averaging in Stochastic Approximation Methods for Stochastic Variational Inequalities(1411.0209)

Jan. 5, 2016 math.OC
Traditionally, stochastic approximation schemes for SVIs have relied on strong monotonicity and Lipschitzian properties of the underlying map. In contrast, we consider monotone stochastic variational inequality (SVI) problems where the strong monotonicity and Lipschitzian assumptions on the mappings are weakened. In the first part of the paper, to address such shortcomings, a regularized smoothed SA (RSSA) scheme is developed wherein the stepsize, smoothing, and regularization parameters are diminishing sequences updated after every iteration. Under suitable assumptions on the sequences, we show that the algorithm generates iterates that converge to a solution in an almost sure sense, extending the results in [16] to the non-Lipschitzian regime. Motivated by the need to develop non-asymptotic rate statements, in the second part of the paper, we develop a variant of the RSSA scheme, denoted by aRSSA$_r$, in which we employ a weighted iterate-averaging, parametrized by a scalar $r$ where $r = 1$ provides us with the standard averaging scheme. We make several contributions in this context: First, we show that the gap function associated with the sequences by the aRSSA$_r$ scheme tends to zero when the parameter sequences are chosen appropriately. Second, we show that the gap function associated with the averaged sequence diminishes to zero at the optimal rate $\cal{O}(1/\sqrt{K})$ after $K$ steps when smoothing and regularization are suppressed and $r < 1$, thus improving the rate statement for the standard averaging which admits a rate of $\cal{O}(\ln(K)/\sqrt{K})$. Third, we develop a window-based variant of this scheme that also displays the optimal rate for $r < 1$. Notably, we prove the superiority of the scheme with $r < 1$ with its counterpart with $r=1$ in terms of the constant factor of the error bound when the size of the averaging window is sufficiently large.
• ### Network Independent Rates in Distributed Learning(1509.08574)

Sept. 29, 2015 math.OC
We propose a new belief update rule for Distributed Non-Bayesian learning in time-varying directed graphs, where a group of agents tries to collectively identify a hypothesis that best describes a sequence of observed data. We show that the proposed update rule, inspired by the Push-Sum algorithm, is consistent, moreover we provide an explicit characterization of its convergence rate. Our main result states that, after a transient time, all agents will concentrate their beliefs at a network independent rate. Network independent rates were not available for other consensus based distributed learning algorithms.
• ### Cloud-Based Centralized/Decentralized Multi-Agent Optimization with Communication Delays(1508.06230)

Aug. 25, 2015 math.OC
We present and analyze a computational hybrid architecture for performing multi-agent optimization. The optimization problems under consideration have convex objective and constraint functions with mild smoothness conditions imposed on them. For such problems, we provide a primal-dual algorithm implemented in the hybrid architecture, which consists of a decentralized network of agents into which centralized information is occasionally injected, and we establish its convergence properties. To accomplish this, a central cloud computer aggregates global information, carries out computations of the dual variables based on this information, and then distributes the updated dual variables to the agents. The agents update their (primal) state variables and also communicate among themselves with each agent sharing and receiving state information with some number of its neighbors. Throughout, communications with the cloud are not assumed to be synchronous or instantaneous, and communication delays are explicitly accounted for in the modeling and analysis of the system. Experimental results are presented to support the theoretical developments made.
• ### Nonasymptotic Convergence Rates for Cooperative Learning Over Time-Varying Directed Graphs(1410.1977)

Aug. 21, 2015 math.OC
We study the problem of distributed hypothesis testing with a network of agents where some agents repeatedly gain access to information about the correct hypothesis. The group objective is to globally agree on a joint hypothesis that best describes the observed data at all the nodes. We assume that the agents can interact with their neighbors in an unknown sequence of time-varying directed graphs. Following the pioneering work of Jadbabaie, Molavi, Sandroni, and Tahbaz-Salehi, we propose local learning dynamics which combine Bayesian updates at each node with a local aggregation rule of private agent signals. We show that these learning dynamics drive all agents to the set of hypotheses which best explain the data collected at all nodes as long as the sequence of interconnection graphs is uniformly strongly connected. Our main result establishes a non-asymptotic, explicit, geometric convergence rate for the learning dynamic.