
In this paper, we study the efficiency of a {\bf R}estarted {\bf S}ub{\bf
G}radient (RSG) method that periodically restarts the standard subgradient
method (SG). We show that, when applied to a broad class of convex optimization
problems, RSG method can find an $\epsilon$optimal solution with a lower
complexity than the SG method. In particular, we first show that RSG can reduce
the dependence of SG's iteration complexity on the distance between the initial
solution and the optimal set to that between the $\epsilon$level set and the
optimal set {multiplied by a logarithmic factor}. Moreover, we show the
advantages of RSG over SG in solving three different families of convex
optimization problems. (a) For the problems whose epigraph is a polyhedron, RSG
is shown to converge linearly. (b) For the problems with local quadratic growth
property in the $\epsilon$sublevel set, RSG has an
$O(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$ iteration complexity. (c) For
the problems that admit a local Kurdyka\L ojasiewicz property with a power
constant of $\beta\in[0,1)$, RSG has an
$O(\frac{1}{\epsilon^{2\beta}}\log(\frac{1}{\epsilon}))$ iteration complexity.
The novelty of our analysis lies at exploiting the lower bound of the
firstorder optimality residual at the $\epsilon$level set. It is this novelty
that allows us to explore the local properties of functions (e.g., local
quadratic growth property, local Kurdyka\L ojasiewicz property, more generally
local error bound conditions) to develop the improved convergence of RSG. { We
also develop a practical variant of RSG enjoying faster convergence than the SG
method, which can be run without knowing the involved parameters in the local
error bound condition.} We demonstrate the effectiveness of the proposed
algorithms on several machine learning tasks including regression,
classification and matrix completion.

To cope with changing environments, recent developments in online learning
have introduced the concepts of adaptive regret and dynamic regret
independently. In this paper, we illustrate an intrinsic connection between
these two concepts by showing that the dynamic regret can be expressed in terms
of the adaptive regret and the functional variation. This observation implies
that strongly adaptive algorithms can be directly leveraged to minimize the
dynamic regret. As a result, we present a series of strongly adaptive
algorithms that have small dynamic regrets for convex functions, exponentially
concave functions, and strongly convex functions, respectively. To the best of
our knowledge, this is the first time that exponential concavity is utilized to
upper bound the dynamic regret. Moreover, all of those adaptive algorithms do
not need any prior knowledge of the functional variation, which is a
significant advantage over previous specialized methods for minimizing dynamic
regret.

Accelerated gradient (AG) methods are breakthroughs in convex optimization,
improving the convergence rate of the gradient descent method for optimization
with smooth functions. However, the analysis of AG methods for nonconvex
optimization is still limited. It remains an open question whether AG methods
from convex optimization can accelerate the convergence of the gradient descent
method for finding local minimum of nonconvex optimization problems. This
paper provides an affirmative answer to this question. In particular, we
analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method
and Nesterov's Accelerated Gradient method) for extracting the negative
curvature from random noise, which is central to escaping from saddle points.
By leveraging the proposed AG methods for extracting the negative curvature, we
present a new AG algorithm with double loops for nonconvex
optimization~\footnote{this is in contrast to a singleloop AG algorithm
proposed in a recent manuscript~\citep{AGNON}, which directly analyzed the
Nesterov's AG method for nonconvex optimization and appeared online on
November 29, 2017. However, we emphasize that our work is an independent work,
which is inspired by our earlier work~\citep{NEON17} and is based on a
different novel analysis.}, which converges to secondorder stationary point
$\x$ such that $\\nabla f(\x)\\leq \epsilon$ and $\nabla^2 f(\x)\geq
\sqrt{\epsilon} I$ with $\widetilde O(1/\epsilon^{1.75})$ iteration
complexity, improving that of gradient descent method by a factor of
$\epsilon^{0.25}$ and matching the best iteration complexity of secondorder
Hessianfree methods for nonconvex optimization.

Two classes of methods have been proposed for escaping from saddle points
with one using the secondorder information carried by the Hessian and the
other adding the noise into the firstorder information. The existing analysis
for algorithms using noise in the firstorder information is quite involved and
hides the essence of added noise, which hinder further improvements of these
algorithms. In this paper, we present a novel perspective of noiseadding
technique, i.e., adding the noise into the firstorder information can help
extract the negative curvature from the Hessian matrix, and provide a formal
reasoning of this perspective by analyzing a simple firstorder procedure. More
importantly, the proposed procedure enables one to design purely firstorder
stochastic algorithms for escaping from nondegenerate saddle points with a
much better time complexity (almost linear time in terms of the problem's
dimensionality). In particular, we develop a {\bf firstorder stochastic
algorithm} based on our new technique and an existing algorithm that only
converges to a firstorder stationary point to enjoy a time complexity of
{$\widetilde O(d/\epsilon^{3.5})$ for finding a nearly secondorder stationary
point $\bf{x}$ such that $\\nabla F(bf{x})\\leq \epsilon$ and $\nabla^2
F(bf{x})\geq \sqrt{\epsilon}I$ (in high probability), where $F(\cdot)$ denotes
the objective function and $d$ is the dimensionality of the problem. To the
best of our knowledge, this is the best theoretical result of firstorder
algorithms for stochastic nonconvex optimization, which is even competitive
with if not better than existing stochastic algorithms hinging on the
secondorder information.

The Hessianvector product has been utilized to find a secondorder
stationary solution with strong complexity guarantee (e.g., almost linear time
complexity in the problem's dimensionality). In this paper, we propose to
further reduce the number of Hessianvector products for faster nonconvex
optimization. Previous algorithms need to approximate the smallest eigenvalue
with a sufficient precision (e.g., $\epsilon_2\ll 1$) in order to achieve a
sufficiently accurate secondorder stationary solution (i.e.,
$\lambda_{\min}(\nabla^2 f(\x))\geq \epsilon_2)$. In contrast, the proposed
algorithms only need to compute the smallest eigenvector approximating the
corresponding eigenvalue up to a small power of current gradient's norm. As a
result, it can dramatically reduce the number of Hessianvector products during
the course of optimization before reaching firstorder stationary points (e.g.,
saddle points). The key building block of the proposed algorithms is a novel
updating step named the NCG step, which lets a noisy negative curvature descent
compete with the gradient descent. We show that the worstcase time complexity
of the proposed algorithms with their favorable prescribed accuracy
requirements can match the best in literature for achieving a secondorder
stationary point but with an arguably smaller periteration cost. We also show
that the proposed algorithms can benefit from inexact Hessian by developing
their variants accepting inexact Hessian under a mild condition for achieving
the same goal. Moreover, we develop a stochastic algorithm for a finite or
infinite sum nonconvex optimization problem. To the best of our knowledge, the
proposed stochastic algorithm is the first one that converges to a secondorder
stationary point in {\it high probability} with a time complexity independent
of the sample size and almost linear in dimensionality.

In this paper, we present a simple analysis of {\bf fast rates} with {\it
high probability} of {\bf empirical minimization} for {\it stochastic composite
optimization} over a finitedimensional bounded convex set with exponential
concave loss functions and an arbitrary convex regularization. To the best of
our knowledge, this result is the first of its kind. As a byproduct, we can
directly obtain the fast rate with {\it high probability} for exponential
concave empirical risk minimization with and without any convex regularization,
which not only extends existing results of empirical risk minimization but also
provides a unified framework for analyzing exponential concave empirical risk
minimization with and without {\it any} convex regularization. Our proof is
very simple only exploiting the covering number of a finitedimensional bounded
set and a concentration inequality of random vectors.

While going deeper has been witnessed to improve the performance of
convolutional neural networks (CNN), going smaller for CNN has received
increasing attention recently due to its attractiveness for mobile/embedded
applications. It remains an active and important topic how to design a small
network while retaining the performance of large and deep CNNs (e.g., Inception
Nets, ResNets). Albeit there are already intensive studies on compressing the
size of CNNs, the considerable drop of performance is still a key concern in
many designs. This paper addresses this concern with several new contributions.
First, we propose a simple yet powerful method for compressing the size of deep
CNNs based on parameter binarization. The striking difference from most
previous work on parameter binarization/quantization lies at different
treatments of $1\times 1$ convolutions and $k\times k$ convolutions ($k>1$),
where we only binarize $k\times k$ convolutions into binary patterns. The
resulting networks are referred to as pattern networks. By doing this, we show
that previous deep CNNs such as GoogLeNet and Inceptiontype Nets can be
compressed dramatically with marginal drop in performance. Second, in light of
the different functionalities of $1\times 1$ (data projection/transformation)
and $k\times k$ convolutions (pattern extraction), we propose a new block
structure codenamed the pattern residual block that adds transformed feature
maps generated by $1\times 1$ convolutions to the pattern feature maps
generated by $k\times k$ convolutions, based on which we design a small network
with $\sim 1$ million parameters. Combining with our parameter binarization, we
achieve better performance on ImageNet than using similar sized networks
including recently released Google MobileNets.

This paper focuses on convex constrained optimization problems, where the
solution is subject to a convex inequality constraint. In particular, we aim at
challenging problems for which both projection into the constrained domain and
a linear optimization under the inequality constraint are timeconsuming, which
render both projected gradient methods and conditional gradient methods (a.k.a.
the FrankWolfe algorithm) expensive. In this paper, we develop projection
reduced optimization algorithms for both smooth and nonsmooth optimization
with improved convergence rates under a certain regularity condition of the
constraint function. We first present a general theory of optimization with
only one projection. Its application to smooth optimization with only one
projection yields $O(1/\epsilon)$ iteration complexity, which improves over the
$O(1/\epsilon^2)$ iteration complexity established before for nonsmooth
optimization and can be further reduced under strong convexity. Then we
introduce a local error bound condition and develop faster algorithms for
nonstrongly convex optimization at the price of a logarithmic number of
projections. In particular, we achieve an iteration complexity of $\widetilde
O(1/\epsilon^{2(1\theta)})$ for nonsmooth optimization and $\widetilde
O(1/\epsilon^{1\theta})$ for smooth optimization, where $\theta\in(0,1]$
appearing the local error bound condition characterizes the functional local
growth rate around the optimal solutions. Novel applications in solving the
constrained $\ell_1$ minimization problem and a positive semidefinite
constrained distance metric learning problem demonstrate that the proposed
algorithms achieve significant speedup compared with previous algorithms.

Recent studies have shown that proximal gradient (PG) method and accelerated
gradient method (APG) with restarting can enjoy a linear convergence under a
weaker condition than strong convexity, namely a quadratic growth condition
(QGC). However, the faster convergence of restarting APG method relies on the
potentially unknown constant in QGC to appropriately restart APG, which
restricts its applicability. We address this issue by developing a novel
adaptive gradient converging methods, i.e., leveraging the magnitude of
proximal gradient as a criterion for restart and termination. Our analysis
extends to a much more general condition beyond the QGC, namely the
H\"{o}lderian error bound (HEB) condition. {\it The key technique} for our
development is a novel synthesis of {\it adaptive regularization and a
conditional restarting scheme}, which extends previous work focusing on
strongly convex problems to a much broader family of problems. Furthermore, we
demonstrate that our results have important implication and applications in
machine learning: (i) if the objective function is coercive and semialgebraic,
PG's convergence speed is essentially $o(\frac{1}{t})$, where $t$ is the total
number of iterations; (ii) if the objective function consists of an $\ell_1$,
$\ell_\infty$, $\ell_{1,\infty}$, or huber norm regularization and a convex
smooth piecewise quadratic loss (e.g., squares loss, squared hinge loss and
huber loss), the proposed algorithm is parameterfree and enjoys a {\it faster
linear convergence} than PG without any other assumptions (e.g., restricted
eigenvalue condition). It is notable that our linear convergence results for
the aforementioned problems are global instead of local. To the best of our
knowledge, these improved results are the first shown in this work.

The lasso model has been widely used for model selection in data mining,
machine learning, and highdimensional statistical analysis. However, due to
the ultrahighdimensional, largescale data sets collected in many realworld
applications, it remains challenging to solve the lasso problems even with
stateoftheart algorithms. Feature screening is a powerful technique for
addressing the Big Data challenge by discarding inactive features from the
lasso optimization. In this paper, we propose a family of hybrid safestrong
rules (HSSR) which incorporate safe screening rules into the sequential strong
rule (SSR) to remove unnecessary computational burden. In particular, we
present two instances of HSSR, namely SSRDome and SSRBEDPP, for the standard
lasso problem. We further extend SSRBEDPP to the elastic net and group lasso
problems to demonstrate the generalizability of the hybrid screening idea.
Extensive numerical experiments with synthetic and real data sets are conducted
for both the standard lasso and the group lasso problems. Results show that our
proposed hybrid rules substantially outperform existing stateoftheart rules.

We propose a doubly stochastic primaldual coordinate optimization algorithm
for empirical risk minimization, which can be formulated as a bilinear
saddlepoint problem. In each iteration, our method randomly samples a block of
coordinates of the primal and dual solutions to update. The linear convergence
of our method could be established in terms of 1) the distance from the current
iterate to the optimal solution and 2) the primaldual objective gap. We show
that the proposed method has a lower overall complexity than existing
coordinate methods when either the data matrix has a factorized structure or
the proximal mapping on each block is computationally expensive, e.g.,
involving an eigenvalue decomposition. The efficiency of the proposed method is
confirmed by empirical studies on several real applications, such as the
multitask large margin nearest neighbor problem.

Although there exist plentiful theories of empirical risk minimization (ERM)
for supervised learning, current theoretical understandings of ERM for a
related problemstochastic convex optimization (SCO), are limited. In this
work, we strengthen the realm of ERM for SCO by exploiting smoothness and
strong convexity conditions to improve the risk bounds. First, we establish an
$\widetilde{O}(d/n + \sqrt{F_*/n})$ risk bound when the random function is
nonnegative, convex and smooth, and the expected function is Lipschitz
continuous, where $d$ is the dimensionality of the problem, $n$ is the number
of samples, and $F_*$ is the minimal risk. Thus, when $F_*$ is small we obtain
an $\widetilde{O}(d/n)$ risk bound, which is analogous to the
$\widetilde{O}(1/n)$ optimistic rate of ERM for supervised learning. Second, if
the objective function is also $\lambda$strongly convex, we prove an
$\widetilde{O}(d/n + \kappa F_*/n )$ risk bound where $\kappa$ is the condition
number, and improve it to $O(1/[\lambda n^2] + \kappa F_*/n)$ when
$n=\widetilde{\Omega}(\kappa d)$. As a result, we obtain an $O(\kappa/n^2)$
risk bound under the condition that $n$ is large and $F_*$ is small, which to
the best of our knowledge, is the first $O(1/n^2)$type of risk bound of ERM.
Third, we stress that the above results are established in a unified framework,
which allows us to derive new risk bounds under weaker conditions, e.g.,
without convexity of the random function and Lipschitz continuity of the
expected function. Finally, we demonstrate that to achieve an $O(1/[\lambda
n^2] + \kappa F_*/n)$ risk bound for supervised learning, the
$\widetilde{\Omega}(\kappa d)$ requirement on $n$ can be replaced with
$\Omega(\kappa^2)$, which is dimensionalityindependent.

In this paper, a new theory is developed for firstorder stochastic convex
optimization, showing that the global convergence rate is sufficiently
quantified by a local growth rate of the objective function in a neighborhood
of the optimal solutions. In particular, if the objective function $F(\mathbf
w)$ in the $\epsilon$sublevel set grows as fast as $\\mathbf w  \mathbf
w_*\_2^{1/\theta}$, where $\mathbf w_*$ represents the closest optimal
solution to $\mathbf w$ and $\theta\in(0,1]$ quantifies the local growth rate,
the iteration complexity of firstorder stochastic optimization for achieving
an $\epsilon$optimal solution can be $\widetilde O(1/\epsilon^{2(1\theta)})$,
which is optimal at most up to a logarithmic factor. To achieve the faster
global convergence, we develop two different accelerated stochastic subgradient
methods by iteratively solving the original problem approximately in a local
region around a historical solution with the size of the local region gradually
decreasing as the solution approaches the optimal set. Besides the theoretical
improvements, this work also includes new contributions towards making the
proposed algorithms practical: (i) we present practical variants of accelerated
stochastic subgradient methods that can run without the knowledge of
multiplicative growth constant and even the growth rate $\theta$; (ii) we
consider a broad family of problems in machine learning to demonstrate that the
proposed algorithms enjoy faster convergence than traditional stochastic
subgradient method. We also characterize the complexity of the proposed
algorithms for ensuring the gradient is small without the smoothness
assumption.

In this paper, we address learning problems for high dimensional data.
Previously, oblivious random projection based approaches that project high
dimensional features onto a random subspace have been used in practice for
tackling highdimensionality challenge in machine learning. Recently, various
nonoblivious randomized reduction methods have been developed and deployed for
solving many numerical problems such as matrix product approximation, lowrank
matrix approximation, etc. However, they are less explored for the machine
learning tasks, e.g., classification. More seriously, the theoretical analysis
of excess risk bounds for risk minimization, an important measure of
generalization performance, has not been established for nonoblivious
randomized reduction methods. It therefore remains an open problem what is the
benefit of using them over previous oblivious random projection based
approaches. To tackle these challenges, we propose an algorithmic framework for
employing nonoblivious randomized reduction method for general empirical risk
minimizing in machine learning tasks, where the original highdimensional
features are projected onto a random subspace that is derived from the data
with a small matrix approximation error. We then derive the first excess risk
bound for the proposed nonoblivious randomized reduction approach without
requiring strong assumptions on the training data. The established excess risk
bound exhibits that the proposed approach provides much better generalization
performance and it also sheds more insights about different randomized
reduction approaches. Finally, we conduct extensive experiments on both
synthetic and realworld benchmark datasets, whose dimension scales to
$O(10^7)$, to demonstrate the efficacy of our proposed approach.

Dropout has been witnessed with great success in training deep neural
networks by independently zeroing out the outputs of neurons at random. It has
also received a surge of interest for shallow learning, e.g., logistic
regression. However, the independent sampling for dropout could be suboptimal
for the sake of convergence. In this paper, we propose to use multinomial
sampling for dropout, i.e., sampling features or neurons according to a
multinomial distribution with different probabilities for different
features/neurons. To exhibit the optimal dropout probabilities, we analyze the
shallow learning with multinomial dropout and establish the risk bound for
stochastic optimization. By minimizing a sampling dependent factor in the risk
bound, we obtain a distributiondependent dropout with sampling probabilities
dependent on the second order statistics of the data distribution. To tackle
the issue of evolving distribution of neurons in deep learning, we propose an
efficient adaptive dropout (named \textbf{evolutional dropout}) that computes
the sampling probabilities onthefly from a minibatch of examples. Empirical
studies on several benchmark datasets demonstrate that the proposed dropouts
achieve not only much faster convergence and but also a smaller testing error
than the standard dropout. For example, on the CIFAR100 data, the evolutional
dropout achieves relative improvements over 10\% on the prediction performance
and over 50\% on the convergence speed compared to the standard dropout.

In this paper, we develop a novel {\bf ho}moto{\bf p}y {\bf s}moothing (HOPS)
algorithm for solving a family of nonsmooth problems that is composed of a
nonsmooth term with an explicit maxstructure and a smooth term or a simple
nonsmooth term whose proximal mapping is easy to compute. The best known
iteration complexity for solving such nonsmooth optimization problems is
$O(1/\epsilon)$ without any assumption on the strong convexity. In this work,
we will show that the proposed HOPS achieved a lower iteration complexity of
$\widetilde O(1/\epsilon^{1\theta})$\footnote{$\widetilde O()$ suppresses a
logarithmic factor.} with $\theta\in(0,1]$ capturing the local sharpness of the
objective function around the optimal solutions. To the best of our knowledge,
this is the lowest iteration complexity achieved so far for the considered
nonsmooth optimization problems without strong convexity assumption. The HOPS
algorithm employs Nesterov's smoothing technique and Nesterov's accelerated
gradient method and runs in stages, which gradually decreases the smoothing
parameter in a stagewise manner until it yields a sufficiently good
approximation of the original function. We show that HOPS enjoys a linear
convergence for many wellknown nonsmooth problems (e.g., empirical risk
minimization with a piecewise linear loss function and $\ell_1$ norm
regularizer, finding a point in a polyhedron, cone programming, etc).
Experimental results verify the effectiveness of HOPS in comparison with
Nesterov's smoothing algorithm and the primaldual style of firstorder
methods.

In prior works, stochastic dual coordinate ascent (SDCA) has been
parallelized in a multicore environment where the cores communicate through
shared memory, or in a multiprocessor distributed memory environment where the
processors communicate through message passing. In this paper, we propose a
hybrid SDCA framework for multicore clusters, the most common high performance
computing environment that consists of multiple nodes each having multiple
cores and its own shared memory. We distribute data across nodes where each
node solves a local problem in an asynchronous parallel fashion on its cores,
and then the local updates are aggregated via an asynchronous acrossnode
update scheme. The proposed double asynchronous method converges to a global
solution for $L$Lipschitz continuous loss functions, and at a linear
convergence rate if a smooth convex loss function is used. Extensive empirical
comparison has shown that our algorithm scales better than the best known
sharedmemory methods and runs faster than previous distributedmemory methods.
Big datasets, such as one of 280 GB from the LIBSVM repository, cannot be
accommodated on a single node and hence cannot be solved by a parallel
algorithm. For such a dataset, our hybrid algorithm takes 30 seconds to achieve
a duality gap of $10^{6}$ on 16 nodes each using 8 cores, which is
significantly faster than the best known distributed algorithms, such as
CoCoA+, that take more than 300 seconds on 16 nodes.

In this paper, we develop a randomized algorithm and theory for learning a
sparse model from largescale and highdimensional data, which is usually
formulated as an empirical risk minimization problem with a sparsityinducing
regularizer. Under the assumption that there exists a (approximately) sparse
solution with high classification accuracy, we argue that the dual solution is
also sparse or approximately sparse. The fact that both primal and dual
solutions are sparse motivates us to develop a randomized approach for a
general convexconcave optimization problem. Specifically, the proposed
approach combines the strength of random projection with that of sparse
learning: it utilizes random projection to reduce the dimensionality, and
introduces $\ell_1$norm regularization to alleviate the approximation error
caused by random projection. Theoretical analysis shows that under favored
conditions, the randomized algorithm can accurately recover the optimal
solutions to the convexconcave optimization problem (i.e., recover both the
primal and dual solutions).

In this paper, we present a novel yet simple homotopy proximal mapping
algorithm for compressive sensing. The algorithm adopts a simple proximal
mapping of the $\ell_1$ norm at each iteration and gradually reduces the
regularization parameter for the $\ell_1$ norm. We prove a global linear
convergence of the proposed homotopy proximal mapping (HPM) algorithm for
solving compressive sensing under three different settings (i) sparse signal
recovery under noiseless measurements, (ii) sparse signal recovery under noisy
measurements, and (iii) nearlysparse signal recovery under subgaussian noisy
measurements. In particular, we show that when the measurement matrix satisfies
Restricted Isometric Properties (RIP), our theoretical results in settings (i)
and (ii) almost recover the best condition on the RIP constants for compressive
sensing. In addition, in setting (iii), our results for sparse signal recovery
are better than the previous results, and furthermore our analysis explicitly
exhibits that more observations lead to not only more accurate recovery but
also faster convergence. Compared with previous studies on linear convergence
for sparse signal recovery, our algorithm is simple and efficient, and our
results are better and provide more insights. Finally our empirical studies
provide further support for the proposed homotopy proximal mapping algorithm
and verify the theoretical results.

Recently, there has been a growing research interest in the analysis of
dynamic regret, which measures the performance of an online learner against a
sequence of local minimizers. By exploiting the strong convexity, previous
studies have shown that the dynamic regret can be upper bounded by the
pathlength of the comparator sequence. In this paper, we illustrate that the
dynamic regret can be further improved by allowing the learner to query the
gradient of the function multiple times, and meanwhile the strong convexity can
be weakened to other nondegeneracy conditions. Specifically, we introduce the
squared pathlength, which could be much smaller than the pathlength, as a new
regularity of the comparator sequence. When multiple gradients are accessible
to the learner, we first demonstrate that the dynamic regret of strongly convex
functions can be upper bounded by the minimum of the pathlength and the
squared pathlength. We then extend our theoretical guarantee to functions that
are semistrongly convex or selfconcordant. To the best of our knowledge, this
is the first time the semistrong convexity and the selfconcordance are
utilized to tighten the dynamic regret.

We consider stochastic strongly convex optimization with a complex inequality
constraint. This complex inequality constraint may lead to computationally
expensive projections in algorithmic iterations of the stochastic gradient
descent~(SGD) methods. To reduce the computation costs pertaining to the
projections, we propose an EpochProjection Stochastic Gradient
Descent~(EproSGD) method. The proposed EproSGD method consists of a sequence
of epochs; it applies SGD to an augmented objective function at each iteration
within the epoch, and then performs a projection at the end of each epoch.
Given a strongly convex optimization and for a total number of $T$ iterations,
EproSGD requires only $\log(T)$ projections, and meanwhile attains an optimal
convergence rate of $O(1/T)$, both in expectation and with a high probability.
To exploit the structure of the optimization problem, we propose a proximal
variant of EproSGD, namely EproORDA, based on the optimal regularized dual
averaging method. We apply the proposed methods on realworld applications; the
empirical results demonstrate the effectiveness of our methods.

This work focuses on dynamic regret of online convex optimization that
compares the performance of online learning to a clairvoyant who knows the
sequence of loss functions in advance and hence selects the minimizer of the
loss function at each step. By assuming that the clairvoyant moves slowly
(i.e., the minimizers change slowly), we present several improved
variationbased upper bounds of the dynamic regret under the true and noisy
gradient feedback, which are {\it optimal} in light of the presented lower
bounds. The key to our analysis is to explore a regularity metric that measures
the temporal changes in the clairvoyant's minimizers, to which we refer as {\it
path variation}. Firstly, we present a general lower bound in terms of the path
variation, and then show that under full information or gradient feedback we
are able to achieve an optimal dynamic regret. Secondly, we present a lower
bound with noisy gradient feedback and then show that we can achieve optimal
dynamic regrets under a stochastic gradient feedback and twopoint bandit
feedback. Moreover, for a sequence of smooth loss functions that admit a small
variation in the gradients, our dynamic regret under the twopoint bandit
feedback matches what is achieved with full information.

Recently, {\it stochastic momentum} methods have been widely adopted in
training deep neural networks. However, their convergence analysis is still
underexplored at the moment, in particular for nonconvex optimization. This
paper fills the gap between practice and theory by developing a basic
convergence analysis of two stochastic momentum methods, namely stochastic
heavyball method and the stochastic variant of Nesterov's accelerated gradient
method. We hope that the basic convergence results developed in this paper can
serve the reference to the convergence of stochastic momentum methods and also
serve the baselines for comparison in future development of stochastic momentum
methods. The novelty of convergence analysis presented in this paper is a
unified framework, revealing more insights about the similarities and
differences between different stochastic momentum methods and stochastic
gradient method. The unified framework exhibits a continuous change from the
gradient method to Nesterov's accelerated gradient method and finally the
heavyball method incurred by a free parameter, which can help explain a
similar change observed in the testing error convergence behavior for deep
learning. Furthermore, our empirical results for optimizing deep neural
networks demonstrate that the stochastic variant of Nesterov's accelerated
gradient method achieves a good tradeoff (between speed of convergence in
training error and robustness of convergence in testing error) among the three
stochastic methods.

Attributes possess appealing properties and benefit many computer vision
problems, such as object recognition, learning with humans in the loop, and
image retrieval. Whereas the existing work mainly pursues utilizing attributes
for various computer vision problems, we contend that the most basic
problemhow to accurately and robustly detect attributes from imageshas
been left under explored. Especially, the existing work rarely explicitly
tackles the need that attribute detectors should generalize well across
different categories, including those previously unseen. Noting that this is
analogous to the objective of multisource domain generalization, if we treat
each category as a domain, we provide a novel perspective to attribute
detection and propose to gear the techniques in multisource domain
generalization for the purpose of learning crosscategory generalizable
attribute detectors. We validate our understanding and approach with extensive
experiments on four challenging datasets and three different problems.

In this paper, we show that simple {Stochastic} subGradient Decent methods
with multiple Restarting, named {\bf RSGD}, can achieve a \textit{linear
convergence rate} for a class of nonsmooth and nonstrongly convex
optimization problems where the epigraph of the objective function is a
polyhedron, to which we refer as {\bf polyhedral convex optimization}. Its
applications in machine learning include $\ell_1$ constrained or regularized
piecewise linear loss minimization and submodular function minimization. To the
best of our knowledge, this is the first result on the linear convergence rate
of stochastic subgradient methods for nonsmooth and nonstrongly convex
optimization problems.