
We propose methods for distributed graphbased multitask learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or diminishing stepsize in these methods would yield consensus (single task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.

We present novel minibatch stochastic optimization methods for empirical risk
minimization problems, the methods efficiently leverage variance reduced
firstorder and subsampled higherorder information to accelerate the
convergence speed. For quadratic objectives, we prove improved iteration
complexity over stateoftheart under reasonable assumptions. We also provide
empirical evidence of the advantages of our method compared to existing
approaches in the literature.

In modern largescale machine learning applications, the training data are
often partitioned and stored on multiple machines. It is customary to employ
the "data parallelism" approach, where the aggregated training loss is
minimized without moving data across machines. In this paper, we introduce a
novel distributed dual formulation for regularized loss minimization problems
that can directly handle data parallelism in the distributed setting. This
formulation allows us to systematically derive dual coordinate optimization
procedures, which we refer to as Distributed Alternating Dual Maximization
(DADM). The framework extends earlier studies described in (Boyd et al., 2011;
Ma et al., 2015a; Jaggi et al., 2014; Yang, 2013) and has rigorous theoretical
analyses. Moreover with the help of the new formulation, we develop the
accelerated version of DADM (AccDADM) by generalizing the acceleration
technique from (ShalevShwartz and Zhang, 2014) to the distributed setting. We
also provide theoretical results for the proposed accelerated version and the
new result improves previous ones (Yang, 2013; Ma et al., 2015a) whose runtimes
grow linearly on the condition number. Our empirical studies validate our
theory and show that our accelerated approach significantly improves the
previous stateoftheart distributed dual coordinate optimization algorithms.

We present and analyze an approach for distributed stochastic optimization
which is statistically optimal and achieves nearlinear speedups (up to
logarithmic factors). Our approach allows a communicationmemory tradeoff, with
either logarithmic communication but linear memory, or polynomial communication
and a corresponding polynomial reduction in required memory. This
communicationmemory tradeoff is achieved through minibatchprox iterations
(minibatch passiveaggressive updates), where a subproblem on a minibatch is
solved at each iteration. We provide a novel analysis for such a minibatchprox
procedure which achieves the statistical optimal rate regardless of minibatch
size and smoothness, thus significantly improving on prior work.

We consider empirical risk minimization of linear predictors with convex loss
functions. Such problems can be reformulated as convexconcave saddle point
problems, and thus are well suitable for primaldual firstorder algorithms.
However, primaldual algorithms often require explicit strongly convex
regularization in order to obtain fast linear convergence, and the required
dual proximal mapping may not admit closedform or efficient solution. In this
paper, we develop both batch and randomized primaldual algorithms that can
exploit strong convexity from data adaptively and are capable of achieving
linear convergence even without regularization. We also present dualfree
variants of the adaptive primaldual algorithms that do not require computing
the dual proximal mapping, which are especially suitable for logistic
regression.

We develop and analyze efficient "coordinatewise" methods for finding the
leading eigenvector, where each step involves only a vectorvector product. We
establish global convergence with overall runtime guarantees that are at least
as good as Lanczos's method and dominate it for slowly decaying spectrum. Our
methods are based on combining a shiftandinvert approach with coordinatewise
algorithms for linear regression.

We study the sample complexity of canonical correlation analysis (CCA), \ie,
the number of samples needed to estimate the population canonical correlation
and directions up to arbitrarily small error. With mild assumptions on the data
distribution, we show that in order to achieve $\epsilon$suboptimality in a
properly defined measure of alignment between the estimated canonical
directions and the population solution, we can solve the empirical objective
exactly with $N(\epsilon, \Delta, \gamma)$ samples, where $\Delta$ is the
singular value gap of the whitened crosscovariance matrix and $1/\gamma$ is an
upper bound of the condition number of autocovariance matrices. Moreover, we
can achieve the same learning accuracy by drawing the same level of samples and
solving the empirical objective approximately with a stochastic optimization
algorithm; this algorithm is based on the shiftandinvert power iterations and
only needs to process the dataset for $\mathcal{O}\left(\log \frac{1}{\epsilon}
\right)$ passes. Finally, we show that, given an estimate of the canonical
correlation, the streaming version of the shiftandinvert power iterations
achieves the same learning accuracy with the same level of sample complexity,
by processing the data only once.

We consider the problem of estimating and constructing componentwise
confidence intervals of a sparse highdimensional linear regression model when
some covariates of the design matrix are missing completely at random. A
variant of the Dantzig selector (Candes & Tao, 2007) is analyzed for estimating
the regression model and a debiasing argument is employed to construct
componentwise confidence intervals under additional assumptions on the
covariance of the design matrix. We also derive rates of convergence of the
meansquare estimation error and the average confidence interval length, and
show that the dependency over several model parameters (e.g., sparsity $s$,
portion of observed covariates $\rho_*$, signal level $\\beta_0\_2$) are
optimal in a minimax sense.

We consider Bayesian optimization of an expensivetoevaluate blackbox
objective function, where we also have access to cheaper approximations of the
objective. In general, such approximations arise in applications such as
reinforcement learning, engineering, and the natural sciences, and are subject
to an inherent, unknown bias. This model discrepancy is caused by an inadequate
internal model that deviates from reality and can vary over the domain, making
the utilization of these approximations a nontrivial task.
We present a novel algorithm that provides a rigorous mathematical treatment
of the uncertainties arising from model discrepancies and noisy observations.
Its optimization decisions rely on a value of information analysis that extends
the Knowledge Gradient factor to the setting of multiple information sources
that vary in cost: each sampling decision maximizes the predicted benefit per
unit cost.
We conduct an experimental evaluation that demonstrates that the method
consistently outperforms other stateoftheart techniques: it finds designs of
considerably higher objective value and additionally inflicts less cost in the
exploration process.

We study the stochastic optimization of canonical correlation analysis (CCA),
whose objective is nonconvex and does not decouple over training samples.
Although several stochastic gradient based optimization algorithms have been
recently proposed to solve this problem, no global convergence guarantee was
provided by any of them. Inspired by the alternating least squares/power
iterations formulation of CCA, and the shiftandinvert preconditioning method
for PCA, we propose two globally convergent metaalgorithms for CCA, both of
which transform the original problem into sequences of least squares problems
that need only be solved approximately. We instantiate the metaalgorithms with
stateoftheart SGD methods and obtain time complexities that significantly
improve upon that of previous work. Experimental results demonstrate their
superior performance.

Sketching techniques have become popular for scaling up machine learning
algorithms by reducing the sample size or dimensionality of massive data sets,
while still maintaining the statistical power of big data. In this paper, we
study sketching from an optimization point of view: we first show that the
iterative Hessian sketch is an optimization process with preconditioning, and
develop accelerated iterative Hessian sketch via the searching the conjugate
direction; we then establish primaldual connections between the Hessian sketch
and dual random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the accelerated iterative dual
random projection methods. Finally to tackle the challenges from both large
sample size and highdimensionality, we propose the primaldual sketch, which
iteratively sketches the primal and dual formulations. We show that using a
logarithmic number of calls to solvers of small scale problem, primaldual
sketch is able to recover the optimum of the original problem up to arbitrary
precision. The proposed algorithms are validated via extensive experiments on
synthetic and real data sets which complements our theoretical results.

We develop a framework for warmstarting Bayesian optimization, that reduces
the solution time required to solve an optimization problem that is one in a
sequence of related problems. This is useful when optimizing the output of a
stochastic simulator that fails to provide derivative information, for which
Bayesian optimization methods are wellsuited. Solving sequences of related
optimization problems arises when making several business decisions using one
optimization model and input data collected over different time periods or
markets. While many gradientbased methods can be warm started by initiating
optimization at the solution to the previous problem, this warm start approach
does not apply to Bayesian optimization methods, which carry a full metamodel
of the objective function from iteration to iteration. Our approach builds a
joint statistical model of the entire collection of related objective
functions, and uses a value of information calculation to recommend points to
evaluate.

We propose a novel, efficient approach for distributed sparse learning in
highdimensions, where observations are randomly partitioned across machines.
Computationally, at each round our method only requires the master machine to
solve a shifted ell_1 regularized Mestimation problem, and other workers to
compute the gradient. In respect of communication, the proposed approach
provably matches the estimation error bound of centralized methods within
constant rounds of communications (ignoring logarithmic factors). We conduct
extensive experiments on both simulated and real world datasets, and
demonstrate encouraging performances on highdimensional regression and
classification tasks.

We consider the problem of removing and replacing clouds in satellite image
sequences, which has a wide range of applications in remote sensing. Our
approach first detects and removes the cloudcontaminated part of the image
sequences. It then recovers the missing scenes from the clean parts using the
proposed "TECROMAC" (TEmporally Contiguous RObust MAtrix Completion) objective.
The objective function balances temporal smoothness with a low rank solution
while staying close to the original observations. The matrix whose the rows are
pixels and columnsare days corresponding to the image, has lowrank because the
pixels reflect landtypes such as vegetation, roads and lakes and there are
relatively few variations as a result. We provide efficient optimization
algorithms for TECROMAC, so we can exploit images containing millions of
pixels. Empirical results on real satellite image sequences, as well as
simulated data, demonstrate that our approach is able to recover underlying
images from heavily cloudcontaminated observations.

We study the problem of distributed multitask learning with shared
representation, where each machine aims to learn a separate, but related, task
in an unknown shared lowdimensional subspaces, i.e. when the predictor matrix
has low rank. We consider a setting where each task is handled by a different
machine, with samples for the task available locally on the machine, and study
communicationefficient methods for exploiting the shared structure.

We consider parallel global optimization of derivativefree
expensivetoevaluate functions, and propose an efficient method based on
stochastic approximation for implementing a conceptual Bayesian optimization
algorithm proposed by Ginsbourger et al. (2007). At the heart of this algorithm
is maximizing the information criterion called the "multipoints expected
improvement'', or the qEI. To accomplish this, we use infinitessimal
perturbation analysis (IPA) to construct a stochastic gradient estimator and
show that this estimator is unbiased. We also show that the stochastic gradient
ascent algorithm using the constructed gradient estimator converges to a
stationary point of the qEI surface, and therefore, as the number of multiple
starts of the gradient ascent algorithm and the number of steps for each start
grow large, the onestep Bayes optimal set of points is recovered. We show in
numerical experiments that our method for maximizing the qEI is faster than
methods based on closedform evaluation using highdimensional integration,
when considering many parallel function evaluations, and is comparable in speed
when considering few. We also show that the resulting onestep Bayes optimal
algorithm for parallel global optimization finds highquality solutions with
fewer evaluations than a heuristic based on approximately maximizing the qEI.
A highquality open source implementation of this algorithm is available in the
open source Metrics Optimization Engine (MOE).

Contrary to the situation with stochastic gradient descent, we argue that
when using stochastic methods with variance reduction, such as SDCA, SAG or
SVRG, as well as their variants, it could be beneficial to reuse previously
used samples instead of fresh samples, even when fresh samples are available.
We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal
sample size one should use, and also uncover behavior that suggests running
SDCA for an integer number of epochs could be wasteful.

We consider the problem of distributed multitask learning, where each
machine learns a separate, but related, task. Specifically, each machine learns
a linear predictor in highdimensional space,where all tasks share the same
small support. We present a communicationefficient estimator based on the
debiased lasso and show that it is comparable with the optimal centralized
method.

We introduce Bayesian optimization, a technique developed for optimizing
timeconsuming engineering simulations and for fitting machine learning models
on large datasets. Bayesian optimization guides the choice of experiments
during materials design and discovery to find good material designs in as few
experiments as possible. We focus on the case when materials designs are
parameterized by a lowdimensional vector. Bayesian optimization is built on a
statistical technique called Gaussian process regression, which allows
predicting the performance of a new design based on previously tested designs.
After providing a detailed introduction to Gaussian process regression, we
introduce two Bayesian optimization methods: expected improvement, for design
problems with noisefree evaluations; and the knowledgegradient method, which
generalizes expected improvement and may be used in design problems with noisy
evaluations. Both methods are derived using a valueofinformation analysis,
and enjoy onestep Bayesoptimality.

Given $n$ i.i.d. observations of a random vector $(X,Z)$, where $X$ is a
highdimensional vector and $Z$ is a lowdimensional index variable, we study
the problem of estimating the conditional inverse covariance matrix $\Omega(z)
= (E[(XE[X \mid Z])(XE[X \mid Z])^T \mid Z=z])^{1}$ under the assumption
that the set of nonzero elements is small and does not depend on the index
variable. We develop a novel procedure that combines the ideas of the local
constant smoothing and the group Lasso for estimating the conditional inverse
covariance matrix. A proximal iterative smoothing algorithm is used to solve
the corresponding convex optimization problems. We prove that our procedure
recovers the conditional independence assumptions of the distribution $X \mid
Z$ with high probability. This result is established by developing a uniform
deviation bound for the highdimensional conditional covariance matrix from its
population counterpart, which may be of independent interest. Furthermore, we
develop pointwise confidence intervals for individual elements of the
conditional inverse covariance matrix. We perform extensive simulation studies,
in which we demonstrate that our proposed procedure outperforms sensible
competitors. We illustrate our proposal on a S&P 500 stock price data set.

In this paper, we propose a new Soft ConfidenceWeighted (SCW) online
learning scheme, which enables the conventional confidenceweighted learning
method to handle nonseparable cases. Unlike the previous confidenceweighted
learning algorithms, the proposed soft confidenceweighted learning method
enjoys all the four salient properties: (i) large margin training, (ii)
confidence weighting, (iii) capability to handle nonseparable data, and (iv)
adaptive margin. Our experimental results show that the proposed SCW algorithms
significantly outperform the original CW algorithm. When comparing with a
variety of stateoftheart algorithms (including AROW, NAROW and NHERD), we
found that SCW generally achieves better or at least comparable predictive
accuracy, but enjoys significant advantage of computational efficiency (i.e.,
smaller number of updates and lower time cost).