
We consider a multitask learning problem, in which several predictors are
learned jointly. Prior research has shown that learning the relations between
tasks, and between the input features, together with the predictor, can lead to
better generalization and interpretability, which proved to be useful for
applications in many domains. In this paper, we consider a formulation of
multitask learning that learns the relationships both between tasks and between
features, represented through a task covariance and a feature covariance
matrix, respectively. First, we demonstrate that existing methods proposed for
this problem present an issue that may lead to illposed optimization. We then
propose an alternative formulation, as well as an efficient algorithm to
optimize it. Using ideas from optimization and graph theory, we propose an
efficient coordinatewise minimization algorithm that has a closed form
solution for each block subproblem. Our experiments show that the proposed
optimization method is orders of magnitude faster than its competitors. We also
provide a nonlinear extension that is able to achieve better generalization
than existing methods.

Faced with distribution shift between training and test set, we wish to
detect and quantify the shift, and to correct our classifiers without test set
labels. Motivated by medical diagnosis, where diseases (targets), cause
symptoms (observations), we focus on label shift, where the label marginal
$p(y)$ changes but the conditional $p(xy)$ does not. We propose Black Box
Shift Estimation (BBSE) to estimate the test distribution $p(y)$. BBSE exploits
arbitrary black box predictors to reduce dimensionality prior to shift
correction. While better predictors give tighter estimates, BBSE works even
when predictors are biased, inaccurate, or uncalibrated, so long as their
confusion matrices are invertible. We prove BBSE's consistency, bound its
error, and introduce a statistical test that uses BBSE to detect shift. We also
leverage BBSE to correct classifiers. Experiments demonstrate accurate
estimates and improved prediction, even on highdimensional datasets of natural
images.

We propose a method to optimize the representation and distinguishability of
samples from two probability distributions, by maximizing the estimated power
of a statistical test based on the maximum mean discrepancy (MMD). This
optimized MMD is applied to the setting of unsupervised learning by generative
adversarial networks (GAN), in which a model attempts to generate realistic
samples, and a discriminator attempts to tell these apart from data samples. In
this context, the MMD may be used in two roles: first, as a discriminator,
either directly on the samples, or on features of the samples. Second, the MMD
can be used to evaluate the performance of a generative model, by testing the
model's samples against a reference data set. In the latter role, the optimized
MMD is particularly helpful, as it gives an interpretable indication of how the
model and data distributions differ, even in cases where individual model
samples are not easily distinguished either by eye or by classifier.

In distributed machine learning, data is dispatched to multiple machines for
processing. Motivated by the fact that similar data points often belong to the
same or similar classes, and more generally, classification rules of high
accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou
1993), we propose data dependent dispatching that takes advantage of such
structure. We present an indepth analysis of this model, providing new
algorithms with provable worstcase guarantees, analysis proving existing
scalable heuristics perform well in natural non worstcase conditions, and
techniques for extending a dispatching rule from a small sample to the entire
distribution. We overcome novel technical challenges to satisfy important
conditions for accurate distributed learning, including fault tolerance and
balancedness. We empirically compare our approach with baselines based on
random partitioning, balanced partition trees, and locality sensitive hashing,
showing that we achieve significantly higher accuracy on both synthetic and
real world image and advertising datasets. We also demonstrate that our
technique strongly scales with the available computing power.

In this paper, we present two new communicationefficient methods for
distributed minimization of an average of functions. The first algorithm is an
inexact variant of the DANE algorithm that allows any local algorithm to return
an approximate solution to a local subproblem. We show that such a strategy
does not affect the theoretical guarantees of DANE significantly. In fact, our
approach can be viewed as a robustification strategy since the method is
substantially better behaved than DANE on data partition arising in practice.
It is well known that DANE algorithm does not match the communication
complexity lower bounds. To bridge this gap, we propose an accelerated variant
of the first method, called AIDE, that not only matches the communication lower
bounds but can also be implemented using a purely firstorder oracle. Our
empirical results show that AIDE is superior to other communication efficient
algorithms in settings that naturally arise in machine learning applications.

We study FrankWolfe methods for nonconvex stochastic and finitesum
optimization problems. FrankWolfe methods (in the convex case) have gained
tremendous recent interest in machine learning and optimization communities due
to their projectionfree property and their ability to exploit structured
constraints. However, our understanding of these algorithms in the nonconvex
setting is fairly limited. In this paper, we propose nonconvex stochastic
FrankWolfe methods and analyze their convergence properties. For objective
functions that decompose into a finitesum, we leverage ideas from variance
reduction techniques for convex optimization to obtain new variance reduced
nonconvex FrankWolfe methods that have provably faster convergence than the
classical FrankWolfe method. Finally, we show that the faster convergence
rates of our variance reduced methods also translate into improved convergence
rates for the stochastic setting.

Knowing which words have been attended to in previous time steps while
generating a translation is a rich source of information for predicting what
words will be attended to in the future. We improve upon the attention model of
Bahdanau et al. (2014) by explicitly modeling the relationship between previous
and subsequent attention levels for each word using one recurrent network per
input word. This architecture easily captures informative features, such as
fertility and regularities in relative distortion. In experiments, we show our
parameterization of attention improves translation quality.

We introduce a family of adaptive estimators on graphs, based on penalizing
the $\ell_1$ norm of discrete graph differences. This generalizes the idea of
trend filtering [Kim et al. (2009), Tibshirani (2014)], used for univariate
nonparametric regression, to graphs. Analogous to the univariate case, graph
trend filtering exhibits a level of local adaptivity unmatched by the usual
$\ell_2$based graph smoothers. It is also defined by a convex minimization
problem that is readily solved (e.g., by fast ADMM or Newton algorithms). We
demonstrate the merits of graph trend filtering through examples and theory.

We analyze stochastic algorithms for optimizing nonconvex, nonsmooth
finitesum problems, where the nonconvex part is smooth and the nonsmooth part
is convex. Surprisingly, unlike the smooth case, our knowledge of this
fundamental problem is very limited. For example, it is not known whether the
proximal stochastic gradient method with constant minibatch converges to a
stationary point. To tackle this issue, we develop fast stochastic algorithms
that provably converge to a stationary point for constant minibatches.
Furthermore, using a variant of these algorithms, we show provably faster
convergence than batch proximal gradient descent. Finally, we prove global
linear convergence rate for an interesting subclass of nonsmooth nonconvex
functions, that subsumes several recent works. This paper builds upon our
recent series of papers on fast stochastic methods for smooth nonconvex
optimization [22, 23], with a novel analysis for nonconvex and nonsmooth
functions.

We study nonconvex finitesum problems and analyze stochastic variance
reduced gradient (SVRG) methods for them. SVRG and related methods have
recently surged into prominence for convex optimization given their edge over
stochastic gradient descent (SGD); but their theoretical analysis almost
exclusively assumes convexity. In contrast, we prove nonasymptotic rates of
convergence (to stationary points) of SVRG for nonconvex optimization, and show
that it is provably faster than SGD and gradient descent. We also analyze a
subclass of nonconvex problems on which SVRG attains linear convergence to the
global optimum. We extend our analysis to minibatch variants of SVRG, showing
(theoretical) linear speedup due to minibatching in parallel settings.

We analyze a fast incremental aggregated gradient method for optimizing
nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze
the SAGA algorithm within an Incremental Firstorder Oracle framework, and show
that it converges to a stationary point provably faster than both gradient
descent and stochastic gradient descent. We also discuss a Polyak's special
class of nonconvex problems for which SAGA converges at a linear rate to the
global optimum. Finally, we analyze the practically valuable regularized and
minibatch variants of SAGA. To our knowledge, this paper presents the first
analysis of fast convergence for an incremental aggregated gradient method for
nonconvex problems.

This paper presents stacked attention networks (SANs) that learn to answer
natural language questions from images. SANs use semantic representation of a
question as query to search for the regions in an image that are related to the
answer. We argue that image question answering (QA) often requires multiple
steps of reasoning. Thus, we develop a multiplelayer SAN in which we query an
image multiple times to infer the answer progressively. Experiments conducted
on four image QA data sets demonstrate that the proposed SANs significantly
outperform previous stateoftheart approaches. The visualization of the
attention layers illustrates the progress that the SAN locates the relevant
visual clues that lead to the answer of the question layerbylayer.

We study optimization algorithms based on variance reduction for stochastic
gradient descent (SGD). Remarkable recent progress has been made in this
direction through development of algorithms like SAG, SVRG, SAGA. These
algorithms have been shown to outperform SGD, both theoretically and
empirically. However, asynchronous versions of these algorithmsa crucial
requirement for modern largescale applicationshave not been studied. We
bridge this gap by presenting a unifying framework for many variance reduction
techniques. Subsequently, we propose an asynchronous algorithm grounded in our
framework, and prove its fast convergence. An important consequence of our
general approach is that it yields asynchronous versions of variance reduction
algorithms such as SVRG and SAGA as a byproduct. Our method achieves near
linear speedup in sparse settings common to machine learning. We demonstrate
the empirical performance of our method through a concrete realization of
asynchronous SVRG.

The fully connected layers of a deep convolutional neural network typically
contain over 90% of the network parameters, and consume the majority of the
memory required to store the network parameters. Reducing the number of
parameters while preserving essentially the same predictive performance is
critically important for operating deep neural networks in memory constrained
environments such as GPUs or embedded devices.
In this paper we show how kernel methods, in particular a single Fastfood
layer, can be used to replace all fully connected layers in a deep
convolutional neural network. This novel Fastfood layer is also endtoend
trainable in conjunction with convolutional layers, allowing us to combine them
into a new architecture, named deep fried convolutional networks, which
substantially reduces the memory footprint of convolutional networks trained on
MNIST and ImageNet with no drop in predictive performance.

We consider the problem of Bayesian learning on sensitive datasets and
present two simple but somewhat surprising results that connect Bayesian
learning to "differential privacy:, a cryptographic approach to protect
individuallevel privacy while permiting databaselevel utility. Specifically,
we show that that under standard assumptions, getting one single sample from a
posterior distribution is differentially private "for free". We will see that
estimator is statistically consistent, near optimal and computationally
tractable whenever the Bayesian model of interest is consistent, optimal and
tractable. Similarly but separately, we show that a recent line of works that
use stochastic gradient for Hybrid Monte Carlo (HMC) sampling also preserve
differentially privacy with minor or no modifications of the algorithmic
procedure at all, these observations lead to an "anytime" algorithm for
Bayesian learning under privacy constraint. We demonstrate that it performs
much better than the stateoftheart differential private methods on synthetic
and real datasets.

We study a novel splinelike basis, which we name the "falling factorial
basis", bearing many similarities to the classic truncated power basis. The
advantage of the falling factorial basis is that it enables rapid, lineartime
computations in basis matrix multiplication and basis matrix inversion. The
falling factorial functions are not actually splines, but are close enough to
splines that they provably retain some of the favorable properties of the
latter functions. We examine their application in two problems: trend filtering
over arbitrary input points, and a higherorder variant of the twosample
KolmogorovSmirnov test.

Classical methods such as Principal Component Analysis (PCA) and Canonical
Correlation Analysis (CCA) are ubiquitous in statistics. However, these
techniques are only able to reveal linear relationships in data. Although
nonlinear variants of PCA and CCA have been proposed, these are computationally
prohibitive in the large scale.
In a separate strand of recent research, randomized methods have been
proposed to construct features that help reveal nonlinear patterns in data. For
basic tasks such as regression or classification, random features exhibit
little or no loss in performance, while achieving drastic savings in
computational requirements.
In this paper we leverage randomness to design scalable new variants of
nonlinear PCA and CCA; our ideas extend to key multivariate analysis tools such
as spectral clustering or LDA. We demonstrate our algorithms through
experiments on realworld data, on which we compare against the
stateoftheart. A simple R implementation of the presented algorithms is
provided.

We describe Hokusai, a real time system which is able to capture frequency
information for streams of arbitrary sequences of symbols. The algorithm uses
the CountMin sketch as its basis and exploits the fact that sketching is
linear. It provides real time statistics of arbitrary events, e.g. streams of
queries as a function of time. We use a factorizing approximation to provide
point estimates at arbitrary (time, item) combinations. Queries can be answered
in constant time.

In this paper we de ne conditional random elds in reproducing kernel Hilbert
spaces and show connections to Gaussian Process classi cation. More speci
cally, we prove decomposition results for undirected graphical models and we
give constructions for kernels. Finally we present e cient means of solving the
optimization problem using reduced rank decompositions and we show how
stationarity can be exploited e ciently in the optimization process.

We extend the herding algorithm to continuous spaces by using the kernel
trick. The resulting "kernel herding" algorithm is an infinite memory
deterministic process that learns to approximate a PDF with a collection of
samples. We show that kernel herding decreases the error of expectations of
functions in the Hilbert space at a rate O(1/T) which is much faster than the
usual O(1/pT) for iid random samples. We illustrate kernel herding by
approximating Bayesian predictive distributions.

This paper analyses the problem of Gaussian process (GP) bandits with
deterministic observations. The analysis uses a branch and bound algorithm that
is related to the UCB algorithm of (Srinivas et al., 2010). For GPs with
Gaussian observation noise, with variance strictly greater than zero, (Srinivas
et al., 2010) proved that the regret vanishes at the approximate rate of
$O(\frac{1}{\sqrt{t}})$, where t is the number of observations. To complement
their result, we attack the deterministic case and attain a much faster
exponential convergence rate. Under some regularity assumptions, we show that
the regret decreases asymptotically according to $O(e^{\frac{\tau t}{(\ln
t)^{d/4}}})$ with high probability. Here, d is the dimension of the search
space and $\tau$ is a constant that depends on the behaviour of the objective
function near its global maximum.

In this work we study parallelization of online learning, a core primitive in
machine learning. In a parallel environment all known approaches for parallel
online learning lead to delayed updates, where the model is updated using
outofdate information. In the worst case, or when examples are temporally
correlated, delay can have a very adverse effect on the learning algorithm.
Here, we analyze and present preliminary empirical results on a set of learning
architectures based on a feature sharding approach that present various
tradeoffs between delay, degree of parallelism, representation power and
empirical performance.

Empirical evidence suggests that hashing is an effective strategy for
dimensionality reduction and practical nonparametric estimation. In this paper
we provide exponential tail bounds for feature hashing and show that the
interaction between random subspaces is negligible with high probability. We
demonstrate the feasibility of this approach with experimental results for a
new use case  multitask learning with hundreds of thousands of tasks.

We introduce a framework for filtering features that employs the
HilbertSchmidt Independence Criterion (HSIC) as a measure of dependence
between the features and the labels. The key idea is that good features should
maximise such dependence. Feature selection for various supervised learning
problems (including classification and regression) is unified under this
framework, and the solutions can be approximated using a backwardelimination
algorithm. We demonstrate the usefulness of our method on both artificial and
real world datasets.