
We consider the problem of learning a highdimensional but lowrank matrix
from a largescale dataset distributed over several machines, where
lowrankness is enforced by a convex trace norm constraint. We propose
DFWTrace, a distributed FrankWolfe algorithm which leverages the lowrank
structure of its updates to achieve efficiency in time, memory and
communication usage. The step at the heart of DFWTrace is solved approximately
using a distributed version of the power method. We provide a theoretical
analysis of the convergence of DFWTrace, showing that we can ensure sublinear
convergence in expectation to an optimal solution with few power iterations per
epoch. We implement DFWTrace in the Apache Spark distributed programming
framework and validate the usefulness of our approach on synthetic and real
data, including the ImageNet dataset with highdimensional features extracted
from a deep neural network.

The amount of personal data collected in our everyday interactions with
connected devices offers great opportunities for innovative services fueled by
machine learning, as well as raises serious concerns for the privacy of
individuals. In this paper, we propose a massively distributed protocol for a
large set of users to privately compute averages over their joint data, which
can then be used to learn predictive models. Our protocol can find a solution
of arbitrary accuracy, does not rely on a third party and preserves the privacy
of users throughout the execution in both the honestbutcurious and malicious
adversary models. Specifically, we prove that the information observed by the
adversary (the set of maliciours users) does not significantly reduce the
uncertainty in its prediction of private values compared to its prior belief.
The level of privacy protection depends on a quantity related to the Laplacian
matrix of the network graph and generally improves with the size of the graph.
Furthermore, we design a verification procedure which offers protection against
malicious users joining the service with the goal of manipulating the outcome
of the algorithm.

The rise of connected personal devices together with privacy concerns call
for machine learning algorithms capable of leveraging the data of a large
number of agents to learn personalized models under strong privacy
requirements. In this paper, we introduce an efficient algorithm to address the
above problem in a fully decentralized (peertopeer) and asynchronous fashion,
with provable convergence rate. We show how to make the algorithm
differentially private to protect against the disclosure of information about
the personal datasets, and formally analyze the tradeoff between utility and
privacy. Our experiments show that our approach dramatically outperforms
previous work in the nonprivate case, and that under privacy constraints, we
can significantly improve over models learned in isolation.

We consider a set of learning agents in a collaborative peertopeer network,
where each agent learns a personalized model according to its own learning
objective. The question addressed in this paper is: how can agents improve upon
their locally trained model by communicating with other agents that have
similar objectives? We introduce and analyze two asynchronous gossip algorithms
running in a fully decentralized manner. Our first approach, inspired from
label propagation, aims to smooth pretrained local models over the network
while accounting for the confidence that each agent has in its initial model.
In our second approach, agents jointly learn and propagate their model by
making iterative updates based on both their local dataset and the behavior of
their neighbors. To optimize this challenging objective, our decentralized
algorithm is based on ADMM.

We study largescale kernel methods for acoustic modeling in speech
recognition and compare their performance to deep neural networks (DNNs). We
perform experiments on four speech recognition datasets, including the TIMIT
and Broadcast News benchmark tasks, and compare these two types of models on
framelevel performance metrics (accuracy, crossentropy), as well as on
recognition metrics (word/character error rate). In order to scale kernel
methods to these large datasets, we use the random Fourier feature method of
Rahimi and Recht (2007). We propose two novel techniques for improving the
performance of kernel acoustic models. First, in order to reduce the number of
random features required by kernel models, we propose a simple but effective
method for feature selection. The method is able to explore a large number of
nonlinear features while maintaining a compact model more efficiently than
existing approaches. Second, we present a number of framelevel metrics which
correlate very strongly with recognition performance when computed on the
heldout set; we take advantage of these correlations by monitoring these
metrics during training in order to decide when to stop learning. This
technique can noticeably improve the recognition performance of both DNN and
kernel models, while narrowing the gap between them. Additionally, we show that
the linear bottleneck method of Sainath et al. (2013) improves the performance
of our kernel models significantly, in addition to speeding up training and
making the models more compact. Together, these three methods dramatically
improve the performance of kernel acoustic models, making their performance
comparable to DNNs on the tasks we explored.

In decentralized networks (of sensors, connected objects, etc.), there is an
important need for efficient algorithms to optimize a global cost function, for
instance to learn a global model from the local data collected by each
computing unit. In this paper, we address the problem of decentralized
minimization of pairwise functions of the data points, where these points are
distributed over the nodes of a graph defining the communication topology of
the network. This general problem finds applications in ranking, distance
metric learning and graph inference, among others. We propose new gossip
algorithms based on dual averaging which aims at solving such problems both in
synchronous and asynchronous settings. The proposed framework is flexible
enough to deal with constrained and regularized variants of the optimization
problem. Our theoretical analysis reveals that the proposed algorithms preserve
the convergence rate of centralized dual averaging up to an additive bias term.
We present numerical simulations on Area Under the ROC Curve (AUC) maximization
and metric learning problems which illustrate the practical interest of our
approach.

In a wide range of statistical learning problems such as ranking, clustering
or metric learning among others, the risk is accurately estimated by
$U$statistics of degree $d\geq 1$, i.e. functionals of the training data with
low variance that take the form of averages over $k$tuples. From a
computational perspective, the calculation of such statistics is highly
expensive even for a moderate sample size $n$, as it requires averaging
$O(n^d)$ terms. This makes learning procedures relying on the optimization of
such data functionals hardly feasible in practice. It is the major goal of this
paper to show that, strikingly, such empirical risks can be replaced by
drastically computationally simpler MonteCarlo estimates based on $O(n)$ terms
only, usually referred to as incomplete $U$statistics, without damaging the
$O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM)
procedures. For this purpose, we establish uniform deviation results describing
the error made when approximating a $U$process by its incomplete version under
appropriate complexity assumptions. Extensions to model selection, fast rate
situations and various sampling techniques are also considered, as well as an
application to stochastic gradient descent for ERM. Finally, numerical examples
are displayed in order to provide strong empirical evidence that the approach
we promote largely surpasses more naive subsampling techniques.

We study largescale kernel methods for acoustic modeling and compare to DNNs
on performance metrics related to both acoustic modeling and recognition.
Measuring perplexity and framelevel classification accuracy, kernelbased
acoustic models are as effective as their DNN counterparts. However, on
tokenerrorrates DNN models can be significantly better. We have discovered
that this might be attributed to DNN's unique strength in reducing both the
perplexity and the entropy of the predicted posterior probabilities. Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models, and reduces the gap between them. While
effective on Broadcast News, this technique could be also applicable to other
tasks.

Efficient and robust algorithms for decentralized estimation in networks are
essential to many distributed systems. Whereas distributed estimation of sample
mean statistics has been the subject of a good deal of attention, computation
of $U$statistics, relying on more expensive averaging over pairs of
observations, is a less investigated area. Yet, such data functionals are
essential to describe global properties of a statistical population, with
important examples including Area Under the Curve, empirical variance, Gini
mean difference and withincluster point scatter. This paper proposes new
synchronous and asynchronous randomized gossip algorithms which simultaneously
propagate data across the network and maintain local estimates of the
$U$statistic of interest. We establish convergence rate bounds of $O(1/t)$ and
$O(\log t / t)$ for the synchronous and asynchronous cases respectively, where
$t$ is the number of iterations, with explicit data and network dependent
terms. Beyond favorable comparisons in terms of rate analysis, numerical
experiments provide empirical evidence the proposed algorithms surpasses the
previously introduced approach.

A good measure of similarity between data points is crucial to many tasks in
machine learning. Similarity and metric learning methods learn such measures
automatically from data, but they do not scale well respect to the
dimensionality of the data. In this paper, we propose a method that can learn
efficiently similarity measure from highdimensional sparse data. The core idea
is to parameterize the similarity measure as a convex combination of rankone
matrices with specific sparsity structures. The parameters are then optimized
with an approximate FrankWolfe procedure to maximally satisfy relative
similarity constraints on the training data. Our algorithm greedily
incorporates one pair of features at a time into the similarity measure,
providing an efficient way to control the number of active features and thus
reduce overfitting. It enjoys very appealing convergence guarantees and its
time and memory complexity depends on the sparsity of the data instead of the
dimension of the feature space. Our experiments on realworld highdimensional
datasets demonstrate its potential for classification, dimensionality reduction
and data exploration.

The computational complexity of kernel methods has often been a major barrier
for applying them to largescale learning problems. We argue that this barrier
can be effectively overcome. In particular, we develop methods to scale up
kernel models to successfully tackle largescale learning problems that are so
far only approachable by deep learning architectures. Based on the seminal work
by Rahimi and Recht on approximating kernel functions with features derived
from random projections, we advance the stateoftheart by proposing methods
that can efficiently train models with hundreds of millions of parameters, and
learn optimal representations from multiple kernels. We conduct extensive
empirical studies on problems from image recognition and automatic speech
recognition, and show that the performance of our kernel models matches that of
wellengineered deep neural nets (DNNs). To the best of our knowledge, this is
the first time that a direct comparison between these two methods on
largescale problems is reported. Our kernel methods have several appealing
properties: training with convex optimization, cost for training a single model
comparable to DNNs, and significantly reduced total cost due to fewer
hyperparameters to tune for model selection. Our contrastive study between
these two very different but equally competitive models sheds light on
fundamental questions such as how to learn good representations.

Learning sparse combinations is a frequent theme in machine learning. In this
paper, we study its associated optimization problem in the distributed setting
where the elements to be combined are not centrally located but spread over a
network. We address the key challenges of balancing communication costs and
optimization errors. To this end, we propose a distributed FrankWolfe (dFW)
algorithm. We obtain theoretical guarantees on the optimization error
$\epsilon$ and communication cost that do not depend on the total number of
combining elements. We further show that the communication cost of dFW is
optimal by deriving a lowerbound on the communication cost required to
construct an $\epsilon$approximate solution. We validate our theoretical
analysis with empirical studies on synthetic and realworld data, which
demonstrate that dFW outperforms both baselines and competing methods. We also
study the performance of dFW when the conditions of our analysis are relaxed,
and show that dFW is fairly robust.

Metric learning has attracted a lot of interest over the last decade, but the
generalization ability of such methods has not been thoroughly studied. In this
paper, we introduce an adaptation of the notion of algorithmic robustness
(previously introduced by Xu and Mannor) that can be used to derive
generalization bounds for metric learning. We further show that a weak notion
of robustness is in fact a necessary and sufficient condition for a metric
learning algorithm to generalize. To illustrate the applicability of the
proposed framework, we derive generalization results for a large family of
existing metric learning algorithms, including some sparse formulations that
are not covered by previous results.

We propose a new approach for metric learning by framing it as learning a
sparse combination of locally discriminative metrics that are inexpensive to
generate from the training data. This flexible framework allows us to naturally
derive formulations for global, multitask and local metric learning. The
resulting algorithms have several advantages over existing methods in the
literature: a much smaller number of parameters to be estimated and a
principled way to generalize learned metrics to new testing data points. To
analyze the approach theoretically, we derive a generalization bound that
justifies the sparse combination. Empirically, we evaluate our algorithms on
several datasets against stateoftheart metric learning methods. The results
are consistent with our theoretical findings and demonstrate the superiority of
our approach in terms of classification performance and scalability.

The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a wellstudied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semisupervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.

The crucial importance of metrics in machine learning algorithms has led to
an increasing interest in optimizing distance and similarity functions, an area
of research known as metric learning. When data consist of feature vectors, a
large body of work has focused on learning a Mahalanobis distance. Less work
has been devoted to metric learning from structured objects (such as strings or
trees), most of it focusing on optimizing a notion of edit distance. We
identify two important limitations of current metric learning approaches.
First, they allow to improve the performance of local algorithms such as
knearest neighbors, but metric learning for global algorithms (such as linear
classifiers) has not been studied so far. Second, the question of the
generalization ability of metric learning methods has been largely ignored. In
this thesis, we propose theoretical and algorithmic contributions that address
these limitations. Our first contribution is the derivation of a new kernel
function built from learned edit probabilities. Our second contribution is a
novel framework for learning string and tree edit similarities inspired by the
recent theory of (e,g,t)good similarity functions. Using uniform stability
arguments, we establish theoretical guarantees for the learned similarity that
give a bound on the generalization error of a linear classifier built from that
similarity. In our third contribution, we extend these ideas to metric learning
from feature vectors by proposing a bilinear similarity learning method that
efficiently optimizes the (e,g,t)goodness. Generalization guarantees are
derived for our approach, highlighting that our method minimizes a tighter
bound on the generalization error of the classifier. Our last contribution is a
framework for establishing generalization bounds for a large class of existing
metric learning algorithms based on a notion of algorithmic robustness.

In recent years, the crucial importance of metrics in machine learning
algorithms has led to an increasing interest for optimizing distance and
similarity functions. Most of the state of the art focus on learning
Mahalanobis distances (requiring to fulfill a constraint of positive
semidefiniteness) for use in a local kNN algorithm. However, no theoretical
link is established between the learned metrics and their performance in
classification. In this paper, we make use of the formal framework of good
similarities introduced by Balcan et al. to design an algorithm for learning a
non PSD linear similarity optimized in a nonlinear feature space, which is then
used to build a global linear classifier. We show that our approach has uniform
stability and derive a generalization bound on the classification error.
Experiments performed on various datasets confirm the effectiveness of our
approach compared to stateoftheart methods and provide evidence that (i) it
is fast, (ii) robust to overfitting and (iii) produces very sparse classifiers.