
In a traditional setting, classifiers are trained to approximate a target
function $f:X \rightarrow Y$ where at least a sample for each $y \in Y$ is
presented to the training algorithm. In a zeroshot setting we have a subset of
the labels $\hat{Y} \subset Y$ for which we do not observe any corresponding
training instance. Still, the function $f$ that we train must be able to
correctly assign labels also on $\hat{Y}$. In practice, zeroshot problems are
very important especially when the label set is large and the cost of
editorially label samples for all possible values in the label set might be
prohibitively high. Most recent approaches to zeroshot learning are based on
finding and exploiting relationships between labels using semantic embeddings.
We show in this paper that semantic embeddings, despite being very good at
capturing relationships between labels, are not very good at capturing the
relationships among labels in a datadependent manner. For this reason, we
propose a novel twostep process for learning a zeroshot classifier. In the
first step, we learn what we call a \emph{property embedding space} capturing
the "\emph{learnable}" features of the label set. Then, we exploit the learned
properties in order to reduce the generalization error for a linear nearest
neighborbased classifier.

We address the problem of algorithmic fairness: ensuring that sensitive
variables do not unfairly influence the outcome of a classifier. We present an
approach based on empirical risk minimization, which incorporates a fairness
constraint into the learning problem. It encourages the conditional risk of the
learned classifier to be approximately constant with respect to the sensitive
variable. We derive both risk and fairness bounds that support the statistical
consistency of our approach. We specify our approach to kernel methods and
observe that the fairness requirement implies an orthogonality constraint which
can be easily added to these methods. We further observe that for linear models
the constraint translates into a simple data preprocessing step. Experiments
indicate that the method is empirically effective and performs favorably
against stateoftheart approaches.

Systematic reviews are essential to summarizing the results of different
clinical and social science studies. The first step in a systematic review task
is to identify all the studies relevant to the review. The task of identifying
relevant studies for a given systematic review is usually performed manually,
and as a result, involves substantial amounts of expensive human resource.
Lately, there have been some attempts to reduce this manual effort using active
learning. In this work, we build upon some such existing techniques, and
validate by experimenting on a larger and comprehensive dataset than has been
attempted until now. Our experiments provide insights on the use of different
feature extraction models for different disciplines. More importantly, we
identify that a naive active learning based screening process is biased in
favour of selecting similar documents. We aimed to improve the performance of
the screening process using a novel active learning algorithm with success.
Additionally, we propose a mechanism to choose the best feature extraction
method for a given review.

Increasing availability of vehicle GPS data has created potentially
transformative opportunities for traffic management, route planning and other
locationbased services. Critical to the utility of the data is their accuracy.
Mapmatching is the process of improving the accuracy by aligning GPS data with
the road network. In this paper, we propose a purely probabilistic approach to
mapmatching based on a sequential Monte Carlo algorithm known as particle
filters. The approach performs mapmatching by producing a range of candidate
solutions, each with an associated probability score. We outline implementation
details and thoroughly validate the technique on GPS data of varied quality.

The ability to track a moving vehicle is of crucial importance in numerous
applications. The task has often been approached by the importance sampling
technique of particle filters due to its ability to model nonlinear and
nonGaussian dynamics, of which a vehicle travelling on a road network is a
good example. Particle filters perform poorly when observations are highly
informative. In this paper, we address this problem by proposing particle
filters that sample around the most recent observation. The proposal leads to
an order of magnitude improvement in accuracy and efficiency over conventional
particle filters, especially when observations are infrequent but lownoise.

We introduce the localized Lasso, which is suited for learning models that
are both interpretable and have a high predictive power in problems with high
dimensionality $d$ and small sample size $n$. More specifically, we consider a
function defined by local sparse models, one at each data point. We introduce
samplewise network regularization to borrow strength across the models, and
samplewise exclusive group sparsity (a.k.a., $\ell_{1,2}$ norm) to introduce
diversity into the choice of feature sets in the local models. The local models
are interpretable in terms of similarity of their sparsity patterns. The cost
function is convex, and thus has a globally optimal solution. Moreover, we
propose a simple yet efficient iterative leastsquares based optimization
procedure for the localized Lasso, which does not need a tuning parameter, and
is guaranteed to converge to a globally optimal solution. The solution is
empirically shown to outperform alternatives for both simulated and genomic
personalized medicine data.

This paper presents eight PACBayes bounds to analyze the generalization
performance of multiview classifiers. These bounds adopt data dependent
Gaussian priors which emphasize classifiers with high view agreements. The
center of the prior for the first two bounds is the origin, while the center of
the prior for the third and fourth bounds is given by a data dependent vector.
An important technique to obtain these bounds is two derived logarithmic
determinant inequalities whose difference lies in whether the dimensionality of
data is involved. The centers of the fifth and sixth bounds are calculated on a
separate subset of the training set. The last two bounds use unlabeled data to
represent view agreements and are thus applicable to semisupervised multiview
learning. We evaluate all the presented multiview PACBayes bounds on
benchmark data and compare them with previous singleview PACBayes bounds. The
usefulness and performance of the multiview bounds are discussed.

A contentbased image retrieval system based on multinomial relevance
feedback is proposed. The system relies on an interactive search paradigm where
at each round a user is presented with k images and selects the one closest to
their ideal target. Two approaches, one based on the Dirichlet distribution and
one based the Beta distribution, are used to model the problem motivating an
algorithm that trades exploration and exploitation in presenting the images in
each round. Experimental results show that the new approach compares favourably
with previous work.

We investigate a paradigm in multitask reinforcement learning (MTRL) in
which an agent is placed in an environment and needs to learn to perform a
series of tasks, within this space. Since the environment does not change,
there is potentially a lot of common ground amongst tasks and learning to solve
them individually seems extremely wasteful. In this paper, we explicitly model
and learn this shared structure as it arises in the stateaction value space.
We will show how one can jointly learn optimal valuefunctions by modifying the
popular ValueIteration and PolicyIteration procedures to accommodate this
shared representation assumption and leverage the power of multitask
supervised learning. Finally, we demonstrate that the proposed model and
training procedures, are able to infer good value functions, even under low
samples regimes. In addition to data efficiency, we will show in our analysis,
that learning abstractions of the state space jointly across tasks leads to
more robust, transferable representations with the potential for better
generalization. this shared representation assumption and leverage the power of
multitask supervised learning. Finally, we demonstrate that the proposed model
and training procedures, are able to infer good value functions, even under low
samples regimes. In addition to data efficiency, we will show in our analysis,
that learning abstractions of the state space jointly across tasks leads to
more robust, transferable representations with the potential for better
generalization.

This paper describes PinView, a contentbased image retrieval system that
exploits implicit relevance feedback collected during a search session. PinView
contains several novel methods to infer the intent of the user. From relevance
feedback, such as eye movements or pointer clicks, and visual features of
images, PinView learns a similarity metric between images which depends on the
current interests of the user. It then retrieves images with a specialized
online learning algorithm that balances the tradeoff between exploring new
images and exploiting the already inferred interests of the user. We have
integrated PinView to the contentbased image retrieval system PicSOM, which
enables applying PinView to realworld image databases. With the new algorithms
PinView outperforms the original PicSOM, and in online experiments with real
users the combination of implicit and explicit feedback gives the best results.

In this paper, we correct an upper bound, presented in~\cite{hs11}, on the
generalisation error of classifiers learned through multiple kernel learning.
The bound in~\cite{hs11} uses Rademacher complexity and has an\emph{additive}
dependence on the logarithm of the number of kernels and the margin achieved by
the classifier. However, there are some errors in parts of the proof which are
corrected in this paper. Unfortunately, the final result turns out to be a risk
bound which has a \emph{multiplicative} dependence on the logarithm of the
number of kernels and the margin achieved by the classifier.

We study the task of retrieving relevant experiments given a query
experiment. By experiment, we mean a collection of measurements from a set of
`covariates' and the associated `outcomes'. While similar experiments can be
retrieved by comparing available `annotations', this approach ignores the
valuable information available in the measurements themselves. To incorporate
this information in the retrieval task, we suggest employing a retrieval metric
that utilizes probabilistic models learned from the measurements. We argue that
such a metric is a sensible measure of similarity between two experiments since
it permits inclusion of experimentspecific prior knowledge. However, accurate
models are often not analytical, and one must resort to storing posterior
samples which demands considerable resources. Therefore, we study strategies to
select informative posterior samples to reduce the computational load while
maintaining the retrieval performance. We demonstrate the efficacy of our
approach on simulated data with simple linear regression as the models, and
real world datasets.

Recent nonlinear feature selection approaches employing greedy optimisation
of Centred Kernel Target Alignment(KTA) exhibit strong results in terms of
generalisation accuracy and sparsity. However, they are computationally
prohibitive for large datasets. We propose randSel, a randomised feature
selection algorithm, with attractive scaling properties. Our theoretical
analysis of randSel provides strong probabilistic guarantees for correct
identification of relevant features. RandSel's characteristics make it an ideal
candidate for identifying informative learned representations. We've conducted
experimentation to establish the performance of this approach, and present
encouraging results, including a 3rd position result in the recent ICML black
box learning challenge as well as competitive results for signal peptide
prediction, an important problem in bioinformatics.

Feature selection plays a pivotal role in learning, particularly in areas
were parsimonious features can provide insight into the underlying process,
such as biology. Recent approaches for nonlinear feature selection employing
greedy optimisation of Centred Kernel Target Alignment(KTA), while exhibiting
strong results in terms of generalisation accuracy and sparsity, can become
computationally prohibitive for highdimensional datasets. We propose randSel,
a randomised feature selection algorithm, with attractive scaling properties.
Our theoretical analysis of randSel provides strong probabilistic guarantees
for the correct identification of relevant features. Experimental results on
real and artificial data, show that the method successfully identifies
effective features, performing better than a number of competitive approaches.

The ICML 2013 Workshop on Challenges in Representation Learning focused on
three challenges: the black box learning challenge, the facial expression
recognition challenge, and the multimodal learning challenge. We describe the
datasets created for these challenges and summarize the results of the
competitions. We provide suggestions for organizers of future challenges and
some comments on what kind of knowledge can be gained from machine learning
competitions.

Canonical correlation analysis is a statistical technique that is used to
find relations between two sets of variables. An important extension in pattern
analysis is to consider more than two sets of variables. This problem can be
expressed as a quadratically constrained quadratic program (QCQP), commonly
referred to Multiset Canonical Correlation Analysis (MCCA). This is a
nonconvex problem and so greedy algorithms converge to local optima without
any guarantees on global optimality. In this paper, we show that despite being
highly structured, finding the optimal solution is NPHard. This motivates our
relaxation of the QCQP to a semidefinite program (SDP). The SDP is convex, can
be solved reasonably efficiently and comes with both absolute and
outputsensitive approximation quality. In addition to theoretical guarantees,
we do an extensive comparison of the QCQP method and the SDP relaxation on a
variety of synthetic and real world data. Finally, we present two useful
extensions: we incorporate kernel methods and computing multiple sets of
canonical vectors.

We present a set of highprobability inequalities that control the
concentration of weighted averages of multiple (possibly uncountably many)
simultaneously evolving and interdependent martingales. Our results extend the
PACBayesian analysis in learning theory from the i.i.d. setting to martingales
opening the way for its application to importance weighted sampling,
reinforcement learning, and other interactive learning domains, as well as many
other domains in probability theory and statistics, where martingales are
encountered.
We also present a comparison inequality that bounds the expectation of a
convex function of a martingale difference sequence shifted to the [0,1]
interval by the expectation of the same function of independent Bernoulli
variables. This inequality is applied to derive a tighter analog of
HoeffdingAzuma's inequality.

Nonnegative matrix factorization (NMF) approximates a nonnegative matrix
$X$ by a product of two nonnegative lowrank factor matrices $W$ and $H$. NMF
and its extensions minimize either the KullbackLeibler divergence or the
Euclidean distance between $X$ and $W^T H$ to model the Poisson noise or the
Gaussian noise. In practice, when the noise distribution is heavy tailed, they
cannot perform well. This paper presents Manhattan NMF (MahNMF) which minimizes
the Manhattan distance between $X$ and $W^T H$ for modeling the heavy tailed
Laplacian noise. Similar to sparse and lowrank matrix decompositions, MahNMF
robustly estimates the lowrank part and the sparse part of a nonnegative
matrix and thus performs effectively when data are contaminated by outliers. We
extend MahNMF for various practical applications by developing boxconstrained
MahNMF, manifold regularized MahNMF, group sparse MahNMF, elastic net inducing
MahNMF, and symmetric MahNMF. The major contribution of this paper lies in two
fast optimization algorithms for MahNMF and its extensions: the rankone
residual iteration (RRI) method and Nesterov's smoothing method. In particular,
by approximating the residual matrix by the outer product of one row of W and
one row of $H$ in MahNMF, we develop an RRI method to iteratively update each
variable of $W$ and $H$ in a closed form solution. Although RRI is efficient
for small scale MahNMF and some of its extensions, it is neither scalable to
large scale matrices nor flexible enough to optimize all MahNMF extensions.
Since the objective functions of MahNMF and its extensions are neither convex
nor smooth, we apply Nesterov's smoothing method to recursively optimize one
factor matrix with another matrix fixed. By setting the smoothing parameter
inversely proportional to the iteration number, we improve the approximation
accuracy iteratively for both MahNMF and its extensions.

We develop a new tool for datadependent analysis of the
explorationexploitation tradeoff in learning under limited feedback. Our tool
is based on two main ingredients. The first ingredient is a new concentration
inequality that makes it possible to control the concentration of weighted
averages of multiple (possibly uncountably many) simultaneously evolving and
interdependent martingales. The second ingredient is an application of this
inequality to the explorationexploitation tradeoff via importance weighted
sampling. We apply the new tool to the stochastic multiarmed bandit problem,
however, the main importance of this paper is the development and understanding
of the new tool rather than improvement of existing algorithms for stochastic
multiarmed bandits. In the followup work we demonstrate that the new tool can
improve over stateoftheart in structurally richer problems, such as
stochastic multiarmed bandits with side information (Seldin et al., 2011a).

GRavitational lEnsing Accuracy Testing 2010 (GREAT10) is a public image
analysis challenge aimed at the development of algorithms to analyze
astronomical images. Specifically, the challenge is to measure varying image
distortions in the presence of a variable convolution kernel, pixelization and
noise. This is the second in a series of challenges set to the astronomy,
computer science and statistics communities, providing a structured environment
in which methods can be improved and tested in preparation for planned
astronomical surveys. GREAT10 extends upon previous work by introducing
variable fields into the challenge. The "Galaxy Challenge" involves the precise
measurement of galaxy shape distortions, quantified locally by two parameters
called shear, in the presence of a known convolution kernel. Crucially, the
convolution kernel and the simulated gravitational lensing shape distortion
both now vary as a function of position within the images, as is the case for
real data. In addition, we introduce the "Star Challenge" that concerns the
reconstruction of a variable convolution kernel, similar to that in a typical
astronomical observation. This document details the GREAT10 Challenge for
potential participants. Continually updated information is also available from
http://www.greatchallenges.info.

We propose a method to efficiently construct datadependent kernels which can
make use of large quantities of (unlabeled) data. Our construction makes an
approximation in the standard construction of semisupervised kernels in
Sindhwani et al. 2005. In typical cases these kernels can be computed in
nearlylinear time (in the amount of data), improving on the cubic time of the
standard construction, enabling large scale semisupervised learning in a
variety of contexts. The methods are validated on semisupervised and
unsupervised problems on data sets containing upto 64,000 sample points.

We develop a coherent framework for integrative simultaneous analysis of the
explorationexploitation and model order selection tradeoffs. We improve over
our preceding results on the same subject (Seldin et al., 2011) by combining
PACBayesian analysis with Bernsteintype inequality for martingales. Such a
combination is also of independent interest for studies of multiple
simultaneously evolving martingales.

We present two alternative ways to apply PACBayesian analysis to sequences
of dependent random variables. The first is based on a new lemma that enables
to bound expectations of convex functions of certain dependent random variables
by expectations of the same functions of independent Bernoulli random
variables. This lemma provides an alternative tool to HoeffdingAzuma
inequality to bound concentration of martingale values. Our second approach is
based on integration of HoeffdingAzuma inequality with PACBayesian analysis.
We also introduce a way to apply PACBayesian analysis in situation of limited
feedback. We combine the new tools to derive PACBayesian generalization and
regret bounds for the multiarmed bandit problem. Although our regret bound is
not yet as tight as stateoftheart regret bounds based on other
wellestablished techniques, our results significantly expand the range of
potential applications of PACBayesian analysis and introduce a new analysis
tool to reinforcement learning and many other fields, where martingales and
limited feedback are encountered.

We motivate and analyse a new Tree Search algorithm, GPTS, based on recent
theoretical advances in the use of Gaussian Processes for Bandit problems. We
consider tree paths as arms and we assume the target/reward function is drawn
from a GP distribution. The posterior mean and variance, after observing data,
are used to define confidence intervals for the function values, and we
sequentially play arms with highest upper confidence bounds. We give an
efficient implementation of GPTS and we adapt previous regret bounds by
determining the decay rate of the eigenvalues of the kernel matrix on the whole
set of tree paths. We consider two kernels in the feature space of binary
vectors indexed by the nodes of the tree: linear and Gaussian. The regret grows
in square root of the number of iterations T, up to a logarithmic factor, with
a constant that improves with bigger Gaussian kernel widths. We focus on
practical values of T, smaller than the number of arms. Finally, we apply GPTS
to Open Loop Planning in discounted Markov Decision Processes by modelling the
reward as a discounted sum of independent Gaussian Processes. We report similar
regret bounds to those of the OLOP algorithm.

Multiple Kernel Learning (MKL) is used to replicate the signal combination
process that trading rules embody when they aggregate multiple sources of
financial information when predicting an asset's price movements. A set of
financially motivated kernels is constructed for the EURUSD currency pair and
is used to predict the direction of price movement for the currency over
multiple time horizons. MKL is shown to outperform each of the kernels
individually in terms of predictive accuracy. Furthermore, the kernel
weightings selected by MKL highlights which of the financial features
represented by the kernels are the most informative for predictive tasks.