• In a traditional setting, classifiers are trained to approximate a target function $f:X \rightarrow Y$ where at least a sample for each $y \in Y$ is presented to the training algorithm. In a zero-shot setting we have a subset of the labels $\hat{Y} \subset Y$ for which we do not observe any corresponding training instance. Still, the function $f$ that we train must be able to correctly assign labels also on $\hat{Y}$. In practice, zero-shot problems are very important especially when the label set is large and the cost of editorially label samples for all possible values in the label set might be prohibitively high. Most recent approaches to zero-shot learning are based on finding and exploiting relationships between labels using semantic embeddings. We show in this paper that semantic embeddings, despite being very good at capturing relationships between labels, are not very good at capturing the relationships among labels in a data-dependent manner. For this reason, we propose a novel two-step process for learning a zero-shot classifier. In the first step, we learn what we call a \emph{property embedding space} capturing the "\emph{learnable}" features of the label set. Then, we exploit the learned properties in order to reduce the generalization error for a linear nearest neighbor-based classifier.
  • We address the problem of algorithmic fairness: ensuring that sensitive variables do not unfairly influence the outcome of a classifier. We present an approach based on empirical risk minimization, which incorporates a fairness constraint into the learning problem. It encourages the conditional risk of the learned classifier to be approximately constant with respect to the sensitive variable. We derive both risk and fairness bounds that support the statistical consistency of our approach. We specify our approach to kernel methods and observe that the fairness requirement implies an orthogonality constraint which can be easily added to these methods. We further observe that for linear models the constraint translates into a simple data preprocessing step. Experiments indicate that the method is empirically effective and performs favorably against state-of-the-art approaches.
  • Systematic reviews are essential to summarizing the results of different clinical and social science studies. The first step in a systematic review task is to identify all the studies relevant to the review. The task of identifying relevant studies for a given systematic review is usually performed manually, and as a result, involves substantial amounts of expensive human resource. Lately, there have been some attempts to reduce this manual effort using active learning. In this work, we build upon some such existing techniques, and validate by experimenting on a larger and comprehensive dataset than has been attempted until now. Our experiments provide insights on the use of different feature extraction models for different disciplines. More importantly, we identify that a naive active learning based screening process is biased in favour of selecting similar documents. We aimed to improve the performance of the screening process using a novel active learning algorithm with success. Additionally, we propose a mechanism to choose the best feature extraction method for a given review.
  • Increasing availability of vehicle GPS data has created potentially transformative opportunities for traffic management, route planning and other location-based services. Critical to the utility of the data is their accuracy. Map-matching is the process of improving the accuracy by aligning GPS data with the road network. In this paper, we propose a purely probabilistic approach to map-matching based on a sequential Monte Carlo algorithm known as particle filters. The approach performs map-matching by producing a range of candidate solutions, each with an associated probability score. We outline implementation details and thoroughly validate the technique on GPS data of varied quality.
  • The ability to track a moving vehicle is of crucial importance in numerous applications. The task has often been approached by the importance sampling technique of particle filters due to its ability to model non-linear and non-Gaussian dynamics, of which a vehicle travelling on a road network is a good example. Particle filters perform poorly when observations are highly informative. In this paper, we address this problem by proposing particle filters that sample around the most recent observation. The proposal leads to an order of magnitude improvement in accuracy and efficiency over conventional particle filters, especially when observations are infrequent but low-noise.
  • We introduce the localized Lasso, which is suited for learning models that are both interpretable and have a high predictive power in problems with high dimensionality $d$ and small sample size $n$. More specifically, we consider a function defined by local sparse models, one at each data point. We introduce sample-wise network regularization to borrow strength across the models, and sample-wise exclusive group sparsity (a.k.a., $\ell_{1,2}$ norm) to introduce diversity into the choice of feature sets in the local models. The local models are interpretable in terms of similarity of their sparsity patterns. The cost function is convex, and thus has a globally optimal solution. Moreover, we propose a simple yet efficient iterative least-squares based optimization procedure for the localized Lasso, which does not need a tuning parameter, and is guaranteed to converge to a globally optimal solution. The solution is empirically shown to outperform alternatives for both simulated and genomic personalized medicine data.
  • This paper presents eight PAC-Bayes bounds to analyze the generalization performance of multi-view classifiers. These bounds adopt data dependent Gaussian priors which emphasize classifiers with high view agreements. The center of the prior for the first two bounds is the origin, while the center of the prior for the third and fourth bounds is given by a data dependent vector. An important technique to obtain these bounds is two derived logarithmic determinant inequalities whose difference lies in whether the dimensionality of data is involved. The centers of the fifth and sixth bounds are calculated on a separate subset of the training set. The last two bounds use unlabeled data to represent view agreements and are thus applicable to semi-supervised multi-view learning. We evaluate all the presented multi-view PAC-Bayes bounds on benchmark data and compare them with previous single-view PAC-Bayes bounds. The usefulness and performance of the multi-view bounds are discussed.
  • A content-based image retrieval system based on multinomial relevance feedback is proposed. The system relies on an interactive search paradigm where at each round a user is presented with k images and selects the one closest to their ideal target. Two approaches, one based on the Dirichlet distribution and one based the Beta distribution, are used to model the problem motivating an algorithm that trades exploration and exploitation in presenting the images in each round. Experimental results show that the new approach compares favourably with previous work.
  • We investigate a paradigm in multi-task reinforcement learning (MT-RL) in which an agent is placed in an environment and needs to learn to perform a series of tasks, within this space. Since the environment does not change, there is potentially a lot of common ground amongst tasks and learning to solve them individually seems extremely wasteful. In this paper, we explicitly model and learn this shared structure as it arises in the state-action value space. We will show how one can jointly learn optimal value-functions by modifying the popular Value-Iteration and Policy-Iteration procedures to accommodate this shared representation assumption and leverage the power of multi-task supervised learning. Finally, we demonstrate that the proposed model and training procedures, are able to infer good value functions, even under low samples regimes. In addition to data efficiency, we will show in our analysis, that learning abstractions of the state space jointly across tasks leads to more robust, transferable representations with the potential for better generalization. this shared representation assumption and leverage the power of multi-task supervised learning. Finally, we demonstrate that the proposed model and training procedures, are able to infer good value functions, even under low samples regimes. In addition to data efficiency, we will show in our analysis, that learning abstractions of the state space jointly across tasks leads to more robust, transferable representations with the potential for better generalization.
  • This paper describes PinView, a content-based image retrieval system that exploits implicit relevance feedback collected during a search session. PinView contains several novel methods to infer the intent of the user. From relevance feedback, such as eye movements or pointer clicks, and visual features of images, PinView learns a similarity metric between images which depends on the current interests of the user. It then retrieves images with a specialized online learning algorithm that balances the tradeoff between exploring new images and exploiting the already inferred interests of the user. We have integrated PinView to the content-based image retrieval system PicSOM, which enables applying PinView to real-world image databases. With the new algorithms PinView outperforms the original PicSOM, and in online experiments with real users the combination of implicit and explicit feedback gives the best results.
  • In this paper, we correct an upper bound, presented in~\cite{hs-11}, on the generalisation error of classifiers learned through multiple kernel learning. The bound in~\cite{hs-11} uses Rademacher complexity and has an\emph{additive} dependence on the logarithm of the number of kernels and the margin achieved by the classifier. However, there are some errors in parts of the proof which are corrected in this paper. Unfortunately, the final result turns out to be a risk bound which has a \emph{multiplicative} dependence on the logarithm of the number of kernels and the margin achieved by the classifier.
  • We study the task of retrieving relevant experiments given a query experiment. By experiment, we mean a collection of measurements from a set of `covariates' and the associated `outcomes'. While similar experiments can be retrieved by comparing available `annotations', this approach ignores the valuable information available in the measurements themselves. To incorporate this information in the retrieval task, we suggest employing a retrieval metric that utilizes probabilistic models learned from the measurements. We argue that such a metric is a sensible measure of similarity between two experiments since it permits inclusion of experiment-specific prior knowledge. However, accurate models are often not analytical, and one must resort to storing posterior samples which demands considerable resources. Therefore, we study strategies to select informative posterior samples to reduce the computational load while maintaining the retrieval performance. We demonstrate the efficacy of our approach on simulated data with simple linear regression as the models, and real world datasets.
  • Recent non-linear feature selection approaches employing greedy optimisation of Centred Kernel Target Alignment(KTA) exhibit strong results in terms of generalisation accuracy and sparsity. However, they are computationally prohibitive for large datasets. We propose randSel, a randomised feature selection algorithm, with attractive scaling properties. Our theoretical analysis of randSel provides strong probabilistic guarantees for correct identification of relevant features. RandSel's characteristics make it an ideal candidate for identifying informative learned representations. We've conducted experimentation to establish the performance of this approach, and present encouraging results, including a 3rd position result in the recent ICML black box learning challenge as well as competitive results for signal peptide prediction, an important problem in bioinformatics.
  • Feature selection plays a pivotal role in learning, particularly in areas were parsimonious features can provide insight into the underlying process, such as biology. Recent approaches for non-linear feature selection employing greedy optimisation of Centred Kernel Target Alignment(KTA), while exhibiting strong results in terms of generalisation accuracy and sparsity, can become computationally prohibitive for high-dimensional datasets. We propose randSel, a randomised feature selection algorithm, with attractive scaling properties. Our theoretical analysis of randSel provides strong probabilistic guarantees for the correct identification of relevant features. Experimental results on real and artificial data, show that the method successfully identifies effective features, performing better than a number of competitive approaches.
  • The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions.
  • Canonical correlation analysis is a statistical technique that is used to find relations between two sets of variables. An important extension in pattern analysis is to consider more than two sets of variables. This problem can be expressed as a quadratically constrained quadratic program (QCQP), commonly referred to Multi-set Canonical Correlation Analysis (MCCA). This is a non-convex problem and so greedy algorithms converge to local optima without any guarantees on global optimality. In this paper, we show that despite being highly structured, finding the optimal solution is NP-Hard. This motivates our relaxation of the QCQP to a semidefinite program (SDP). The SDP is convex, can be solved reasonably efficiently and comes with both absolute and output-sensitive approximation quality. In addition to theoretical guarantees, we do an extensive comparison of the QCQP method and the SDP relaxation on a variety of synthetic and real world data. Finally, we present two useful extensions: we incorporate kernel methods and computing multiple sets of canonical vectors.
  • We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales opening the way for its application to importance weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics, where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0,1] interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma's inequality.
  • Non-negative matrix factorization (NMF) approximates a non-negative matrix $X$ by a product of two non-negative low-rank factor matrices $W$ and $H$. NMF and its extensions minimize either the Kullback-Leibler divergence or the Euclidean distance between $X$ and $W^T H$ to model the Poisson noise or the Gaussian noise. In practice, when the noise distribution is heavy tailed, they cannot perform well. This paper presents Manhattan NMF (MahNMF) which minimizes the Manhattan distance between $X$ and $W^T H$ for modeling the heavy tailed Laplacian noise. Similar to sparse and low-rank matrix decompositions, MahNMF robustly estimates the low-rank part and the sparse part of a non-negative matrix and thus performs effectively when data are contaminated by outliers. We extend MahNMF for various practical applications by developing box-constrained MahNMF, manifold regularized MahNMF, group sparse MahNMF, elastic net inducing MahNMF, and symmetric MahNMF. The major contribution of this paper lies in two fast optimization algorithms for MahNMF and its extensions: the rank-one residual iteration (RRI) method and Nesterov's smoothing method. In particular, by approximating the residual matrix by the outer product of one row of W and one row of $H$ in MahNMF, we develop an RRI method to iteratively update each variable of $W$ and $H$ in a closed form solution. Although RRI is efficient for small scale MahNMF and some of its extensions, it is neither scalable to large scale matrices nor flexible enough to optimize all MahNMF extensions. Since the objective functions of MahNMF and its extensions are neither convex nor smooth, we apply Nesterov's smoothing method to recursively optimize one factor matrix with another matrix fixed. By setting the smoothing parameter inversely proportional to the iteration number, we improve the approximation accuracy iteratively for both MahNMF and its extensions.
  • We develop a new tool for data-dependent analysis of the exploration-exploitation trade-off in learning under limited feedback. Our tool is based on two main ingredients. The first ingredient is a new concentration inequality that makes it possible to control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. The second ingredient is an application of this inequality to the exploration-exploitation trade-off via importance weighted sampling. We apply the new tool to the stochastic multiarmed bandit problem, however, the main importance of this paper is the development and understanding of the new tool rather than improvement of existing algorithms for stochastic multiarmed bandits. In the follow-up work we demonstrate that the new tool can improve over state-of-the-art in structurally richer problems, such as stochastic multiarmed bandits with side information (Seldin et al., 2011a).
  • GRavitational lEnsing Accuracy Testing 2010 (GREAT10) is a public image analysis challenge aimed at the development of algorithms to analyze astronomical images. Specifically, the challenge is to measure varying image distortions in the presence of a variable convolution kernel, pixelization and noise. This is the second in a series of challenges set to the astronomy, computer science and statistics communities, providing a structured environment in which methods can be improved and tested in preparation for planned astronomical surveys. GREAT10 extends upon previous work by introducing variable fields into the challenge. The "Galaxy Challenge" involves the precise measurement of galaxy shape distortions, quantified locally by two parameters called shear, in the presence of a known convolution kernel. Crucially, the convolution kernel and the simulated gravitational lensing shape distortion both now vary as a function of position within the images, as is the case for real data. In addition, we introduce the "Star Challenge" that concerns the reconstruction of a variable convolution kernel, similar to that in a typical astronomical observation. This document details the GREAT10 Challenge for potential participants. Continually updated information is also available from http://www.greatchallenges.info.
  • We propose a method to efficiently construct data-dependent kernels which can make use of large quantities of (unlabeled) data. Our construction makes an approximation in the standard construction of semi-supervised kernels in Sindhwani et al. 2005. In typical cases these kernels can be computed in nearly-linear time (in the amount of data), improving on the cubic time of the standard construction, enabling large scale semi-supervised learning in a variety of contexts. The methods are validated on semi-supervised and unsupervised problems on data sets containing upto 64,000 sample points.
  • We develop a coherent framework for integrative simultaneous analysis of the exploration-exploitation and model order selection trade-offs. We improve over our preceding results on the same subject (Seldin et al., 2011) by combining PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a combination is also of independent interest for studies of multiple simultaneously evolving martingales.
  • We present two alternative ways to apply PAC-Bayesian analysis to sequences of dependent random variables. The first is based on a new lemma that enables to bound expectations of convex functions of certain dependent random variables by expectations of the same functions of independent Bernoulli random variables. This lemma provides an alternative tool to Hoeffding-Azuma inequality to bound concentration of martingale values. Our second approach is based on integration of Hoeffding-Azuma inequality with PAC-Bayesian analysis. We also introduce a way to apply PAC-Bayesian analysis in situation of limited feedback. We combine the new tools to derive PAC-Bayesian generalization and regret bounds for the multiarmed bandit problem. Although our regret bound is not yet as tight as state-of-the-art regret bounds based on other well-established techniques, our results significantly expand the range of potential applications of PAC-Bayesian analysis and introduce a new analysis tool to reinforcement learning and many other fields, where martingales and limited feedback are encountered.
  • We motivate and analyse a new Tree Search algorithm, GPTS, based on recent theoretical advances in the use of Gaussian Processes for Bandit problems. We consider tree paths as arms and we assume the target/reward function is drawn from a GP distribution. The posterior mean and variance, after observing data, are used to define confidence intervals for the function values, and we sequentially play arms with highest upper confidence bounds. We give an efficient implementation of GPTS and we adapt previous regret bounds by determining the decay rate of the eigenvalues of the kernel matrix on the whole set of tree paths. We consider two kernels in the feature space of binary vectors indexed by the nodes of the tree: linear and Gaussian. The regret grows in square root of the number of iterations T, up to a logarithmic factor, with a constant that improves with bigger Gaussian kernel widths. We focus on practical values of T, smaller than the number of arms. Finally, we apply GPTS to Open Loop Planning in discounted Markov Decision Processes by modelling the reward as a discounted sum of independent Gaussian Processes. We report similar regret bounds to those of the OLOP algorithm.
  • Multiple Kernel Learning (MKL) is used to replicate the signal combination process that trading rules embody when they aggregate multiple sources of financial information when predicting an asset's price movements. A set of financially motivated kernels is constructed for the EURUSD currency pair and is used to predict the direction of price movement for the currency over multiple time horizons. MKL is shown to outperform each of the kernels individually in terms of predictive accuracy. Furthermore, the kernel weightings selected by MKL highlights which of the financial features represented by the kernels are the most informative for predictive tasks.