
In many scientific and engineering applications, we are tasked with the
maximisation of an expensive to evaluate black box function $f$. Traditional
settings for this problem assume just the availability of this single function.
However, in many cases, cheap approximations to $f$ may be obtainable. For
example, the expensive real world behaviour of a robot can be approximated by a
cheap computer simulation. We can use these approximations to eliminate low
function value regions cheaply and use the expensive evaluations of $f$ in a
small but promising region and speedily identify the optimum. We formalise this
task as a \emph{multifidelity} bandit problem where the target function and
its approximations are sampled from a Gaussian process. We develop MFGPUCB, a
novel method based on upper confidence bound techniques. In our theoretical
analysis we demonstrate that it exhibits precisely the above behaviour, and
achieves better regret than strategies which ignore multifidelity information.
Empirically, MFGPUCB outperforms such naive strategies and other
multifidelity methods on several synthetic and real experiments.

We develop an automated technique for detecting damped Lyman$\alpha$
absorbers (DLAs) along spectroscopic lines of sight to quasistellar objects
(QSOs or quasars). The detection of DLAs in largescale spectroscopic surveys
such as SDSSIII sheds light on galaxy formation at high redshift, showing the
nucleation of galaxies from diffuse gas. We use nearly 50 000 QSO spectra to
learn a novel tailored Gaussian process model for quasar emission spectra,
which we apply to the DLA detection problem via Bayesian model selection. We
propose models for identifying an arbitrary number of DLAs along a given line
of sight. We demonstrate our method's effectiveness using a largescale
validation experiment, with excellent performance. We also provide a catalog of
our results applied to 162 858 spectra from SDSSIII data release 12.

The fundamental task of general density estimation $p(x)$ has been of keen
interest to machine learning. In this work, we attempt to systematically
characterize methods for density estimation. Broadly speaking, most of the
existing methods can be categorized into either using: \textit{a})
autoregressive models to estimate the conditional factors of the chain rule,
$p(x_{i}\, \, x_{i1}, \ldots)$; or \textit{b}) nonlinear transformations of
variables of a simple base distribution. Based on the study of the
characteristics of these categories, we propose multiple novel methods for each
category. For example we proposed RNN based transformations to model
nonMarkovian dependencies. Further, through a comprehensive study over both
real world and synthetic data, we show for that jointly leveraging
transformations of variables and autoregressive conditional models, results in
a considerable improvement in performance. We illustrate the use of our models
in outlier detection and image modeling. Finally we introduce a novel data
driven framework for learning a family of distributions.

Bayesian Optimisation (BO) refers to a class of methods for global
optimisation of a function $f$ which is only accessible via point evaluations.
It is typically used in settings where $f$ is expensive to evaluate. A common
use case for BO in machine learning is model selection, where it is not
possible to analytically model the generalisation performance of a statistical
model, and we resort to noisy and expensive training and validation procedures
to choose the best model. Conventional BO methods have focused on Euclidean and
categorical domains, which, in the context of model selection, only permits
tuning scalar hyperparameters of machine learning algorithms. However, with
the surge of interest in deep learning, there is an increasing demand to tune
neural network \emph{architectures}. In this work, we develop NASBOT, a
Gaussian process based BO framework for neural architecture search. To
accomplish this, we develop a distance metric in the space of neural network
architectures which can be computed efficiently via an optimal transport
program. This distance might be of independent interest to the deep learning
community as it may find applications outside of BO. We demonstrate that NASBOT
outperforms other alternatives for architecture search in several cross
validation based model selection tasks on multilayer perceptrons and
convolutional neural networks.

Kernel methods are ubiquitous tools in machine learning. However, there is
often little reason for the common practice of selecting a kernel a priori.
Even if a universal approximating kernel is selected, the quality of the finite
sample estimator may be greatly affected by the choice of kernel. Furthermore,
when directly applying kernel methods, one typically needs to compute a $N
\times N$ Gram matrix of pairwise kernel evaluations to work with a dataset of
$N$ instances. The computation of this Gram matrix precludes the direct
application of kernel methods on large datasets, and makes kernel learning
especially difficult. In this paper we introduce Bayesian nonparmetric
kernellearning (BaNK), a generic, datadriven framework for scalable learning
of kernels. BaNK places a nonparametric prior on the spectral distribution of
random frequencies allowing it to both learn kernels and scale to large
datasets. We show that this framework can be used for large scale regression
and classification tasks. Furthermore, we show that BaNK outperforms several
other scalable approaches for kernel learning on a variety of real world
datasets.

Active Search has become an increasingly useful tool in information retrieval
problems where the goal is to discover as many target elements as possible
using only limited label queries. With the advent of big data, there is a
growing emphasis on the scalability of such techniques to handle very large and
very complex datasets.
In this paper, we consider the problem of Active Search where we are given a
similarity function between data points. We look at an algorithm introduced by
Wang et al. [2013] for Active Search over graphs and propose crucial
modifications which allow it to scale significantly. Their approach selects
points by minimizing an energy function over the graph induced by the
similarity function on the data. Our modifications require the similarity
function to be a dotproduct between feature vectors of data points, equivalent
to having a linear kernel for the adjacency matrix. With this, we are able to
scale tremendously: for $n$ data points, the original algorithm runs in
$O(n^2)$ time per iteration while ours runs in only $O(nr + r^2)$ given
$r$dimensional features.
We also describe a simple alternate approach using a weightedneighbor
predictor which also scales well. In our experiments, we show that our method
is competitive with existing semisupervised approaches. We also briefly
discuss conditions under which our algorithm performs well.

We propose to study equivariance in deep neural networks through parameter
symmetries. In particular, given a group $\mathcal{G}$ that acts discretely on
the input and output of a standard neural network layer $\phi_{W}: \Re^{M} \to
\Re^{N}$, we show that $\phi_{W}$ is equivariant with respect to
$\mathcal{G}$action iff $\mathcal{G}$ explains the symmetries of the network
parameters $W$. Inspired by this observation, we then propose two
parametersharing schemes to induce the desirable symmetry on $W$. Our
procedures for tying the parameters achieve $\mathcal{G}$equivariance and,
under some conditions on the action of $\mathcal{G}$, they guarantee
sensitivity to all other permutation groups outside $\mathcal{G}$.

This paper presents the recurrent estimation of distributions (RED) for
modeling realvalued data in a semiparametric fashion. RED models make two
novel uses of recurrent neural networks (RNNs) for density estimation of
general realvalued data. First, RNNs are used to transform input covariates
into a latent space to better capture conditional dependencies in inputs.
After, an RNN is used to compute the conditional distributions of the latent
covariates. The resulting model is efficient to train, compute, and sample
from, whilst producing normalized pdfs. The effectiveness of RED is shown via
several realworld data experiments. Our results show that RED models achieve a
lower heldout negative loglikelihood than other neural network approaches
across multiple dataset sizes and dimensionalities. Further context of the
efficacy of RED is provided by considering anomaly detection tasks, where we
also observe better performance over alternative models.

We design and analyse variations of the classical Thompson sampling (TS)
procedure for Bayesian optimisation (BO) in settings where function evaluations
are expensive, but can be performed in parallel. Our theoretical analysis shows
that a direct application of the sequential Thompson sampling algorithm in
either synchronous or asynchronous parallel settings yields a surprisingly
powerful result: making $n$ evaluations distributed among $M$ workers is
essentially equivalent to performing $n$ evaluations in sequence. Further, by
modeling the time taken to complete a function evaluation, we show that, under
a time constraint, asynchronously parallel TS achieves asymptotically lower
regret than both the synchronous and sequential versions. These results are
complemented by an experimental analysis, showing that asynchronous TS
outperforms a suite of existing parallel BO algorithms in simulations and in a
hyperparameter tuning application in convolutional neural networks. In
addition to these, the proposed procedure is conceptually and computationally
much simpler than existing work for parallel BO.

Bandit methods for blackbox optimisation, such as Bayesian optimisation, are
used in a variety of applications including hyperparameter tuning and
experiment design. Recently, \emph{multifidelity} methods have garnered
considerable attention since function evaluations have become increasingly
expensive in such applications. Multifidelity methods use cheap approximations
to the function of interest to speed up the overall optimisation process.
However, most multifidelity methods assume only a finite number of
approximations. In many practical applications however, a continuous spectrum
of approximations might be available. For instance, when tuning an expensive
neural network, one might choose to approximate the cross validation
performance using less data $N$ and/or few training iterations $T$. Here, the
approximations are best viewed as arising out of a continuous two dimensional
space $(N,T)$. In this work, we develop a Bayesian optimisation method, BOCA,
for this setting. We characterise its theoretical properties and show that it
achieves better regret than than strategies which ignore the approximations.
BOCA outperforms several other baselines in synthetic and real experiments.

Sophisticated gated recurrent neural network architectures like LSTMs and
GRUs have been shown to be highly effective in a myriad of applications. We
develop an ungated unit, the statistical recurrent unit (SRU), that is able to
learn long term dependencies in data by only keeping moving averages of
statistics. The SRU's architecture is simple, ungated, and contains a
comparable number of parameters to LSTMs; yet, SRUs perform favorably to more
sophisticated LSTM and GRU alternatives, often outperforming one or both in
various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an
unbiased manner by optimizing respective architectures' hyperparameters in a
Bayesian optimization scheme for both synthetic and realworld tasks.

We introduce a simple permutation equivariant layer for deep learning with
set structure.This type of layer, obtained by parametersharing, has a simple
implementation and lineartime complexity in the size of each set. We use deep
permutationinvariant networks to perform pointcould classification and
MNISTdigit summation, where in both cases the output is invariant to
permutations of the input. In a semisupervised setting, where the goal is make
predictions for each instance within a set, we demonstrate the usefulness of
this type of layer in setoutlier detection as well as semisupervised learning
with clustering sideinformation.

A common problem in disciplines of applied Statistics research such as
Astrostatistics is of estimating the posterior distribution of relevant
parameters. Typically, the likelihoods for such models are computed via
expensive experiments such as cosmological simulations of the universe. An
urgent challenge in these research domains is to develop methods that can
estimate the posterior with few likelihood evaluations.
In this paper, we study active posterior estimation in a Bayesian setting
when the likelihood is expensive to evaluate. Existing techniques for posterior
estimation are based on generating samples representative of the posterior.
Such methods do not consider efficiency in terms of likelihood evaluations. In
order to be query efficient we treat posterior estimation in an active
regression framework. We propose two myopic query strategies to choose where to
evaluate the likelihood and implement them using Gaussian processes. Via
experiments on a series of synthetic and real examples we demonstrate that our
approach is significantly more query efficient than existing techniques and
other heuristics for posterior estimation.

Autonomous systems can be used to search for sparse signals in a large space;
e.g., aerial robots can be deployed to localize threats, detect gas leaks, or
respond to distress calls. Intuitively, search algorithms may increase
efficiency by collecting aggregate measurements summarizing large contiguous
regions. However, most existing search methods either ignore the possibility of
such region observations (e.g., Bayesian optimization and multiarmed bandits)
or make strong assumptions about the sensing mechanism that allow each
measurement to arbitrarily encode all signals in the entire environment (e.g.,
compressive sensing). We propose an algorithm that actively collects data to
search for sparse signals using only noisy measurements of the average values
on rectangular regions (including single points), based on the greedy
maximization of information gain. We analyze our algorithm in 1d and show that
it requires $\tilde{O}(\frac{n}{\mu^2}+k^2)$ measurements to recover all of $k$
signal locations with small Bayes error, where $\mu$ and $n$ are the signal
strength and the size of the search space, respectively. We also show that
active designs can be fundamentally more efficient than passive designs with
region sensing, contrasting with the results of AriasCastro, Candes, and
Davenport (2013). We demonstrate the empirical performance of our algorithm on
a search problem using satellite image data and in high dimensions.

Understanding the nature of dark energy, the mysterious force driving the
accelerated expansion of the Universe, is a major challenge of modern
cosmology. The next generation of cosmological surveys, specifically designed
to address this issue, rely on accurate measurements of the apparent shapes of
distant galaxies. However, shape measurement methods suffer from various
unavoidable biases and therefore will rely on a precise calibration to meet the
accuracy requirements of the science analysis. This calibration process remains
an open challenge as it requires large sets of high quality galaxy images. To
this end, we study the application of deep conditional generative models in
generating realistic galaxy images. In particular we consider variations on
conditional variational autoencoder and introduce a new adversarial objective
for training of conditional generative networks. Our results suggest a reliable
alternative to the acquisition of expensive high quality observations for
generating the calibration data needed by the next generation of cosmological
surveys.

We study a variant of the classical stochastic $K$armed bandit where
observing the outcome of each arm is expensive, but cheap approximations to
this outcome are available. For example, in online advertising the performance
of an ad can be approximated by displaying it for shorter time periods or to
narrower audiences. We formalise this task as a multifidelity bandit, where,
at each time step, the forecaster may choose to play an arm at any one of $M$
fidelities. The highest fidelity (desired outcome) expends cost
$\lambda^{(m)}$. The $m^{\text{th}}$ fidelity (an approximation) expends
$\lambda^{(m)} < \lambda^{(M)}$ and returns a biased estimate of the highest
fidelity. We develop MFUCB, a novel upper confidence bound procedure for this
setting and prove that it naturally adapts to the sequence of available
approximations and costs thus attaining better regret than naive strategies
which ignore the approximations. For instance, in the above online advertising
example, MFUCB would use the lower fidelities to quickly eliminate suboptimal
ads and reserve the larger expensive experiments on a small set of promising
candidates. We complement this result with a lower bound and show that MFUCB
is nearly optimal under certain conditions.

We propose a Laplace approximation that creates a stochastic unit from any
smooth monotonic activation function, using only Gaussian noise. This paper
investigates the application of this stochastic approximation in training a
family of Restricted Boltzmann Machines (RBM) that are closely linked to
Bregman divergences. This family, that we call exponential family RBM
(ExpRBM), is a subset of the exponential family Harmoniums that expresses
family members through a choice of smooth monotonic nonlinearity for each
neuron. Using contrastive divergence along with our Gaussian approximation, we
show that ExpRBM can learn useful representations using novel stochastic
units.

Bayesian Optimisation (BO) is a technique used in optimising a
$D$dimensional function which is typically expensive to evaluate. While there
have been many successes for BO in low dimensions, scaling it to high
dimensions has been notoriously difficult. Existing literature on the topic are
under very restrictive settings. In this paper, we identify two key challenges
in this endeavour. We tackle these challenges by assuming an additive structure
for the function. This setting is substantially more expressive and contains a
richer class of functions than previous work. We prove that, for additive
functions the regret has only linear dependence on $D$ even though the function
depends on all $D$ dimensions. We also demonstrate several other statistical
and computational benefits in our framework. Via synthetic examples, a
scientific simulation and a face detection problem we demonstrate that our
method outperforms naive BO on additive functions and on several examples where
the function is not additive.

Total energies of crystal structures can be calculated to high precision
using quantumbased density functional theory (DFT) methods, but the
calculations can be time consuming and scale badly with system size. Cluster
expansions of total energy as a linear superposition of pair, triplet and
higher interactions can efficiently approximate the total energies but are best
suited to simple lattice structures. To model the total energy of boron
carbide, with a complex crystal structure, we explore the utility of machine
learning methods ($L_1$penalized regression, neural network, Gaussian process
and support vector regression) that capture certain nonlinear effects
associated with manybody interactions despite requiring only pair frequencies
as input. Our interaction models are combined with Monte Carlo simulations to
evaluate the thermodynamics of chemical ordering.

The use of distributions and highlevel features from deep architecture has
become commonplace in modern computer vision. Both of these methodologies have
separately achieved a great deal of success in many computer vision tasks.
However, there has been little work attempting to leverage the power of these
to methodologies jointly. To this end, this paper presents the Deep Mean Maps
(DMMs) framework, a novel family of methods to nonparametrically represent
distributions of features in convolutional neural network models.
DMMs are able to both classify images using the distribution of toplevel
features, and to tune the toplevel features for performing this task. We show
how to implement DMMs using a special mean map layer composed of typical CNN
operations, making both forward and backward propagation simple.
We illustrate the efficacy of DMMs at analyzing distributional patterns in
image data in a synthetic data experiment. We also show that we extending
existing deep architectures with DMMs improves the performance of existing CNNs
on several challenging realworld datasets.

Many interesting machine learning problems are best posed by considering
instances that are distributions, or sample sets drawn from distributions.
Previous work devoted to machine learning tasks with distributional inputs has
done so through pairwise kernel evaluations between pdfs (or sample sets).
While such an approach is fine for smaller datasets, the computation of an $N
\times N$ Gram matrix is prohibitive in large datasets. Recent scalable
estimators that work over pdfs have done so only with kernels that use
Euclidean metrics, like the $L_2$ distance. However, there are a myriad of
other useful metrics available, such as total variation, Hellinger distance,
and the JensenShannon divergence. This work develops the first random features
for pdfs whose dot product approximates kernels using these nonEuclidean
metrics, allowing estimators using such kernels to scale to large datasets by
working in a primal space, without computing large Gram matrices. We provide an
analysis of the approximation error in using our proposed random features and
show empirically the quality of our approximation both in estimating a Gram
matrix and in solving learning tasks in realworld and synthetic data.

Kernel methods give powerful, flexible, and theoretically grounded approaches
to solving many problems in machine learning. The standard approach, however,
requires pairwise evaluations of a kernel function, which can lead to
scalability issues for very large datasets. Rahimi and Recht (2007) suggested a
popular approach to handling this problem, known as random Fourier features.
The quality of this approximation, however, is not well understood. We improve
the uniform error bound of that paper, as well as giving novel understandings
of the embedding's variance, approximation error, and use in some machine
learning methods. We also point out that surprisingly, of the two main variants
of those features, the more widely used is strictly highervariance for the
Gaussian kernel and has worse bounds.

We present a modern machine learning approach for cluster dynamical mass
measurements that is a factor of two improvement over using a conventional
scaling relation. Different methods are tested against a mock cluster catalog
constructed using halos with mass >= 10^14 Msolar/h from Multidark's
publiclyavailable Nbody MDPL halo catalog. In the conventional method, we use
a standard M(sigma_v) power law scaling relation to infer cluster mass, M, from
lineofsight (LOS) galaxy velocity dispersion, sigma_v. The resulting
fractional mass error distribution is broad, with width=0.87 (68% scatter), and
has extended higherror tails. The standard scaling relation can be simply
enhanced by including higherorder moments of the LOS velocity distribution.
Applying the kurtosis as a correction term to log(sigma_v) reduces the width of
the error distribution to 0.74 (16% improvement). Machine learning can be used
to take full advantage of all the information in the velocity distribution. We
employ the Support Distribution Machines (SDMs) algorithm that learns from
distributions of data to predict single values. SDMs trained and tested on the
distribution of LOS velocities yield width=0.46 (47% improvement). Furthermore,
the problematic tails of the mass error distribution are effectively
eliminated. Decreasing cluster mass errors will improve measurements of the
growth of structure and lead to tighter constraints on cosmological parameters.

We analyze the problem of regression when both input covariates and output
responses are functions from a nonparametric function class. Function to
function regression (FFR) covers a large range of interesting applications
including timeseries prediction problems, and also more general tasks like
studying a mapping between two separate types of distributions. However,
previous nonparametric estimators for FFR type problems scale badly
computationally with the number of input/output pairs in a dataset. Given the
complexity of a mapping between general functions it may be necessary to
consider large datasets in order to achieve a low estimation risk. To address
this issue, we develop a novel scalable nonparametric estimator, the
TripleBasis Estimator (3BE), which is capable of operating over datasets with
many instances. To the best of our knowledge, the 3BE is the first
nonparametric FFR estimator that can scale to massive datasets. We analyze the
3BE's risk and derive an upperbound rate. Furthermore, we show an improvement
of several orders of magnitude in terms of prediction speed and a reduction in
error over previous estimators in various realworld datasets.

We consider the problem of inferring constraints on a highdimensional
parameter space with a computationally expensive likelihood function. We
propose a machine learning algorithm that maps out the Frequentist confidence
limit on parameter space by intelligently targeting likelihood evaluations so
as to quickly and accurately characterize the likelihood surface in both low
and highlikelihood regions. We compare our algorithm to Bayesian credible
limits derived by the welltested Markov Chain Monte Carlo (MCMC) algorithm
using both multimodal toy likelihood functions and the 7year WMAP cosmic
microwave background likelihood function. We find that our algorithm correctly
identifies the location, general size, and general shape of highlikelihood
regions in parameter space while being more robust against multimodality than
MCMC.