
In many scientific and engineering applications, we are tasked with the
maximisation of an expensive to evaluate black box function $f$. Traditional
settings for this problem assume just the availability of this single function.
However, in many cases, cheap approximations to $f$ may be obtainable. For
example, the expensive real world behaviour of a robot can be approximated by a
cheap computer simulation. We can use these approximations to eliminate low
function value regions cheaply and use the expensive evaluations of $f$ in a
small but promising region and speedily identify the optimum. We formalise this
task as a \emph{multifidelity} bandit problem where the target function and
its approximations are sampled from a Gaussian process. We develop MFGPUCB, a
novel method based on upper confidence bound techniques. In our theoretical
analysis we demonstrate that it exhibits precisely the above behaviour, and
achieves better regret than strategies which ignore multifidelity information.
Empirically, MFGPUCB outperforms such naive strategies and other
multifidelity methods on several synthetic and real experiments.

Bayesian Optimisation (BO) refers to a class of methods for global
optimisation of a function $f$ which is only accessible via point evaluations.
It is typically used in settings where $f$ is expensive to evaluate. A common
use case for BO in machine learning is model selection, where it is not
possible to analytically model the generalisation performance of a statistical
model, and we resort to noisy and expensive training and validation procedures
to choose the best model. Conventional BO methods have focused on Euclidean and
categorical domains, which, in the context of model selection, only permits
tuning scalar hyperparameters of machine learning algorithms. However, with
the surge of interest in deep learning, there is an increasing demand to tune
neural network \emph{architectures}. In this work, we develop NASBOT, a
Gaussian process based BO framework for neural architecture search. To
accomplish this, we develop a distance metric in the space of neural network
architectures which can be computed efficiently via an optimal transport
program. This distance might be of independent interest to the deep learning
community as it may find applications outside of BO. We demonstrate that NASBOT
outperforms other alternatives for architecture search in several cross
validation based model selection tasks on multilayer perceptrons and
convolutional neural networks.

We design and analyse variations of the classical Thompson sampling (TS)
procedure for Bayesian optimisation (BO) in settings where function evaluations
are expensive, but can be performed in parallel. Our theoretical analysis shows
that a direct application of the sequential Thompson sampling algorithm in
either synchronous or asynchronous parallel settings yields a surprisingly
powerful result: making $n$ evaluations distributed among $M$ workers is
essentially equivalent to performing $n$ evaluations in sequence. Further, by
modeling the time taken to complete a function evaluation, we show that, under
a time constraint, asynchronously parallel TS achieves asymptotically lower
regret than both the synchronous and sequential versions. These results are
complemented by an experimental analysis, showing that asynchronous TS
outperforms a suite of existing parallel BO algorithms in simulations and in a
hyperparameter tuning application in convolutional neural networks. In
addition to these, the proposed procedure is conceptually and computationally
much simpler than existing work for parallel BO.

Bandit methods for blackbox optimisation, such as Bayesian optimisation, are
used in a variety of applications including hyperparameter tuning and
experiment design. Recently, \emph{multifidelity} methods have garnered
considerable attention since function evaluations have become increasingly
expensive in such applications. Multifidelity methods use cheap approximations
to the function of interest to speed up the overall optimisation process.
However, most multifidelity methods assume only a finite number of
approximations. In many practical applications however, a continuous spectrum
of approximations might be available. For instance, when tuning an expensive
neural network, one might choose to approximate the cross validation
performance using less data $N$ and/or few training iterations $T$. Here, the
approximations are best viewed as arising out of a continuous two dimensional
space $(N,T)$. In this work, we develop a Bayesian optimisation method, BOCA,
for this setting. We characterise its theoretical properties and show that it
achieves better regret than than strategies which ignore the approximations.
BOCA outperforms several other baselines in synthetic and real experiments.

We study reinforcement learning of chatbots with recurrent neural network
architectures when the rewards are noisy and expensive to obtain. For instance,
a chatbot used in automated customer service support can be scored by quality
assurance agents, but this process can be expensive, time consuming and noisy.
Previous reinforcement learning work for natural language processing uses
onpolicy updates and/or is designed for online learning settings. We
demonstrate empirically that such strategies are not appropriate for this
setting and develop an offpolicy batch policy gradient method (BPG). We
demonstrate the efficacy of our method via a series of synthetic experiments
and an Amazon Mechanical Turk experiment on a restaurant recommendations
dataset.

A common problem in disciplines of applied Statistics research such as
Astrostatistics is of estimating the posterior distribution of relevant
parameters. Typically, the likelihoods for such models are computed via
expensive experiments such as cosmological simulations of the universe. An
urgent challenge in these research domains is to develop methods that can
estimate the posterior with few likelihood evaluations.
In this paper, we study active posterior estimation in a Bayesian setting
when the likelihood is expensive to evaluate. Existing techniques for posterior
estimation are based on generating samples representative of the posterior.
Such methods do not consider efficiency in terms of likelihood evaluations. In
order to be query efficient we treat posterior estimation in an active
regression framework. We propose two myopic query strategies to choose where to
evaluate the likelihood and implement them using Gaussian processes. Via
experiments on a series of synthetic and real examples we demonstrate that our
approach is significantly more query efficient than existing techniques and
other heuristics for posterior estimation.

We study a variant of the classical stochastic $K$armed bandit where
observing the outcome of each arm is expensive, but cheap approximations to
this outcome are available. For example, in online advertising the performance
of an ad can be approximated by displaying it for shorter time periods or to
narrower audiences. We formalise this task as a multifidelity bandit, where,
at each time step, the forecaster may choose to play an arm at any one of $M$
fidelities. The highest fidelity (desired outcome) expends cost
$\lambda^{(m)}$. The $m^{\text{th}}$ fidelity (an approximation) expends
$\lambda^{(m)} < \lambda^{(M)}$ and returns a biased estimate of the highest
fidelity. We develop MFUCB, a novel upper confidence bound procedure for this
setting and prove that it naturally adapts to the sequence of available
approximations and costs thus attaining better regret than naive strategies
which ignore the approximations. For instance, in the above online advertising
example, MFUCB would use the lower fidelities to quickly eliminate suboptimal
ads and reserve the larger expensive experiments on a small set of promising
candidates. We complement this result with a lower bound and show that MFUCB
is nearly optimal under certain conditions.

Recently, there has been a surge of interest in using spectral methods for
estimating latent variable models. However, it is usually assumed that the
distribution of the observations conditioned on the latent variables is either
discrete or belongs to a parametric family. In this paper, we study the
estimation of an $m$state hidden Markov model (HMM) with only smoothness
assumptions, such as H\"olderian conditions, on the emission densities. By
leveraging some recent advances in continuous linear algebra and numerical
analysis, we develop a computationally efficient spectral algorithm for
learning nonparametric HMMs. Our technique is based on computing an SVD on
nonparametric estimates of density functions by viewing them as
\emph{continuous matrices}. We derive sample complexity bounds via
concentration results for nonparametric density estimation and novel
perturbation theory results for continuous matrices. We implement our method
using Chebyshev polynomial approximations. Our method is competitive with other
baselines on synthetic and real problems and is also very computationally
efficient.

High dimensional nonparametric regression is an inherently difficult problem
with known lower bounds depending exponentially in dimension. A popular
strategy to alleviate this curse of dimensionality has been to use additive
models of \emph{first order}, which model the regression function as a sum of
independent functions on each dimension. Though useful in controlling the
variance of the estimate, such models are often too restrictive in practical
settings. Between nonadditive models which often have large variance and first
order additive models which have large bias, there has been little work to
exploit the tradeoff in the middle via additive models of intermediate order.
In this work, we propose SALSA, which bridges this gap by allowing interactions
between variables, but controls model capacity by limiting the order of
interactions. SALSA minimises the residual sum of squares with squared RKHS
norm penalties. Algorithmically, it can be viewed as Kernel Ridge Regression
with an additive kernel. When the regression function is additive, the excess
risk is only polynomial in dimension. Using the GirardNewton formulae, we
efficiently sum over a combinatorial number of terms in the additive expansion.
Via a comparison on $15$ real datasets, we show that our method is competitive
against $21$ other alternatives.

Bayesian Optimisation (BO) is a technique used in optimising a
$D$dimensional function which is typically expensive to evaluate. While there
have been many successes for BO in low dimensions, scaling it to high
dimensions has been notoriously difficult. Existing literature on the topic are
under very restrictive settings. In this paper, we identify two key challenges
in this endeavour. We tackle these challenges by assuming an additive structure
for the function. This setting is substantially more expressive and contains a
richer class of functions than previous work. We prove that, for additive
functions the regret has only linear dependence on $D$ even though the function
depends on all $D$ dimensions. We also demonstrate several other statistical
and computational benefits in our framework. Via synthetic examples, a
scientific simulation and a face detection problem we demonstrate that our
method outperforms naive BO on additive functions and on several examples where
the function is not additive.

We propose and analyze estimators for statistical functionals of one or more
distributions under nonparametric assumptions. Our estimators are based on the
theory of influence functions, which appear in the semiparametric statistics
literature. We show that estimators based either on datasplitting or a
leaveoneout technique enjoy fast rates of convergence and other favorable
theoretical properties. We apply this framework to derive estimators for
several popular information theoretic quantities, and via empirical evaluation,
show the advantage of this approach over existing estimators.

We give a comprehensive theoretical characterization of a nonparametric
estimator for the $L_2^2$ divergence between two continuous distributions. We
first bound the rate of convergence of our estimator, showing that it is
$\sqrt{n}$consistent provided the densities are sufficiently smooth. In this
smooth regime, we then show that our estimator is asymptotically normal,
construct asymptotic confidence intervals, and establish a BerryEss\'{e}en
style inequality characterizing the rate of convergence to normality. We also
show that this estimator is minimax optimal.

We consider nonparametric estimation of $L_2$, Renyi$\alpha$ and
Tsallis$\alpha$ divergences between continuous distributions. Our approach is
to construct estimators for particular integral functionals of two densities
and translate them into divergence estimators. For the integral functionals,
our estimators are based on corrections of a preliminary plugin estimator. We
show that these estimators achieve the parametric convergence rate of
$n^{1/2}$ when the densities' smoothness, $s$, are both at least $d/4$ where
$d$ is the dimension. We also derive minimax lower bounds for this problem
which confirm that $s > d/4$ is necessary to achieve the $n^{1/2}$ rate of
convergence. We validate our theoretical guarantees with a number of
simulations.