
In many scientific and engineering applications, we are tasked with the
maximisation of an expensive to evaluate black box function $f$. Traditional
settings for this problem assume just the availability of this single function.
However, in many cases, cheap approximations to $f$ may be obtainable. For
example, the expensive real world behaviour of a robot can be approximated by a
cheap computer simulation. We can use these approximations to eliminate low
function value regions cheaply and use the expensive evaluations of $f$ in a
small but promising region and speedily identify the optimum. We formalise this
task as a \emph{multifidelity} bandit problem where the target function and
its approximations are sampled from a Gaussian process. We develop MFGPUCB, a
novel method based on upper confidence bound techniques. In our theoretical
analysis we demonstrate that it exhibits precisely the above behaviour, and
achieves better regret than strategies which ignore multifidelity information.
Empirically, MFGPUCB outperforms such naive strategies and other
multifidelity methods on several synthetic and real experiments.

We study the problem of designing models for machine learning tasks defined
on \emph{sets}. In contrast to traditional approach of operating on fixed
dimensional vectors, we consider objective functions defined on sets that are
invariant to permutations. Such problems are widespread, ranging from
estimation of population statistics \cite{poczos13aistats}, to anomaly
detection in piezometer data of embankment dams \cite{Jung15Exploration}, to
cosmology \cite{Ntampaka16Dynamical,Ravanbakhsh16ICML1}. Our main theorem
characterizes the permutation invariant functions and provides a family of
functions to which any permutation invariant objective function must belong.
This family of functions has a special structure which enables us to design a
deep network architecture that can operate on sets and which can be deployed on
a variety of scenarios including both unsupervised and supervised learning
tasks. We also derive the necessary and sufficient conditions for permutation
equivariance in deep models. We demonstrate the applicability of our method on
population statistic estimation, point cloud classification, set expansion, and
outlier detection.

We study estimation of (semi)inner products between two nonparametric
probability distributions, given IID samples from each distribution. These
products include relatively wellstudied classical $\mathcal{L}^2$ and Sobolev
inner products, as well as those induced by translationinvariant reproducing
kernels, for which we believe our results are the first. We first propose
estimators for these quantities, and the induced (semi)norms and
(pseudo)metrics. We then prove nonasymptotic upper bounds on their mean
squared error, in terms of weights both of the inner product and of the two
distributions, in the Fourier basis. Finally, we prove minimax lower bounds
that imply rateoptimality of the proposed estimators over Fourier ellipsoids.

The Wasserstein metric is an important measure of distance between
probability distributions, with several applications in machine learning,
statistics, probability theory, and data analysis. In this paper, we upper and
lower bound minimax rates for the problem of estimating a probability
distribution under Wasserstein loss, in terms of metric properties, such as
covering and packing numbers, of the underlying sample space.

Sparse dictionary learning (SDL) has become a popular method for adaptively
identifying parsimonious representations of a dataset, a fundamental problem in
machine learning and signal processing. While most work on SDL assumes a
training dataset of independent and identically distributed samples, a variant
known as convolutional sparse dictionary learning (CSDL) relaxes this
assumption, allowing more general sequential data sources, such as time series
or other dependent data. Although recent work has explored the statistical
properties of classical SDL, the statistical properties of CSDL remain
unstudied. This paper begins to study this by identifying the minimax
convergence rate of CSDL in terms of reconstruction risk, by both upper
bounding the risk of an established CSDL estimator and proving a matching
informationtheoretic lower bound. Our results indicate that consistency in
reconstruction risk is possible precisely in the `ultrasparse' setting, in
which the sparsity (i.e., the number of feature occurrences) is in $o(N)$ in
terms of the length N of the training sequence. Notably, our results make very
weak assumptions, allowing arbitrary dictionaries and dependent measurement
noise. Finally, we verify our theoretical results with numerical experiments on
synthetic data.

A major challenge in understanding the generalization of deep learning is to
explain why (stochastic) gradient descent can exploit the network architecture
to find solutions that have good generalization performance when using high
capacity models. We find simple but realistic examples showing that this
phenomenon exists even when learning linear classifiers  between two linear
networks with the same capacity, the one with a convolutional layer can
generalize better than the other when the data distribution has some underlying
spatial structure. We argue that this difference results from a combination of
the convolution architecture, data distribution and gradient descent, all of
which are necessary to be included in a meaningful analysis. We provide a
general analysis of the generalization performance as a function of data
distribution and convolutional filter size, given gradient descent as the
optimization algorithm, then interpret the results using concrete examples.
Experimental results show that our analysis is able to explain what happens in
our introduced examples.

The fundamental task of general density estimation $p(x)$ has been of keen
interest to machine learning. In this work, we attempt to systematically
characterize methods for density estimation. Broadly speaking, most of the
existing methods can be categorized into either using: \textit{a})
autoregressive models to estimate the conditional factors of the chain rule,
$p(x_{i}\, \, x_{i1}, \ldots)$; or \textit{b}) nonlinear transformations of
variables of a simple base distribution. Based on the study of the
characteristics of these categories, we propose multiple novel methods for each
category. For example we proposed RNN based transformations to model
nonMarkovian dependencies. Further, through a comprehensive study over both
real world and synthetic data, we show for that jointly leveraging
transformations of variables and autoregressive conditional models, results in
a considerable improvement in performance. We illustrate the use of our models
in outlier detection and image modeling. Finally we introduce a novel data
driven framework for learning a family of distributions.

Bayesian Optimisation (BO) refers to a class of methods for global
optimisation of a function $f$ which is only accessible via point evaluations.
It is typically used in settings where $f$ is expensive to evaluate. A common
use case for BO in machine learning is model selection, where it is not
possible to analytically model the generalisation performance of a statistical
model, and we resort to noisy and expensive training and validation procedures
to choose the best model. Conventional BO methods have focused on Euclidean and
categorical domains, which, in the context of model selection, only permits
tuning scalar hyperparameters of machine learning algorithms. However, with
the surge of interest in deep learning, there is an increasing demand to tune
neural network \emph{architectures}. In this work, we develop NASBOT, a
Gaussian process based BO framework for neural architecture search. To
accomplish this, we develop a distance metric in the space of neural network
architectures which can be computed efficiently via an optimal transport
program. This distance might be of independent interest to the deep learning
community as it may find applications outside of BO. We demonstrate that NASBOT
outperforms other alternatives for architecture search in several cross
validation based model selection tasks on multilayer perceptrons and
convolutional neural networks.

Kernel methods are ubiquitous tools in machine learning. However, there is
often little reason for the common practice of selecting a kernel a priori.
Even if a universal approximating kernel is selected, the quality of the finite
sample estimator may be greatly affected by the choice of kernel. Furthermore,
when directly applying kernel methods, one typically needs to compute a $N
\times N$ Gram matrix of pairwise kernel evaluations to work with a dataset of
$N$ instances. The computation of this Gram matrix precludes the direct
application of kernel methods on large datasets, and makes kernel learning
especially difficult. In this paper we introduce Bayesian nonparmetric
kernellearning (BaNK), a generic, datadriven framework for scalable learning
of kernels. BaNK places a nonparametric prior on the spectral distribution of
random frequencies allowing it to both learn kernels and scale to large
datasets. We show that this framework can be used for large scale regression
and classification tasks. Furthermore, we show that BaNK outperforms several
other scalable approaches for kernel learning on a variety of real world
datasets.

A central challenge to using firstorder methods for optimizing nonconvex
problems is the presence of saddle points. Firstorder methods often get stuck
at saddle points, greatly deteriorating their performance. Typically, to escape
from saddles one has to use secondorder methods. However, most works on
secondorder methods rely extensively on expensive Hessianbased computations,
making them impractical in largescale settings. To tackle this challenge, we
introduce a generic framework that minimizes Hessian based computations while
at the same time provably converging to secondorder critical points. Our
framework carefully alternates between a firstorder and a secondorder
subroutine, using the latter only close to saddle points, and yields
convergence results competitive to the stateoftheart. Empirical results
suggest that our strategy also enjoys a good practical performance.

We propose to study equivariance in deep neural networks through parameter
symmetries. In particular, given a group $\mathcal{G}$ that acts discretely on
the input and output of a standard neural network layer $\phi_{W}: \Re^{M} \to
\Re^{N}$, we show that $\phi_{W}$ is equivariant with respect to
$\mathcal{G}$action iff $\mathcal{G}$ explains the symmetries of the network
parameters $W$. Inspired by this observation, we then propose two
parametersharing schemes to induce the desirable symmetry on $W$. Our
procedures for tying the parameters achieve $\mathcal{G}$equivariance and,
under some conditions on the action of $\mathcal{G}$, they guarantee
sensitivity to all other permutation groups outside $\mathcal{G}$.

This paper presents the recurrent estimation of distributions (RED) for
modeling realvalued data in a semiparametric fashion. RED models make two
novel uses of recurrent neural networks (RNNs) for density estimation of
general realvalued data. First, RNNs are used to transform input covariates
into a latent space to better capture conditional dependencies in inputs.
After, an RNN is used to compute the conditional distributions of the latent
covariates. The resulting model is efficient to train, compute, and sample
from, whilst producing normalized pdfs. The effectiveness of RED is shown via
several realworld data experiments. Our results show that RED models achieve a
lower heldout negative loglikelihood than other neural network approaches
across multiple dataset sizes and dimensionalities. Further context of the
efficacy of RED is provided by considering anomaly detection tasks, where we
also observe better performance over alternative models.

Although gradient descent (GD) almost always escapes saddle points
asymptotically [Lee et al., 2016], this paper shows that even with fairly
natural random initialization schemes and nonpathological functions, GD can be
significantly slowed down by saddle points, taking exponential time to escape.
On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et
al., 2017] is not slowed down by saddle points  it can find an approximate
local minimizer in polynomial time. This result implies that GD is inherently
slower than perturbed GD, and justifies the importance of adding perturbations
for efficient nonconvex optimization. While our focus is theoretical, we also
present experiments that illustrate our theoretical findings.

We consider the Hypothesis Transfer Learning (HTL) problem where one
incorporates a hypothesis trained on the source domain into the learning
procedure of the target domain. Existing theoretical analysis either only
studies specific algorithms or only presents upper bounds on the generalization
error but not on the excess risk. In this paper, we propose a unified
algorithmdependent framework for HTL through a novel notion of transformation
function, which characterizes the relation between the source and the target
domains. We conduct a general risk analysis of this framework and in
particular, we show for the first time, if two domains are related, HTL enjoys
faster convergence rates of excess risks for Kernel Smoothing and Kernel Ridge
Regression than those of the classical nontransfer learning settings.
Experiments on real world data demonstrate the effectiveness of our framework.

We design and analyse variations of the classical Thompson sampling (TS)
procedure for Bayesian optimisation (BO) in settings where function evaluations
are expensive, but can be performed in parallel. Our theoretical analysis shows
that a direct application of the sequential Thompson sampling algorithm in
either synchronous or asynchronous parallel settings yields a surprisingly
powerful result: making $n$ evaluations distributed among $M$ workers is
essentially equivalent to performing $n$ evaluations in sequence. Further, by
modeling the time taken to complete a function evaluation, we show that, under
a time constraint, asynchronously parallel TS achieves asymptotically lower
regret than both the synchronous and sequential versions. These results are
complemented by an experimental analysis, showing that asynchronous TS
outperforms a suite of existing parallel BO algorithms in simulations and in a
hyperparameter tuning application in convolutional neural networks. In
addition to these, the proposed procedure is conceptually and computationally
much simpler than existing work for parallel BO.

Generative moment matching network (GMMN) is a deep generative model that
differs from Generative Adversarial Network (GAN) by replacing the
discriminator in GAN with a twosample test based on kernel maximum mean
discrepancy (MMD). Although some theoretical guarantees of MMD have been
studied, the empirical performance of GMMN is still not as competitive as that
of GAN on challenging and large benchmark datasets. The computational
efficiency of GMMN is also less desirable in comparison with GAN, partially due
to its requirement for a rather large batch size during the training. In this
paper, we propose to improve both the model expressiveness of GMMN and its
computational efficiency by introducing adversarial kernel learning techniques,
as the replacement of a fixed Gaussian kernel in the original GMMN. The new
approach combines the key ideas in both GMMN and GAN, hence we name it MMDGAN.
The new distance measure in MMDGAN is a meaningful loss that enjoys the
advantage of weak topology and can be optimized via gradient descent with
relatively small batch sizes. In our evaluation on multiple benchmark datasets,
including MNIST, CIFAR 10, CelebA and LSUN, the performance of MMDGAN
significantly outperforms GMMN, and is competitive with other representative
GAN works.

Largescale kernel approximation is an important problem in machine learning
research. Approaches using random Fourier features have become increasingly
popular [Rahimi and Recht, 2007], where kernel approximation is treated as
empirical mean estimation via Monte Carlo (MC) or QuasiMonte Carlo (QMC)
integration [Yang et al., 2014]. A limitation of the current approaches is that
all the features receive an equal weight summing to 1. In this paper, we
propose a novel shrinkage estimator from "Stein effect", which provides a
datadriven weighting strategy for random features and enjoys theoretical
justifications in terms of lowering the empirical risk. We further present an
efficient randomized algorithm for largescale applications of the proposed
method. Our empirical results on six benchmark data sets demonstrate the
advantageous performance of this approach over representative baselines in both
kernel approximation and supervised learning tasks.

While deep learning methods have achieved stateoftheart performance in
many challenging inverse problems like image inpainting and superresolution,
they invariably involve problemspecific training of the networks. Under this
approach, different problems require different networks. In scenarios where we
need to solve a wide variety of problems, e.g., on a mobile camera, it is
inefficient and costly to use these speciallytrained networks. On the other
hand, traditional methods using signal priors can be used in all linear inverse
problems but often have worse performance on challenging tasks. In this work,
we provide a middle ground between the two kinds of methods  we propose a
general framework to train a single deep neural network that solves arbitrary
linear inverse problems. The proposed network acts as a proximal operator for
an optimization algorithm and projects nonimage signals onto the set of
natural images defined by the decision boundary of a classifier. In our
experiments, the proposed framework demonstrates superior performance over
traditional methods using a wavelet sparsity prior and achieves comparable
performance of speciallytrained networks on tasks including compressive
sensing and pixelwise inpainting.

Bandit methods for blackbox optimisation, such as Bayesian optimisation, are
used in a variety of applications including hyperparameter tuning and
experiment design. Recently, \emph{multifidelity} methods have garnered
considerable attention since function evaluations have become increasingly
expensive in such applications. Multifidelity methods use cheap approximations
to the function of interest to speed up the overall optimisation process.
However, most multifidelity methods assume only a finite number of
approximations. In many practical applications however, a continuous spectrum
of approximations might be available. For instance, when tuning an expensive
neural network, one might choose to approximate the cross validation
performance using less data $N$ and/or few training iterations $T$. Here, the
approximations are best viewed as arising out of a continuous two dimensional
space $(N,T)$. In this work, we develop a Bayesian optimisation method, BOCA,
for this setting. We characterise its theoretical properties and show that it
achieves better regret than than strategies which ignore the approximations.
BOCA outperforms several other baselines in synthetic and real experiments.

Galaxyscale strong gravitational lensing is not only a valuable probe of the
dark matter distribution of massive galaxies, but can also provide valuable
cosmological constraints, either by studying the population of strong lenses or
by measuring time delays in lensed quasars. Due to the rarity of galaxyscale
strongly lensed systems, fast and reliable automated lens finding methods will
be essential in the era of large surveys such as LSST, Euclid, and WFIRST. To
tackle this challenge, we introduce CMU DeepLens, a new fully automated
galaxygalaxy lens finding method based on Deep Learning. This supervised
machine learning approach does not require any tuning after the training step
which only requires realistic image simulations of strongly lensed systems. We
train and validate our model on a set of 20,000 LSSTlike mock observations
including a range of lensed systems of various sizes and signaltonoise ratios
(S/N). We find on our simulated data set that for a rejection rate of
nonlenses of 99%, a completeness of 90% can be achieved for lenses with
Einstein radii larger than 1.4" and S/N larger than 20 on individual $g$band
LSST exposures. Finally, we emphasize the importance of realistically complex
simulations for training such machine learning methods by demonstrating that
the performance of models of significantly different complexities cannot be
distinguished on simpler simulations. We make our code publicly available at
https://github.com/McWilliamsCenter/CMUDeepLens .

Sophisticated gated recurrent neural network architectures like LSTMs and
GRUs have been shown to be highly effective in a myriad of applications. We
develop an ungated unit, the statistical recurrent unit (SRU), that is able to
learn long term dependencies in data by only keeping moving averages of
statistics. The SRU's architecture is simple, ungated, and contains a
comparable number of parameters to LSTMs; yet, SRUs perform favorably to more
sophisticated LSTM and GRU alternatives, often outperforming one or both in
various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an
unbiased manner by optimizing respective architectures' hyperparameters in a
Bayesian optimization scheme for both synthetic and realworld tasks.

We introduce a simple permutation equivariant layer for deep learning with
set structure.This type of layer, obtained by parametersharing, has a simple
implementation and lineartime complexity in the size of each set. We use deep
permutationinvariant networks to perform pointcould classification and
MNISTdigit summation, where in both cases the output is invariant to
permutations of the input. In a semisupervised setting, where the goal is make
predictions for each instance within a set, we demonstrate the usefulness of
this type of layer in setoutlier detection as well as semisupervised learning
with clustering sideinformation.

A common problem in disciplines of applied Statistics research such as
Astrostatistics is of estimating the posterior distribution of relevant
parameters. Typically, the likelihoods for such models are computed via
expensive experiments such as cosmological simulations of the universe. An
urgent challenge in these research domains is to develop methods that can
estimate the posterior with few likelihood evaluations.
In this paper, we study active posterior estimation in a Bayesian setting
when the likelihood is expensive to evaluate. Existing techniques for posterior
estimation are based on generating samples representative of the posterior.
Such methods do not consider efficiency in terms of likelihood evaluations. In
order to be query efficient we treat posterior estimation in an active
regression framework. We propose two myopic query strategies to choose where to
evaluate the likelihood and implement them using Gaussian processes. Via
experiments on a series of synthetic and real examples we demonstrate that our
approach is significantly more query efficient than existing techniques and
other heuristics for posterior estimation.

Understanding the nature of dark energy, the mysterious force driving the
accelerated expansion of the Universe, is a major challenge of modern
cosmology. The next generation of cosmological surveys, specifically designed
to address this issue, rely on accurate measurements of the apparent shapes of
distant galaxies. However, shape measurement methods suffer from various
unavoidable biases and therefore will rely on a precise calibration to meet the
accuracy requirements of the science analysis. This calibration process remains
an open challenge as it requires large sets of high quality galaxy images. To
this end, we study the application of deep conditional generative models in
generating realistic galaxy images. In particular we consider variations on
conditional variational autoencoder and introduce a new adversarial objective
for training of conditional generative networks. Our results suggest a reliable
alternative to the acquisition of expensive high quality observations for
generating the calibration data needed by the next generation of cosmological
surveys.

Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is
used as the building block in energybased deep generative models. Due to
numerical stability and quantifiability of the likelihood, RBM is commonly used
with Bernoulli units. Here, we consider an alternative member of exponential
family RBM with leaky rectified linear units  called leaky RBM. We first
study the joint and marginal distributions of leaky RBM under different
leakiness, which provides us important insights by connecting the leaky RBM
model and truncated Gaussian distributions. The connection leads us to a simple
yet efficient method for sampling from this model, where the basic idea is to
anneal the leakiness rather than the energy;  i.e., start from a fully
Gaussian/Linear unit and gradually decrease the leakiness over iterations. This
serves as an alternative to the annealing of the temperature parameter and
enables numerical estimation of the likelihood that are more efficient and more
accurate than the commonly used annealed importance sampling (AIS). We further
demonstrate that the proposed sampling algorithm enjoys faster mixing property
than contrastive divergence algorithm, which benefits the training without any
additional computational cost.