
Deep autoregressive models have shown stateoftheart performance in density
estimation for natural images on largescale datasets such as ImageNet.
However, such models require many thousands of gradientbased weight updates
and unique image examples for training. Ideally, the models would rapidly learn
visual concepts from only a handful of examples, similar to the manner in which
humans learns across many vision tasks. In this paper, we show how 1) neural
attention and 2) meta learning techniques can be used in combination with
autoregressive models to enable effective fewshot density estimation. Our
proposed modifications to PixelCNN result in stateofthe art fewshot density
estimation on the Omniglot dataset. Furthermore, we visualize the learned
attention policy and find that it learns intuitive algorithms for simple tasks
such as image mirroring on ImageNet and handwriting on Omniglot without
supervision. Finally, we extend the model to natural images and demonstrate
fewshot image generation on the Stanford Online Products dataset.

We learn recurrent neural network optimizers trained on simple synthetic
functions by gradient descent. We show that these learned optimizers exhibit a
remarkable degree of transfer in that they can be used to efficiently optimize
a broad range of derivativefree blackbox functions, including Gaussian
process bandits, simple control objectives, global optimization benchmarks and
hyperparameter tuning tasks. Up to the training horizon, the learned
optimizers learn to tradeoff exploration and exploitation, and compare
favourably with heavily engineered Bayesian optimization packages for
hyperparameter tuning.

Drawing a sample from a discrete distribution is one of the building
components for Monte Carlo methods. Like other sampling algorithms, discrete
sampling suffers from the high computational burden in largescale inference
problems. We study the problem of sampling a discrete random variable with a
high degree of dependency that is typical in largescale Bayesian inference and
graphical models, and propose an efficient approximate solution with a
subsampling approach. We make a novel connection between the discrete sampling
and MultiArmed Bandits problems with a finite reward population and provide
three algorithms with theoretical guarantees. Empirical evaluations show the
robustness and efficiency of the approximate algorithms in both synthetic and
realworld largescale problems.

Herding defines a deterministic dynamical system at the edge of chaos. It
generates a sequence of model states and parameters by alternating parameter
perturbations with state maximizations, where the sequence of states can be
interpreted as "samples" from an associated MRF model. Herding differs from
maximum likelihood estimation in that the sequence of parameters does not
converge to a fixed point and differs from an MCMC posterior sampling approach
in that the sequence of states is generated deterministically. Herding may be
interpreted as a"perturb and map" method where the parameter perturbations are
generated using a deterministic nonlinear dynamical system rather than randomly
from a Gumbel distribution. This chapter studies the distinct statistical
characteristics of the herding algorithm and shows that the fast convergence
rate of the controlled moments may be attributed to edge of chaos dynamics. The
herding algorithm can also be generalized to models with latent variables and
to a discriminative learning setting. The perceptron cycling theorem ensures
that the fast moment matching property is preserved in the more general
framework.

Probabilistic programming languages can simplify the development of machine
learning techniques, but only if inference is sufficiently scalable.
Unfortunately, Bayesian parameter estimation for highly coupled models such as
regressions and statespace models still scales poorly; each MCMC transition
takes linear time in the number of observations. This paper describes a
sublineartime algorithm for making MetropolisHastings (MH) updates to latent
variables in probabilistic programs. The approach generalizes recently
introduced approximate MH techniques: instead of subsampling data items assumed
to be independent, it subsamples edges in a dynamically constructed graphical
model. It thus applies to a broader class of problems and interoperates with
other generalpurpose inference techniques. Empirical results, including
confirmation of sublinear pertransition scaling, are presented for Bayesian
logistic regression, nonlinear classification via joint Dirichlet process
mixtures, and parameter estimation for stochastic volatility models (with state
estimation via particle MCMC). All three applications use the same
implementation, and each requires under 20 lines of probabilistic code.

Multivariate categorical data occur in many applications of machine learning.
One of the main difficulties with these vectors of categorical variables is
sparsity. The number of possible observations grows exponentially with vector
length, but dataset diversity might be poor in comparison. Recent models have
gained significant improvement in supervised tasks with this data. These models
embed observations in a continuous space to capture similarities between them.
Building on these ideas we propose a Bayesian model for the unsupervised task
of distribution estimation of multivariate categorical data. We model vectors
of categorical variables as generated from a nonlinear transformation of a
continuous latent space. Nonlinearity captures multimodality in the
distribution. The continuous representation addresses sparsity. Our model ties
together many existing models, linking the linear categorical latent Gaussian
model, the Gaussian process latent variable model, and Gaussian process
classification. We derive inference for our model based on recent developments
in sampling based variational inference. We show empirically that the model
outperforms its linear and discrete counterparts in imputation tasks of sparse
data.

Statespace models have been successfully used for more than fifty years in
different areas of science and engineering. We present a procedure for
efficient variational Bayesian learning of nonlinear statespace models based
on sparse Gaussian processes. The result of learning is a tractable posterior
over nonlinear dynamical systems. In comparison to conventional parametric
models, we offer the possibility to straightforwardly trade off model capacity
and computational cost whilst avoiding overfitting. Our main algorithm uses a
hybrid inference approach combining variational Bayes and sequential Monte
Carlo. We also present stochastic variational inference and online learning
approaches for fast learning with long time series.

In recent years a number of methods have been developed for automatically
learning the (sparse) connectivity structure of Markov Random Fields. These
methods are mostly based on L1regularized optimization which has a number of
disadvantages such as the inability to assess model uncertainty and expensive
crossvalidation to find the optimal regularization parameter. Moreover, the
model's predictive performance may degrade dramatically with a suboptimal value
of the regularization parameter (which is sometimes desirable to induce
sparseness). We propose a fully Bayesian approach based on a "spike and slab"
prior (similar to L0 regularization) that does not suffer from these
shortcomings. We develop an approximate MCMC method combining Langevin dynamics
and reversible jump MCMC to conduct inference in this model. Experiments show
that the proposed model learns a good combination of the structure and
parameter values without the need for separate hyperparameter tuning.
Moreover, the model's predictive performance is much more robust than L1based
methods with hyperparameter settings that induce highly sparse model
structures.

Can we make Bayesian posterior MCMC sampling more efficient when faced with
very large datasets? We argue that computing the likelihood for N datapoints in
the MetropolisHastings (MH) test to reach a single binary decision is
computationally inefficient. We introduce an approximate MH rule based on a
sequential hypothesis test that allows us to accept or reject samples with high
confidence using only a fraction of the data required for the exact MH rule.
While this method introduces an asymptotic bias, we show that this bias can be
controlled and is more than offset by a decrease in variance due to our ability
to draw more samples per unit of time.

The Gibbs sampler is one of the most popular algorithms for inference in
statistical models. In this paper, we introduce a herding variant of this
algorithm, called herded Gibbs, that is entirely deterministic. We prove that
herded Gibbs has an $O(1/T)$ convergence rate for models with independent
variables and for fully connected probabilistic graphical models. Herded Gibbs
is shown to outperform Gibbs in the tasks of image denoising with MRFs and
named entity recognition with CRFs. However, the convergence for herded Gibbs
for sparsely connected probabilistic graphical models is still an open problem.

In recent years a number of methods have been developed for automatically
learning the (sparse) connectivity structure of Markov Random Fields. These
methods are mostly based on L1regularized optimization which has a number of
disadvantages such as the inability to assess model uncertainty and expensive
crossvalidation to find the optimal regularization parameter. Moreover, the
model's predictive performance may degrade dramatically with a suboptimal value
of the regularization parameter (which is sometimes desirable to induce
sparseness). We propose a fully Bayesian approach based on a "spike and slab"
prior (similar to L0 regularization) that does not suffer from these
shortcomings. We develop an approximate MCMC method combining Langevin dynamics
and reversible jump MCMC to conduct inference in this model. Experiments show
that the proposed model learns a good combination of the structure and
parameter values without the need for separate hyperparameter tuning.
Moreover, the model's predictive performance is much more robust than L1based
methods with hyperparameter settings that induce highly sparse model
structures.

We extend the herding algorithm to continuous spaces by using the kernel
trick. The resulting "kernel herding" algorithm is an infinite memory
deterministic process that learns to approximate a PDF with a collection of
samples. We show that kernel herding decreases the error of expectations of
functions in the Hilbert space at a rate O(1/T) which is much faster than the
usual O(1/pT) for iid random samples. We illustrate kernel herding by
approximating Bayesian predictive distributions.