
The Wasserstein probability metric has received much attention from the
machine learning community. Unlike the Kullback-Leibler divergence, which
strictly measures change in probability, the Wasserstein metric reflects the
underlying geometry between outcomes. The value of being sensitive to this
geometry has been demonstrated, among others, in ordinal regression and
generative modelling. In this paper we describe three natural properties of
probability divergences that reflect requirements from machine learning: sum
invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein
metric possesses the first two properties but, unlike the Kullback-Leibler
divergence, does not possess the third. We provide empirical evidence
suggesting that this is a serious issue in practice. Leveraging insights from
probabilistic forecasting we propose an alternative to the Wasserstein metric,
the Cram\'er distance. We show that the Cram\'er distance possesses all three
desired properties, combining the best of the Wasserstein and Kullback-Leibler
divergences. To illustrate the relevance of the Cram\'er distance in practice
we design a new algorithm, the Cram\'er Generative Adversarial Network (GAN),
and show that it performs significantly better than the related Wasserstein
GAN.
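
As a minimal illustration of the geometric sensitivity described above, the sketch below compares the two distances on a one-dimensional discrete support with unit spacing, where the Wasserstein metric reduces to the l1 distance between cumulative distribution functions and the Cramér distance to the corresponding l2 distance. The function names and the toy distributions are assumptions made for this example, not part of the original work.

```python
from itertools import accumulate
import math

def cdf(p):
    """Cumulative distribution of a probability vector over ordered outcomes."""
    return list(accumulate(p))

def wasserstein_1(p, q):
    """1-D Wasserstein distance on a unit-spaced grid: sum of |F_P - F_Q|."""
    return sum(abs(a - b) for a, b in zip(cdf(p), cdf(q)))

def cramer(p, q):
    """Cramér (l2) distance on a unit-spaced grid: sqrt of sum of (F_P - F_Q)^2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cdf(p), cdf(q))))

# Toy distributions (hypothetical): point masses on outcomes 0, 1 and 3.
p = [1, 0, 0, 0]
q_near = [0, 1, 0, 0]
q_far = [0, 0, 0, 1]

# Both distances grow as the mass moves further away from outcome 0.
near = (wasserstein_1(p, q_near), cramer(p, q_near))
far = (wasserstein_1(p, q_far), cramer(p, q_far))
```

Here both distances are larger for `q_far` than for `q_near`, whereas a divergence that ignores the ordering of outcomes would treat the two cases identically.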

We train a generator by maximum likelihood and we also train the same
generator architecture by Wasserstein GAN. We then compare the generated
samples, exact log-probability densities and approximate Wasserstein distances.
We show that an independent critic trained to approximate Wasserstein distance
between the validation set and the generator distribution helps detect
overfitting. Finally, we use ideas from the one-shot learning literature to
develop a novel fast-learning critic.

Neural networks augmented with external memory have the ability to learn
algorithmic solutions to complex tasks. These models appear promising for
applications such as language modeling and machine translation. However, they
scale poorly in both space and time as the amount of memory grows, limiting
their applicability to real-world domains. Here, we present an end-to-end
differentiable memory access scheme, which we call Sparse Access Memory (SAM),
that retains the representational power of the original approaches whilst
training efficiently with very large memories. We show that SAM achieves
asymptotic lower bounds in space and time complexity, and find that an
implementation runs $1,\!000\times$ faster and with $3,\!000\times$ less
physical memory than non-sparse models. SAM learns with comparable data
efficiency to existing models on a range of synthetic tasks and one-shot
Omniglot character recognition, and can scale to tasks requiring $100,\!000$s
of time steps and memories. We also show how our approach can be adapted
for models that maintain temporal associations between memories, as with the
recently introduced Differentiable Neural Computer.
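
The idea of a sparse read can be sketched as follows: rather than weighting every memory slot, the read attends only to the K slots most similar to the query. This is a minimal sketch under stated assumptions. The names `sparse_read`, `k` and `beta` are hypothetical, and the exhaustive similarity scan used here for clarity would have to be replaced by an approximate nearest-neighbour index to obtain the sub-linear cost the abstract refers to.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def sparse_read(memory, query, k=2, beta=1.0):
    """Read from memory using only the k slots most similar to the query.

    Assumption: similarities are computed exhaustively here; a real system
    would use an approximate nearest-neighbour structure instead.
    """
    scores = [cosine(row, query) for row in memory]
    top = heapq.nlargest(k, range(len(memory)), key=scores.__getitem__)
    # Softmax restricted to the selected slots; all other weights are zero.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(beta * (scores[i] - peak)) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    read = [sum(weights[i] * memory[i][d] for i in top)
            for d in range(len(query))]
    return read, weights

# Toy memory with four 2-D slots (hypothetical values).
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
read, weights = sparse_read(memory, [1.0, 0.0], k=2)
```

Only the selected slots contribute to the read vector, so gradients would flow to a constant number of memory rows per step regardless of the total memory size.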

We propose a probabilistic video model, the Video Pixel Network (VPN), that
estimates the discrete joint distribution of the raw pixel values in a video.
The model and the neural architecture reflect the time, space and color
structure of video tensors and encode it as a four-dimensional dependency
chain. The VPN approaches the best possible performance on the Moving MNIST
benchmark, a leap over the previous state of the art, and the generated videos
show only minor deviations from the ground truth. The VPN also produces
detailed samples on the action-conditional Robotic Pushing benchmark and
generalizes to the motion of novel objects.
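
One way to write such a four-dimensional dependency chain is as an autoregressive factorization over time, rows, columns and color channels. The notation below is an assumption made for illustration: $T$ frames of size $H \times W$, channels ordered R, G, B, and $\mathbf{x}_{<}$ denoting all pixels that precede position $(t, i, j)$ in the chosen ordering.

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} \prod_{i=1}^{H} \prod_{j=1}^{W}
  p(x_{t,i,j,B} \mid \mathbf{x}_{<}, x_{t,i,j,R}, x_{t,i,j,G})\,
  p(x_{t,i,j,G} \mid \mathbf{x}_{<}, x_{t,i,j,R})\,
  p(x_{t,i,j,R} \mid \mathbf{x}_{<})
```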

We propose a novel approach to reduce memory consumption of the
backpropagation through time (BPTT) algorithm when training recurrent neural
networks (RNNs). Our approach uses dynamic programming to balance a trade-off
between caching of intermediate results and recomputation. The algorithm is
capable of tightly fitting within almost any user-set memory budget while
finding an optimal execution policy minimizing the computational cost.
Computational devices have limited memory capacity, and maximizing
computational performance given a fixed memory budget is a practical use case.
We provide asymptotic computational upper bounds for various regimes. The
algorithm is particularly effective for long sequences. For sequences of length
1000, our algorithm saves 95\% of memory usage while using only one third more
time per iteration than the standard BPTT.
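
The caching-versus-recomputation trade-off can be illustrated with a small dynamic program. The sketch below uses a simplified cost model that is an assumption of this example and may differ in detail from the one analysed in the paper: cost is counted in forward steps, one memory slot holds one hidden state, and with a single slot every backward step restarts the forward pass from the beginning.

```python
def bptt_cost_table(max_t, max_m):
    """Minimum number of forward steps C[t][m] needed to backpropagate
    through t time steps with m memory slots (simplified, assumed model).

    Recursion: run forward y steps and cache that state, handle the right
    part (t - y steps) with one fewer slot, then the left part (y steps)
    with all m slots again.
    """
    C = [[0] * (max_m + 1) for _ in range(max_t + 1)]
    for t in range(1, max_t + 1):
        # One slot: recompute from the start for every backward step.
        C[t][1] = t * (t + 1) // 2
    for m in range(1, max_m + 1):
        C[1][m] = 1  # a single step needs exactly one forward operation
    for m in range(2, max_m + 1):
        for t in range(2, max_t + 1):
            C[t][m] = min(y + C[t - y][m - 1] + C[y][m] for y in range(1, t))
    return C

# Cost of a length-50 sequence for memory budgets of 1 to 5 slots.
C = bptt_cost_table(50, 5)
costs_for_50 = [C[50][m] for m in range(1, 6)]
```

Under this model the cost never increases as slots are added, and the position `y` that attains the minimum at each step defines the execution policy.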

Humans have an impressive ability to reason about new concepts and
experiences from just a single example. In particular, humans have an ability
for one-shot generalization: an ability to encounter a new concept, understand
its structure, and then be able to generate compelling alternative variations
of the concept. We develop machine learning systems with this important
capacity by developing new deep generative models, models that combine the
representational power of deep learning with the inferential power of Bayesian
reasoning. We develop a class of sequential generative models that are built on
the principles of feedback and attention. These two characteristics lead to
generative models that are among the state of the art in density estimation and
image generation. We demonstrate the one-shot generalization ability of our
models using three tasks: unconditional sampling, generating new exemplars of a
given concept, and generating new exemplars of a family of concepts. In all
cases our models are able to generate compelling and diverse samples, having
seen new examples just once, providing an important class of general-purpose
models for one-shot machine learning.

We investigate a new method to augment recurrent neural networks with extra
memory without increasing the number of network parameters. The system has an
associative memory based on complex-valued vectors and is closely related to
Holographic Reduced Representations and Long Short-Term Memory networks.
Holographic Reduced Representations have limited capacity: as they store more
information, each retrieval becomes noisier due to interference. Our system in
contrast creates redundant copies of stored information, which enables
retrieval with reduced noise. Experiments demonstrate faster learning on
multiple memorization tasks.
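
The effect of redundant copies on retrieval noise can be sketched with a toy complex-valued associative memory. In this sketch, items are bound to unit-modulus complex keys by element-wise multiplication, summed into a single trace, and retrieved by multiplying with the complex conjugate of the key. Using independent random keys for each copy is an assumption made for simplicity, and the function names are hypothetical.

```python
import cmath
import math
import random

def random_key(dim, rng):
    """Key of unit-modulus complex numbers with uniformly random phases."""
    return [cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi)) for _ in range(dim)]

def store(keys, values):
    """Superimpose all key-value bindings into one complex trace."""
    dim = len(values[0])
    trace = [0j] * dim
    for key, value in zip(keys, values):
        for d in range(dim):
            trace[d] += key[d] * value[d]
    return trace

def retrieve(trace, key):
    """Unbind with the conjugate key; the real part estimates the value."""
    return [(k.conjugate() * m).real for k, m in zip(key, trace)]

def retrieval_error(n_copies, n_items=10, dim=256, seed=0):
    """Mean squared error when item 0 is recovered by averaging n_copies
    traces, each built with its own keys (an assumption of this sketch)."""
    rng = random.Random(seed)
    values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
    estimate = [0.0] * dim
    for _ in range(n_copies):
        keys = [random_key(dim, rng) for _ in range(n_items)]
        trace = store(keys, values)
        for d, x in enumerate(retrieve(trace, keys[0])):
            estimate[d] += x / n_copies
    return sum((e - v) ** 2 for e, v in zip(estimate, values[0])) / dim

error_one_copy = retrieval_error(1)
error_eight_copies = retrieval_error(8)
```

Because the interference from other items has a random phase in each copy, averaging the retrievals cancels part of it, so the error with eight copies is expected to be well below the error with one.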

We introduce a simple recurrent variational autoencoder architecture that
significantly improves image modeling. The system represents the
state of the art in latent variable models for both the ImageNet and Omniglot
datasets. We show that it naturally separates global conceptual information
from lower level details, thus addressing one of the fundamentally desired
properties of unsupervised learning. Furthermore, the possibility of
restricting ourselves to storing only global information about an image allows
us to achieve high quality 'conceptual compression'.

This paper introduces Grid Long Short-Term Memory, a network of LSTM cells
arranged in a multidimensional grid that can be applied to vectors, sequences
or higher dimensional data such as images. The network differs from existing
deep LSTM architectures in that the cells are connected between network layers
as well as along the spatiotemporal dimensions of the data. The network
provides a unified way of using LSTM for both deep and sequential computation.
We apply the model to algorithmic tasks such as 15-digit integer addition and
sequence memorization, where it is able to significantly outperform the
standard LSTM. We then give results for two empirical tasks. We find that 2D
Grid LSTM achieves 1.47 bits per character on the Wikipedia character
prediction benchmark, which is state of the art among neural approaches. In
addition, we use the Grid LSTM to define a novel two-dimensional translation
model, the Reencoder, and show that it outperforms a phrase-based reference
system on a Chinese-to-English translation task.

This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural
network architecture for image generation. DRAW networks combine a novel
spatial attention mechanism that mimics the foveation of the human eye, with a
sequential variational autoencoding framework that allows for the iterative
construction of complex images. The system substantially improves on the state
of the art for generative models on MNIST, and, when trained on the Street View
House Numbers dataset, it generates images that cannot be distinguished from
real data with the naked eye.

We extend the capabilities of neural networks by coupling them to external
memory resources, which they can interact with by attentional processes. The
combined system is analogous to a Turing Machine or Von Neumann architecture
but is differentiable end-to-end, allowing it to be efficiently trained with
gradient descent. Preliminary results demonstrate that Neural Turing Machines
can infer simple algorithms such as copying, sorting, and associative recall
from input and output examples.

We introduce a deep, generative autoencoder capable of learning hierarchies
of distributed representations from data. Successive deep stochastic hidden
layers are equipped with autoregressive connections, which enable the model to
be sampled from quickly and exactly via ancestral sampling. We derive an
efficient approximate parameter estimation method based on the minimum
description length (MDL) principle, which can be seen as maximising a
variational lower bound on the log-likelihood, with a feedforward neural
network implementing approximate inference. We demonstrate state-of-the-art
generative performance on a number of classic data sets: several UCI data sets,
MNIST and Atari 2600 games.
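
For reference, the standard variational lower bound on the log-likelihood can be written as below, where $h$ denotes the stochastic hidden variables and $q(h \mid x)$ the approximate posterior produced by the inference network. The symbols are chosen here for illustration. The negative of the right-hand side can be read as the expected description length of $x$ under the model.

```latex
\log p(x) \;\ge\; \mathbb{E}_{q(h \mid x)}\big[\log p(x, h) - \log q(h \mid x)\big]
```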

Many reinforcement learning exploration techniques are overly optimistic and
try to explore every state. Such exploration is impossible in environments with
an unlimited number of states. I propose to use simulated exploration with an
optimistic model to discover promising paths for real exploration. This reduces
the need for real exploration.