
This paper introduces Associative Compression Networks (ACNs), a new
framework for variational autoencoding with neural networks. The system differs
from existing variational autoencoders (VAEs) in that the prior distribution
used to model each code is conditioned on a similar code from the dataset. In
compression terms this equates to sequentially transmitting the dataset using
an ordering determined by proximity in latent space. Since the prior need only
account for local, rather than global variations in the latent space, the
coding cost is greatly reduced, leading to rich, informative codes. Crucially,
the codes remain informative when powerful, autoregressive decoders are used,
which we argue is fundamentally difficult with normal VAEs. Experimental
results on MNIST, CIFAR10, ImageNet and CelebA show that ACNs discover
high-level latent features such as object class, writing style, pose and facial
expression, which can be used to cluster and classify the data, as well as to
generate diverse and convincing samples. We conclude that ACNs are a promising
new direction for representation learning: one that steps away from IID
modelling, and towards learning a structured description of the dataset as a
whole.

We present an end-to-end trained memory system that quickly adapts to new
data and generates samples like them. Inspired by Kanerva's sparse distributed
memory, it has a robust distributed reading and writing mechanism. The memory
is analytically tractable, which enables optimal online compression via a
Bayesian update-rule. We formulate it as a hierarchical conditional generative
model, where memory provides a rich data-dependent prior distribution.
Consequently, the top-down memory and bottom-up perception are combined to
produce the code representing an observation. Empirically, we demonstrate that
the adaptive memory significantly improves generative models trained on both
the Omniglot and CIFAR datasets. Compared with the Differentiable Neural
Computer (DNC) and its variants, our memory model has greater capacity and is
significantly easier to train.

We introduce NoisyNet, a deep reinforcement learning agent with parametric
noise added to its weights, and show that the induced stochasticity of the
agent's policy can be used to aid efficient exploration. The parameters of the
noise are learned with gradient descent along with the remaining network
weights. NoisyNet is straightforward to implement and adds little computational
overhead. We find that replacing the conventional exploration heuristics for
A3C, DQN and dueling agents (entropy reward and $\epsilon$-greedy respectively)
with NoisyNet yields substantially higher scores for a wide range of Atari
games, in some cases advancing the agent from sub-human to super-human performance.
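
As a minimal sketch of the idea (not the paper's implementation), a noisy linear layer can perturb every weight and bias as `mu + sigma * eps`, with `eps` drawn afresh on each forward pass. The name `noisy_linear`, the independent Gaussian noise per parameter, and the list-of-lists weight layout are assumptions made for illustration; during training both `mu` and `sigma` would be learned by gradient descent.

```python
import random


def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    """Compute y = (w_mu + w_sigma * eps_w) x + (b_mu + b_sigma * eps_b)."""
    out = []
    for j in range(len(b_mu)):
        # Perturb the bias with its own noise sample.
        acc = b_mu[j] + b_sigma[j] * rng.gauss(0.0, 1.0)
        for i in range(len(x)):
            # Perturb each weight independently (assumed noise model).
            w = w_mu[j][i] + w_sigma[j][i] * rng.gauss(0.0, 1.0)
            acc += w * x[i]
        out.append(acc)
    return out


rng = random.Random(0)
x = [1.0, 2.0]
w_mu = [[0.5, -0.5], [1.0, 1.0]]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
ones_w = [[1.0, 1.0], [1.0, 1.0]]

# With sigma = 0 the layer is an ordinary deterministic linear layer.
plain = noisy_linear(x, w_mu, zeros_w, [0.0, 0.0], [0.0, 0.0], rng)
# With sigma > 0 repeated calls give different outputs, which is the
# source of the stochastic policy used for exploration.
noisy_a = noisy_linear(x, w_mu, ones_w, [0.0, 0.0], [1.0, 1.0], rng)
noisy_b = noisy_linear(x, w_mu, ones_w, [0.0, 0.0], [1.0, 1.0], rng)
```

Setting `sigma` to zero recovers the noise-free layer, so the amount of exploration is controlled entirely by learned parameters rather than by a hand-tuned schedule.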

Training directed neural networks typically requires forward-propagating data
through a computation graph, followed by backpropagating an error signal, to
produce weight updates. All layers, or more generally, modules, of the network
are therefore locked, in the sense that they must wait for the remainder of the
network to execute forwards and propagate error backwards before they can be
updated. In this work we break this constraint, decoupling modules by
introducing a model of the future computation of the network graph. These
models predict what the modelled subgraph will produce using only
local information. In particular we focus on modelling error gradients: by
using the modelled synthetic gradient in place of true backpropagated error
gradients we decouple subgraphs, and can update them independently and
asynchronously, i.e. we realise decoupled neural interfaces. We show results for
feedforward models, where every layer is trained asynchronously, recurrent
neural networks (RNNs) where predicting one's future gradient extends the time
over which the RNN can effectively model, and also a hierarchical RNN system
with ticking at different timescales. Finally, we demonstrate that in addition
to predicting gradients, the same framework can be used to predict inputs,
resulting in models which are decoupled in both the forward and backwards pass,
amounting to independent networks which co-learn such that they can be
composed into a single functioning corporation.

We introduce a method for automatically selecting the path, or syllabus, that
a neural network follows through a curriculum so as to maximise learning
efficiency. A measure of the amount that the network learns from each data
sample is provided as a reward signal to a nonstationary multi-armed bandit
algorithm, which then determines a stochastic syllabus. We consider a range of
signals derived from two distinct indicators of learning progress: rate of
increase in prediction accuracy, and rate of increase in network complexity.
Experimental results for LSTM networks on three curricula demonstrate that our
approach can significantly accelerate learning, in some cases halving the time
required to attain a satisfactory performance level.
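
The loop described above can be sketched with an Exp3-style adversarial bandit. This is an illustrative assumption rather than the paper's exact algorithm: the class name `Exp3Syllabus`, the exploration rate `gamma`, and the toy reward (task 2 always yields learning progress) are all hypothetical, and rewards are assumed to be rescaled to [0, 1].

```python
import math
import random


class Exp3Syllabus:
    """Exp3-style bandit over tasks; rewards are assumed to lie in [0, 1]."""

    def __init__(self, n_tasks, gamma=0.1, seed=0):
        self.n_tasks = n_tasks
        self.gamma = gamma
        self.log_weights = [0.0] * n_tasks  # kept in log space for stability
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the softmax of the weights with a uniform exploration term."""
        top = max(self.log_weights)
        exps = [math.exp(w - top) for w in self.log_weights]
        total = sum(exps)
        uniform = self.gamma / self.n_tasks
        return [(1.0 - self.gamma) * e / total + uniform for e in exps]

    def sample_task(self):
        """Draw the next task from the current stochastic syllabus."""
        probs = self.probabilities()
        return self.rng.choices(range(self.n_tasks), weights=probs)[0]

    def update(self, task, reward):
        """Importance-weighted update for the task that was actually tried."""
        p = self.probabilities()[task]
        self.log_weights[task] += self.gamma * (reward / p) / self.n_tasks


bandit = Exp3Syllabus(n_tasks=3)
for _ in range(200):
    task = bandit.sample_task()
    reward = 1.0 if task == 2 else 0.0  # hypothetical learning-progress signal
    bandit.update(task, reward)
probs = bandit.probabilities()
```

After the loop the syllabus concentrates on the task that keeps producing progress, while the uniform term guarantees that every task is still sampled occasionally.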

We present a novel neural network for processing sequences. The ByteNet is a
one-dimensional convolutional neural network that is composed of two parts, one
to encode the source sequence and the other to decode the target sequence. The
two network parts are connected by stacking the decoder on top of the encoder
and preserving the temporal resolution of the sequences. To address the
differing lengths of the source and the target, we introduce an efficient
mechanism by which the decoder is dynamically unfolded over the representation
of the encoder. The ByteNet uses dilation in the convolutional layers to
increase its receptive field. The resulting network has two core properties: it
runs in time that is linear in the length of the sequences and it sidesteps the
need for excessive memorization. The ByteNet decoder attains state-of-the-art
performance on character-level language modelling and outperforms the previous
best results obtained with recurrent networks. The ByteNet also achieves
state-of-the-art performance on character-to-character machine translation on
the English-to-German WMT translation task, surpassing comparable neural
translation models that are based on recurrent networks with attentional
pooling and run in quadratic time. We find that the latent alignment structure
contained in the representations reflects the expected alignment between the
tokens.
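
To illustrate why dilation enlarges the receptive field, the helper below computes the receptive field of a stack of stride-1, one-dimensional dilated convolutions. The kernel size of 3 and the doubling dilation schedule are illustrative assumptions, not values taken from the abstract.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 dilated 1-D convolutions."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation steps.
        field += (kernel_size - 1) * d
    return field


doubling = receptive_field(3, [1, 2, 4, 8, 16])   # 63
undilated = receptive_field(3, [1, 1, 1, 1, 1])   # 11
```

With the same five layers, doubling the dilation at each layer covers 63 time steps instead of 11, while the cost per layer stays linear in the sequence length.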

This paper introduces Adaptive Computation Time (ACT), an algorithm that
allows recurrent neural networks to learn how many computational steps to take
between receiving an input and emitting an output. ACT requires minimal changes
to the network architecture, is deterministic and differentiable, and does not
add any noise to the parameter gradients. Experimental results are provided for
four synthetic problems: determining the parity of binary vectors, applying
binary logic operations, adding integers, and sorting real numbers. Overall,
performance is dramatically improved by the use of ACT, which successfully
adapts the number of computational steps to the requirements of the problem. We
also present character-level language modelling results on the Hutter prize
Wikipedia dataset. In this case ACT does not yield large gains in performance;
however, it does provide intriguing insight into the structure of the data, with
more computation allocated to harder-to-predict transitions, such as spaces
between words and ends of sentences. This suggests that ACT or other adaptive
computation methods could provide a generic method for inferring segment
boundaries in sequence data.

Neural networks augmented with external memory have the ability to learn
algorithmic solutions to complex tasks. These models appear promising for
applications such as language modeling and machine translation. However, they
scale poorly in both space and time as the amount of memory grows, limiting
their applicability to real-world domains. Here, we present an end-to-end
differentiable memory access scheme, which we call Sparse Access Memory (SAM),
that retains the representational power of the original approaches whilst
training efficiently with very large memories. We show that SAM achieves
asymptotic lower bounds in space and time complexity, and find that an
implementation runs $1,\!000\times$ faster and with $3,\!000\times$ less
physical memory than non-sparse models. SAM learns with comparable data
efficiency to existing models on a range of synthetic tasks and one-shot
Omniglot character recognition, and can scale to tasks requiring $100,\!000$s
of time steps and memories. We also show how our approach can be adapted
for models that maintain temporal associations between memories, as with the
recently introduced Differentiable Neural Computer.

We propose a probabilistic video model, the Video Pixel Network (VPN), that
estimates the discrete joint distribution of the raw pixel values in a video.
The model and the neural architecture reflect the time, space and color
structure of video tensors and encode it as a four-dimensional dependency
chain. The VPN approaches the best possible performance on the Moving MNIST
benchmark, a leap over the previous state of the art, and the generated videos
show only minor deviations from the ground truth. The VPN also produces
detailed samples on the action-conditional Robotic Pushing benchmark and
generalizes to the motion of novel objects.

This paper introduces WaveNet, a deep neural network for generating raw audio
waveforms. The model is fully probabilistic and autoregressive, with the
predictive distribution for each audio sample conditioned on all previous ones;
nonetheless we show that it can be efficiently trained on data with tens of
thousands of samples per second of audio. When applied to text-to-speech, it
yields state-of-the-art performance, with human listeners rating it as
significantly more natural sounding than the best parametric and concatenative
systems for both English and Mandarin. A single WaveNet can capture the
characteristics of many different speakers with equal fidelity, and can switch
between them by conditioning on the speaker identity. When trained to model
music, we find that it generates novel and often highly realistic musical
fragments. We also show that it can be employed as a discriminative model,
returning promising results for phoneme recognition.
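
The autoregressive structure described above corresponds to the chain-rule factorisation of a joint distribution. Writing a waveform as $\mathbf{x} = (x_1, \dots, x_T)$ (notation assumed here, not taken from the abstract):

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

Each factor is the predictive distribution for one audio sample given all earlier ones.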

The ability to backpropagate stochastic gradients through continuous latent
distributions has been crucial to the emergence of variational autoencoders and
stochastic gradient variational Bayes. The key ingredient is an unbiased and
low-variance way of estimating gradients with respect to distribution
parameters from gradients evaluated at distribution samples. The
"reparameterization trick" provides a class of transforms yielding such
estimators for many continuous distributions, including the Gaussian and other
members of the location-scale family. However, the trick does not readily extend
to mixture density models, due to the difficulty of reparameterizing the
discrete distribution over mixture weights. This report describes an
alternative transform, applicable to any continuous multivariate distribution
with a differentiable density function from which samples can be drawn, and
uses it to derive an unbiased estimator for mixture density weight derivatives.
Combined with the reparameterization trick applied to the individual mixture
components, this estimator makes it straightforward to train variational
autoencoders with mixture-distributed latent variables, or to perform
stochastic variational inference with a mixture density variational posterior.
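
For context, the sketch below shows the standard reparameterization trick for a single Gaussian, which is the building block the report applies to the individual mixture components. It does not implement the report's estimator for mixture weights. The test function `f(z) = z**2`, the sample count, and the name `reparam_gradients` are assumptions chosen so that the exact answer is known.

```python
import random


def reparam_gradients(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z**2], z ~ N(mu, sigma**2)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on the parameters
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # derivative of f(z) = z**2 at the sample
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Analytically E[z**2] = mu**2 + sigma**2, so the exact gradients
# are 2 * mu = 3.0 and 2 * sigma = 1.0.
```

The estimates are unbiased because the randomness enters only through `eps`, so gradients evaluated at samples can be pushed back to the distribution parameters.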

This work explores conditional image generation with a new image density
model based on the PixelCNN architecture. The model can be conditioned on any
vector, including descriptive labels or tags, or latent embeddings created by
other networks. When conditioned on class labels from the ImageNet database,
the model is able to generate diverse, realistic scenes representing distinct
animals, objects, landscapes and structures. When conditioned on an embedding
produced by a convolutional network given a single image of an unseen face, it
generates a variety of new portraits of the same person with different facial
expressions, poses and lighting conditions. We also show that conditional
PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally,
the gated convolutional layers in the proposed model improve the log-likelihood
of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet,
with greatly reduced computational cost.

We propose a conceptually simple and lightweight framework for deep
reinforcement learning that uses asynchronous gradient descent for optimization
of deep neural network controllers. We present asynchronous variants of four
standard reinforcement learning algorithms and show that parallel
actor-learners have a stabilizing effect on training, allowing all four methods
to successfully train neural network controllers. The best performing method,
an asynchronous variant of actor-critic, surpasses the current state-of-the-art
on the Atari domain while training for half the time on a single multi-core CPU
instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds
on a wide variety of continuous motor control problems as well as on a new task
of navigating random 3D mazes using a visual input.

We present a novel deep recurrent neural network architecture that learns to
build implicit plans in an end-to-end manner by purely interacting with an
environment in a reinforcement learning setting. The network builds an internal
plan, which is continuously updated upon observation of the next input from the
environment. It can also partition this internal representation into contiguous
sub-sequences by learning for how long the plan can be committed to, i.e.
followed without re-planning. Combining these properties, the proposed model,
dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally
abstracted macro-actions of varying lengths that are solely learnt from data
without any prior information. These macro-actions enable both structured
exploration and economic computation. We experimentally demonstrate that STRAW
delivers strong improvements on several ATARI games by employing temporally
extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same
time a general algorithm that can be applied to any sequence data. To that end,
we also show that when trained on a text prediction task, STRAW naturally
predicts frequent n-grams (instead of macro-actions), demonstrating the
generality of the approach.

We propose a novel approach to reduce memory consumption of the
backpropagation through time (BPTT) algorithm when training recurrent neural
networks (RNNs). Our approach uses dynamic programming to balance a trade-off
between caching of intermediate results and recomputation. The algorithm is
capable of tightly fitting within almost any user-set memory budget while
finding an optimal execution policy minimizing the computational cost.
Computational devices have limited memory capacity and maximizing
computational performance given a fixed memory budget is a practical use-case.
We provide asymptotic computational upper bounds for various regimes. The
algorithm is particularly effective for long sequences. For sequences of length
1000, our algorithm saves 95\% of memory usage while using only one third more
time per iteration than the standard BPTT.
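
The caching-versus-recomputation trade-off can be illustrated with a small dynamic program. The cost model here is an assumption made for this sketch and is not claimed to match the paper's formulation: memory holds `m` hidden states (one of them the initial state), the cost counts forward-step evaluations, and the backward step for position `k` requires forward step `k` to be re-run from a stored state. The name `min_forward_ops` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def min_forward_ops(t, m):
    """Fewest forward steps to backpropagate through t steps with m state slots."""
    if t < 1 or m < 1:
        raise ValueError("t and m must be positive")
    if t == 1:
        return 1  # run the single step once, then backpropagate it
    if m == 1:
        # Only the initial state is stored: step k is recomputed from scratch.
        return t * (t + 1) // 2
    # Advance y steps, cache that state, solve the right part with one slot
    # fewer, then solve the left part with the slot freed again.
    return min(
        y + min_forward_ops(t - y, m - 1) + min_forward_ops(y, m)
        for y in range(1, t)
    )


costs = [min_forward_ops(20, m) for m in (1, 2, 3, 20)]
```

Under this cost model the number of forward evaluations falls as the memory budget grows, from quadratic in the sequence length with a single slot to linear when every state can be cached.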

We investigate a new method to augment recurrent neural networks with extra
memory without increasing the number of network parameters. The system has an
associative memory based on complex-valued vectors and is closely related to
Holographic Reduced Representations and Long Short-Term Memory networks.
Holographic Reduced Representations have limited capacity: as they store more
information, each retrieval becomes noisier due to interference. Our system in
contrast creates redundant copies of stored information, which enables
retrieval with reduced noise. Experiments demonstrate faster learning on
multiple memorization tasks.

This paper introduces Grid Long Short-Term Memory, a network of LSTM cells
arranged in a multidimensional grid that can be applied to vectors, sequences
or higher dimensional data such as images. The network differs from existing
deep LSTM architectures in that the cells are connected between network layers
as well as along the spatiotemporal dimensions of the data. The network
provides a unified way of using LSTM for both deep and sequential computation.
We apply the model to algorithmic tasks such as 15-digit integer addition and
sequence memorization, where it is able to significantly outperform the
standard LSTM. We then give results for two empirical tasks. We find that 2D
Grid LSTM achieves 1.47 bits per character on the Wikipedia character
prediction benchmark, which is state-of-the-art among neural approaches. In
addition, we use the Grid LSTM to define a novel two-dimensional translation
model, the Reencoder, and show that it outperforms a phrase-based reference
system on a Chinese-to-English translation task.

This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural
network architecture for image generation. DRAW networks combine a novel
spatial attention mechanism that mimics the foveation of the human eye, with a
sequential variational autoencoding framework that allows for the iterative
construction of complex images. The system substantially improves on the state
of the art for generative models on MNIST, and, when trained on the Street View
House Numbers dataset, it generates images that cannot be distinguished from
real data with the naked eye.

We extend the capabilities of neural networks by coupling them to external
memory resources, which they can interact with by attentional processes. The
combined system is analogous to a Turing Machine or Von Neumann architecture
but is differentiable end-to-end, allowing it to be efficiently trained with
gradient descent. Preliminary results demonstrate that Neural Turing Machines
can infer simple algorithms such as copying, sorting, and associative recall
from input and output examples.
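
One way an attentional read can stay differentiable is to read a weighted blend of all memory rows instead of a single address. The sketch below assumes content-based addressing by cosine similarity followed by a softmax; the function name `content_read` and the `sharpness` parameter are hypothetical, and this is an illustration rather than the paper's full addressing mechanism.

```python
import math


def content_read(memory, key, sharpness=1.0):
    """Soft read: softmax over cosine similarities, then a weighted sum of rows."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / (norm + 1e-8)  # small constant avoids division by zero

    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return weights, read


memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, read = content_read(memory, key=[0.9, 0.1], sharpness=10.0)
```

Because every row contributes with a smooth weight, gradients flow to both the key and the memory contents, which is what allows training with gradient descent.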

Applying convolutional neural networks to large images is computationally
expensive because the amount of computation scales linearly with the number of
image pixels. We present a novel recurrent neural network model that is capable
of extracting information from an image or video by adaptively selecting a
sequence of regions or locations and only processing the selected regions at
high resolution. Like convolutional neural networks, the proposed model has a
degree of translation invariance built-in, but the amount of computation it
performs can be controlled independently of the input image size. While the
model is non-differentiable, it can be trained using reinforcement learning
methods to learn task-specific policies. We evaluate our model on several image
classification tasks, where it significantly outperforms a convolutional neural
network baseline on cluttered images, and on a dynamic visual control problem,
where it learns to track a simple object without an explicit training signal
for doing so.

This paper shows how Long Short-term Memory recurrent neural networks can be
used to generate complex sequences with long-range structure, simply by
predicting one data point at a time. The approach is demonstrated for text
(where the data are discrete) and online handwriting (where the data are
real-valued). It is then extended to handwriting synthesis by allowing the
network to condition its predictions on a text sequence. The resulting system
is able to generate highly realistic cursive handwriting in a wide variety of
styles.

We present the first deep learning model to successfully learn control
policies directly from high-dimensional sensory input using reinforcement
learning. The model is a convolutional neural network, trained with a variant
of Q-learning, whose input is raw pixels and whose output is a value function
estimating future rewards. We apply our method to seven Atari 2600 games from
the Arcade Learning Environment, with no adjustment of the architecture or
learning algorithm. We find that it outperforms all previous approaches on six
of the games and surpasses a human expert on three of them.
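
For reference, the underlying Q-learning update can be sketched in tabular form. The paper replaces the table with a convolutional network over raw pixels, so the dictionary-based table, the learning rate `alpha`, the discount `gamma`, and the toy states below are assumptions made purely for brevity.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step; q maps state -> list of action values."""
    # Bootstrap from the best next action unless the episode has ended.
    bootstrap = 0.0 if done else max(q[next_state])
    target = reward + gamma * bootstrap
    # Move the current estimate a fraction alpha towards the target.
    q[state][action] += alpha * (target - q[state][action])
    return target


q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
target = q_learning_update(q, "s0", 1, reward=1.0, next_state="s1", done=False)
```

In a deep variant, the same target would serve as the regression label for the network's predicted value of the chosen action.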

Recurrent neural networks (RNNs) are a powerful model for sequential data.
End-to-end training methods such as Connectionist Temporal Classification make
it possible to train RNNs for sequence labelling problems where the
input-output alignment is unknown. The combination of these methods with the
Long Short-term Memory RNN architecture has proved particularly fruitful,
delivering state-of-the-art results in cursive handwriting recognition. However,
RNN performance in speech recognition has so far been disappointing, with
better results returned by deep feedforward networks. This paper investigates
\emph{deep recurrent neural networks}, which combine the multiple levels of
representation that have proved so effective in deep networks with the flexible
use of long range context that empowers RNNs. When trained end-to-end with
suitable regularisation, we find that deep Long Short-term Memory RNNs achieve
a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to
our knowledge is the best recorded score.

Many machine learning tasks can be expressed as the transformation, or
\emph{transduction}, of input sequences into output sequences: speech
recognition, machine translation, protein secondary structure prediction and
text-to-speech to name but a few. One of the key challenges in sequence
transduction is learning to represent both the input and output sequences in a
way that is invariant to sequential distortions such as shrinking, stretching
and translating. Recurrent neural networks (RNNs) are a powerful sequence
learning architecture that has proven capable of learning such representations.
However, RNNs traditionally require a pre-defined alignment between the input
and output sequences to perform transduction. This is a severe limitation since
\emph{finding} the alignment is the most difficult aspect of many sequence
transduction problems. Indeed, even determining the length of the output
sequence is often challenging. This paper introduces an end-to-end,
probabilistic sequence transduction system, based entirely on RNNs, that is in
principle able to transform any input sequence into any finite, discrete output
sequence. Experimental results for phoneme recognition are provided on the
TIMIT speech corpus.

We compare the performance of a recurrent neural network with the best
results published so far on phoneme recognition in the TIMIT database. These
published results have been obtained with a combination of classifiers.
However, in this paper we apply a single recurrent neural network to the same
task. Our recurrent neural network attains an error rate of 24.6%. This result
is not significantly different from that obtained by the other best methods,
but they rely on a combination of classifiers for achieving comparable
performance.