
Recent years have witnessed significant progress in deep Reinforcement
Learning (RL). Empowered with large-scale neural networks, carefully designed
architectures, novel training algorithms and massively parallel computing
devices, researchers are able to attack many challenging RL problems. However,
in machine learning, more training power comes with a potential risk of more
overfitting. As deep RL techniques are being applied to critical problems such
as healthcare and finance, it is important to understand the generalization
behaviors of the trained agents. In this paper, we conduct a systematic study
of standard RL agents and find that they could overfit in various ways.
Moreover, overfitting could happen "robustly": commonly used techniques in RL
that add stochasticity do not necessarily prevent or detect overfitting. In
particular, the same agents and learning algorithms could have drastically
different test performance, even when all of them achieve optimal rewards
during training. The observations call for more principled and careful
evaluation protocols in RL. We conclude with a general discussion on
overfitting in RL and a study of the generalization behaviors from the
perspective of inductive bias.

Advances in deep generative networks have led to impressive results in recent
years. Nevertheless, such models can often waste their capacity on the minutiae
of datasets, presumably due to weak inductive biases in their decoders. This is
where graphics engines may come in handy since they abstract away low-level
details and represent images as high-level programs. Current methods that
combine deep learning and renderers are limited by handcrafted likelihood or
distance functions, a need for large amounts of supervision, or difficulties in
scaling their inference algorithms to richer datasets. To mitigate these
issues, we present SPIRAL, an adversarially trained agent that generates a
program which is executed by a graphics engine to interpret and sample images.
The goal of this agent is to fool a discriminator network that distinguishes
between real and rendered data, trained with a distributed reinforcement
learning setup without any supervision. A surprising finding is that using the
discriminator's output as a reward signal is the key to allowing the agent to make
meaningful progress at matching the desired output rendering. To the best of
our knowledge, this is the first demonstration of an end-to-end, unsupervised
and adversarial inverse graphics agent on challenging real-world (MNIST,
Omniglot, CelebA) and synthetic 3D datasets.

In this work we explore a straightforward variational Bayes scheme for
Recurrent Neural Networks. Firstly, we show that a simple adaptation of
truncated backpropagation through time can yield good quality uncertainty
estimates and superior regularisation at only a small extra computational cost
during training, also reducing the number of parameters by 80%. Secondly, we
demonstrate how a novel kind of posterior approximation yields further
improvements to the performance of Bayesian RNNs. We incorporate local gradient
information into the approximate posterior to sharpen it around the current
batch statistics. We show how this technique is not exclusive to recurrent
neural networks and can be applied more widely to train Bayesian neural
networks. We also empirically demonstrate how Bayesian RNNs are superior to
traditional RNNs on a language modelling benchmark and an image captioning
task, as well as showing how each of these methods improves our model over a
variety of other schemes for training them. We also introduce a new benchmark
for studying uncertainty for language models so future methods can be easily
compared.

Graphs are fundamental data structures which concisely capture the relational
structure in many important real-world domains, such as knowledge graphs,
physical and social interactions, language, and chemistry. Here we introduce a
powerful new approach for learning generative models over graphs, which can
capture both their structure and attributes. Our approach uses graph neural
networks to express probabilistic dependencies among a graph's nodes and edges,
and can, in principle, learn distributions over any arbitrary graph. In a
series of experiments our results show that once trained, our models can
generate good quality samples of both synthetic graphs and real
molecular graphs, both unconditionally and conditioned on data. Compared to
baselines that do not use graph-structured representations, our models often
perform far better. We also explore key challenges of learning generative
models of graphs, such as how to handle symmetries and ordering of elements
during the graph generation process, and offer possible solutions. Our work is
the first and most general approach for learning generative models over
arbitrary graphs, and opens new directions for moving away from restrictions of
vector and sequence-like knowledge representations, toward more expressive and
flexible relational data structures.

Deep neural networks have excelled on a wide range of problems, from vision
to language and game playing. Neural networks very gradually incorporate
information into weights as they process data, requiring very low learning
rates. If the training distribution shifts, the network is slow to adapt, and
when it does adapt, it typically performs badly on the training distribution
before the shift. Our method, Memory-based Parameter Adaptation, stores
examples in memory and then uses a context-based lookup to directly modify the
weights of a neural network. Much higher learning rates can be used for this
local adaptation, negating the need for many iterations over similar data
before good predictions can be made. As our method is memory-based, it
alleviates several shortcomings of neural networks, such as catastrophic
forgetting, and supports fast, stable acquisition of new knowledge, learning
with imbalanced class labels, and fast learning during evaluation. We demonstrate
this on a range of supervised tasks: large-scale image classification and
language modelling.

Deep autoregressive models have shown state-of-the-art performance in density
estimation for natural images on large-scale datasets such as ImageNet.
However, such models require many thousands of gradient-based weight updates
and unique image examples for training. Ideally, the models would rapidly learn
visual concepts from only a handful of examples, similar to the manner in which
humans learn across many vision tasks. In this paper, we show how 1) neural
attention and 2) meta learning techniques can be used in combination with
autoregressive models to enable effective few-shot density estimation. Our
proposed modifications to PixelCNN result in state-of-the-art few-shot density
estimation on the Omniglot dataset. Furthermore, we visualize the learned
attention policy and find that it learns intuitive algorithms for simple tasks
such as image mirroring on ImageNet and handwriting on Omniglot without
supervision. Finally, we extend the model to natural images and demonstrate
few-shot image generation on the Stanford Online Products dataset.

We explore efficient neural architecture search methods and show that a
simple yet powerful evolutionary algorithm can discover new architectures with
excellent performance. Our approach combines a novel hierarchical genetic
representation scheme that imitates the modularized design pattern commonly
adopted by human experts, and an expressive search space that supports complex
topologies. Our algorithm efficiently discovers architectures that outperform a
large number of manually designed models for image classification, obtaining
top-1 error of 3.6% on CIFAR-10 and 20.3% when transferred to ImageNet, which
is competitive with the best existing neural architecture search approaches. We
also present results using random search, achieving 0.3% less top-1 accuracy on
CIFAR-10 and 0.1% less on ImageNet whilst reducing the search time from 36
hours down to 1 hour.

We introduce Imagination-Augmented Agents (I2As), a novel architecture for
deep reinforcement learning combining model-free and model-based aspects. In
contrast to most existing model-based reinforcement learning and planning
methods, which prescribe how a model should be used to arrive at a policy, I2As
learn to interpret predictions from a learned environment model to construct
implicit plans in arbitrary ways, by using the predictions as additional
context in deep policy networks. I2As show improved data efficiency,
performance, and robustness to model misspecification compared to several
baselines.

Planning problems are among the most important and well-studied problems in
artificial intelligence. They are most typically solved by tree search
algorithms that simulate ahead into the future, evaluate future states, and
back up those evaluations to the root of a search tree. Among these algorithms,
Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely
used. A typical implementation of MCTS uses cleverly designed rules, optimized
to the particular characteristics of the domain. These rules control where the
simulation traverses, what to evaluate in the states that are reached, and how
to back up those evaluations. In this paper we instead learn where, what and
how to search. Our architecture, which we call an MCTSnet, incorporates
simulation-based search inside a neural network, by expanding, evaluating and
backing-up a vector embedding. The parameters of the network are trained
end-to-end using gradient-based optimisation. When applied to small searches in
the well-known planning problem Sokoban, the learned search algorithm
significantly outperformed MCTS baselines.

Learning from a few examples remains a key challenge in machine learning.
Despite recent advances in important domains such as vision and language, the
standard supervised deep learning paradigm does not offer a satisfactory
solution for learning new concepts rapidly from little data. In this work, we
employ ideas from metric learning based on deep neural features and from recent
advances that augment neural networks with external memories. Our framework
learns a network that maps a small labelled support set and an unlabelled
example to its label, obviating the need for fine-tuning to adapt to new class
types. We then define one-shot learning problems on vision (using Omniglot,
ImageNet) and language tasks. Our algorithm improves one-shot accuracy on
ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to
competing approaches. We also demonstrate the usefulness of the same model on
language modeling by introducing a one-shot task on the Penn Treebank.
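
The mapping from a labelled support set and an unlabelled example to a label can be illustrated with a minimal sketch. It assumes a softmax attention over cosine similarities, with raw vectors standing in for learned deep features; the function names and the `scale` temperature are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(query, support, num_classes, scale=10.0):
    """Attend over the support set and sum attention mass per label.

    support: list of (embedding, label) pairs.
    scale: assumed temperature that sharpens the attention (illustrative).
    """
    weights = softmax([scale * cosine(query, emb) for emb, _ in support])
    probs = [0.0] * num_classes
    for weight, (_, label) in zip(weights, support):
        probs[label] += weight
    return probs

# Toy support set: two examples of class 0, one of class 1.
support = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.9, 0.1], 0)]
probs = classify([1.0, 0.1], support, num_classes=2)
predicted = probs.index(max(probs))
```

Because the prediction is a function of the support set itself, new classes can be handled by swapping in a new support set, with no weight updates.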

This paper introduces SC2LE (StarCraft II Learning Environment), a
reinforcement learning environment based on the StarCraft II game. This domain
poses a new grand challenge for reinforcement learning, representing a more
difficult class of problems than considered in most prior work. It is a
multi-agent problem with multiple players interacting; there is imperfect
information due to a partially observed map; it has a large action space
involving the selection and control of hundreds of units; it has a large state
space that must be observed solely from raw input feature planes; and it has
delayed credit assignment requiring long-term strategies over thousands of
steps. We describe the observation, action, and reward specification for the
StarCraft II domain and provide an open-source Python-based interface for
communicating with the game engine. In addition to the main game maps, we
provide a suite of mini-games focusing on different elements of StarCraft II
gameplay. For the main game maps, we also provide an accompanying dataset of
game replay data from human expert players. We give initial baseline results
for neural networks trained from this data to predict game outcomes and player
actions. Finally, we present initial baseline results for canonical deep
reinforcement learning agents applied to the StarCraft II domain. On the
mini-games, these agents learn to achieve a level of play that is comparable to
a novice player. However, when trained on the main game, these agents are
unable to make significant progress. Thus, SC2LE offers a new and challenging
environment for exploring deep reinforcement learning algorithms and
architectures.

Conventional wisdom holds that model-based planning is a powerful approach to
sequential decision-making. It is often very challenging in practice, however,
because while a model can be used to evaluate a plan, it does not prescribe how
to construct a plan. Here we introduce the "Imagination-based Planner", the
first model-based, sequential decision-making agent that can learn to
construct, evaluate, and execute plans. Before any action, it can perform a
variable number of imagination steps, which involve proposing an imagined
action and evaluating it with its model-based imagination. All imagined actions
and outcomes are aggregated, iteratively, into a "plan context" which
conditions future real and imagined actions. The agent can even decide how to
imagine: testing out alternative imagined actions, chaining sequences of
actions together, or building a more complex "imagination tree" by navigating
flexibly among the previously imagined states using a learned policy. And our
agent can learn to plan economically, jointly optimizing for external rewards
and computational costs associated with using its imagination. We show that our
architecture can learn to solve a challenging continuous control problem, and
also learn elaborate planning strategies in a discrete maze-solving task. Our
work opens a new direction toward learning the components of a model-based
planning system and how to use them.

Training directed neural networks typically requires forward-propagating data
through a computation graph, followed by backpropagating an error signal, to
produce weight updates. All layers, or more generally, modules, of the network
are therefore locked, in the sense that they must wait for the remainder of the
network to execute forwards and propagate error backwards before they can be
updated. In this work we break this constraint by decoupling modules by
introducing a model of the future computation of the network graph. These
models predict what the result of the modelled subgraph will produce using only
local information. In particular we focus on modelling error gradients: by
using the modelled synthetic gradient in place of true backpropagated error
gradients we decouple subgraphs, and can update them independently and
asynchronously, i.e. we realise decoupled neural interfaces. We show results for
feedforward models, where every layer is trained asynchronously, recurrent
neural networks (RNNs) where predicting one's future gradient extends the time
over which the RNN can effectively model, and also a hierarchical RNN system
with ticking at different timescales. Finally, we demonstrate that in addition
to predicting gradients, the same framework can be used to predict inputs,
resulting in models which are decoupled in both the forward and backwards pass,
amounting to independent networks which co-learn such that they can be
composed into a single functioning corporation.

Supervised learning on molecules has incredible potential to be useful in
chemistry, drug discovery, and materials science. Luckily, several promising
and closely related neural network models invariant to molecular symmetries
have already been described in the literature. These models learn a message
passing algorithm and aggregation procedure to compute a function of their
entire input graph. At this point, the next step is to find a particularly
effective variant of this general approach and apply it to chemical prediction
benchmarks until we either solve them or reach the limits of the approach. In
this paper, we reformulate existing models into a single common framework we
call Message Passing Neural Networks (MPNNs) and explore additional novel
variations within this framework. Using MPNNs we demonstrate state of the art
results on an important molecular property prediction benchmark; these results
are strong enough that we believe future work should focus on datasets with
larger molecules or more accurate ground truth labels.
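
The pattern of message passing followed by aggregation over the whole graph can be sketched as follows. The message, update and readout functions here are fixed toy functions standing in for learned networks, and all names are hypothetical; this illustrates the general scheme under those assumptions, not any specific variant from the paper.

```python
def message_passing_step(h, edges, message_fn, update_fn):
    """One round of message passing on an undirected graph.

    h: dict mapping node -> state vector (list of floats).
    edges: list of (v, w, edge_feature) tuples.
    """
    dim = len(next(iter(h.values())))
    incoming = {v: [0.0] * dim for v in h}
    for v, w, e_vw in edges:
        # Each undirected edge carries a message in both directions.
        for dst, src in ((v, w), (w, v)):
            msg = message_fn(h[dst], h[src], e_vw)
            incoming[dst] = [a + b for a, b in zip(incoming[dst], msg)]
    return {v: update_fn(h[v], incoming[v]) for v in h}

def readout(h):
    """Sum node states, so the result ignores node ordering."""
    dim = len(next(iter(h.values())))
    return [sum(state[i] for state in h.values()) for i in range(dim)]

# Hypothetical stand-ins for learned message and update networks.
def message_fn(h_v, h_w, e_vw):
    return [e_vw * x for x in h_w]

def update_fn(h_v, m_v):
    return [0.5 * a + 0.5 * b for a, b in zip(h_v, m_v)]

# Toy three-atom graph with scalar bond features.
h0 = {"C": [1.0, 0.0], "O": [0.0, 1.0], "H": [0.5, 0.5]}
edges = [("C", "O", 2.0), ("C", "H", 1.0)]
h1 = message_passing_step(h0, edges, message_fn, update_fn)
graph_vector = readout(h1)
```

Summation is used for both the message aggregation and the readout because it does not depend on the order in which nodes or edges are listed, which is one simple way to respect the symmetries mentioned above.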

We investigate the impact of choosing regressors and molecular
representations for the construction of fast machine learning (ML) models of
thirteen electronic ground-state properties of organic molecules. The
performance of each regressor/representation/property combination is assessed
using learning curves which report out-of-sample errors as a function of
training set size with up to ~117k distinct molecules. Molecular
structures and properties at hybrid density functional theory (DFT) level of
theory used for training and testing come from the QM9 database [Ramakrishnan
et al., Scientific Data 1, 140022 (2014)] and include dipole moment,
polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero
point vibrational energy, enthalpies and free energies of atomization, heat
capacity and the highest fundamental vibrational frequency. Various
representations from the literature have been studied (Coulomb matrix, bag of
bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed
distribution-based variants including histograms of distances (HD), angles
(HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian
ridge regression (BR) and linear regression with elastic net regularization
(EN)), random forest (RF), kernel ridge regression (KRR) and two types of
neural networks, graph convolutions (GC) and gated graph networks (GG). We
present numerical evidence that ML model predictions deviate from DFT less than
DFT deviates from experiment for all properties. Furthermore, our out-of-sample
prediction errors with respect to hybrid DFT reference are on par with, or
close to, chemical accuracy. Our findings suggest that ML models could be more
accurate than hybrid DFT if explicitly electron correlated quantum (or
experimental) data were available.

Many machine learning systems are built to solve the hardest examples of a
particular task, which often makes them large and expensive to run, especially
with respect to the easier examples, which might require much less computation.
For an agent with a limited computational budget, this "one-size-fits-all"
approach may result in the agent wasting valuable computation on easy examples,
while not spending enough on hard examples. Rather than learning a single,
fixed policy for solving all instances of a task, we introduce a metacontroller
which learns to optimize a sequence of "imagined" internal simulations over
predictive models of the world in order to construct a more informed, and more
economical, solution. The metacontroller component is a model-free
reinforcement learning agent, which decides both how many iterations of the
optimization procedure to run, as well as which model to consult on each
iteration. The models (which we call "experts") can be state transition models,
action-value functions, or any other mechanism that provides information useful
for solving the task, and can be learned on-policy or off-policy in parallel
with the metacontroller. When the metacontroller, controller, and experts were
trained with "interaction networks" (Battaglia et al., 2016) as expert models,
our approach was able to solve a challenging decision-making problem under
complex nonlinear dynamics. The metacontroller learned to adapt the amount of
computation it performed to the difficulty of the task, and learned how to
choose which experts to consult by factoring in both their reliability and
individual computational resource costs. This allowed the metacontroller to
achieve a lower overall cost (task loss plus computational cost) than more
traditional fixed policy approaches. These results demonstrate that our
approach is a powerful framework for using...

Deep reinforcement learning methods attain superhuman performance in a wide
range of environments. Such methods are grossly inefficient, often taking
orders of magnitude more data than humans to achieve reasonable performance.
We propose Neural Episodic Control: a deep reinforcement learning agent that is
able to rapidly assimilate new experiences and act upon them. Our agent uses a
semi-tabular representation of the value function: a buffer of past experience
containing slowly changing state representations and rapidly updated estimates
of the value function. We show across a wide range of environments that our
agent learns significantly faster than other state-of-the-art, general-purpose
deep reinforcement learning agents.
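
A semi-tabular value representation of this kind can be sketched as a small key-value memory. The sketch assumes a k-nearest-neighbour lookup with an inverse-distance kernel and a simple moving update for repeated keys; the class name, the kernel and the parameters `k`, `delta` and `lr` are illustrative assumptions, not details given in the abstract.

```python
class EpisodicValueMemory:
    """Buffer of (state representation, value estimate) pairs."""

    def __init__(self, k=2, delta=1e-3, lr=0.5):
        self.k = k          # number of neighbours used per lookup
        self.delta = delta  # keeps the kernel finite at zero distance
        self.lr = lr        # step size for updating an existing entry
        self.keys = []
        self.values = []

    @staticmethod
    def _sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def lookup(self, query):
        """Kernel-weighted average of the k nearest stored values."""
        if not self.keys:
            return 0.0
        nearest = sorted(
            (self._sq_dist(query, key), value)
            for key, value in zip(self.keys, self.values)
        )[: self.k]
        weights = [1.0 / (dist + self.delta) for dist, _ in nearest]
        weighted = sum(w * v for w, (_, v) in zip(weights, nearest))
        return weighted / sum(weights)

    def write(self, key, target):
        """Append a new entry, or move an existing one toward the target."""
        key = list(key)
        if key in self.keys:
            i = self.keys.index(key)
            self.values[i] += self.lr * (target - self.values[i])
        else:
            self.keys.append(key)
            self.values.append(target)

memory = EpisodicValueMemory(k=2)
memory.write([0.0, 0.0], 1.0)
memory.write([10.0, 10.0], 5.0)
estimate = memory.lookup([0.1, 0.0])  # dominated by the nearby entry
```

A single write changes the estimate for nearby states immediately, which is the sense in which such a table can be updated much faster than network weights.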

When training neural networks, the use of Synthetic Gradients (SG) allows
layers or modules to be trained without update locking (that is, without waiting
for a true error gradient to be backpropagated), resulting in Decoupled Neural
Interfaces (DNIs). This unlocked ability of being able to update parts of a
neural network asynchronously and with only local information was demonstrated
to work empirically in Jaderberg et al. (2016). However, there has been very
little demonstration of what changes DNIs and SGs impose from a functional,
representational, and learning dynamics point of view. In this paper, we study
DNIs through the use of synthetic gradients on feedforward networks to better
understand their behaviour and elucidate their effect on optimisation. We show
that the incorporation of SGs does not affect the representational strength of
the learning system for a neural network, and prove the convergence of the
learning system for linear and deep linear models. On practical problems we
investigate the mechanism by which synthetic gradient estimators approximate
the true loss, and, surprisingly, how that leads to drastically different
layerwise representations. Finally, we also expose the relationship of using
synthetic gradients to other error approximation techniques and find a unifying
language for discussion and comparison.

Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.

The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences
and in-the-wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.

The recent application of RNN encoder-decoder models has resulted in
substantial progress in fully data-driven dialogue systems, but evaluation
remains a challenge. An adversarial loss could be a way to directly evaluate
the extent to which generated dialogue responses sound like they came from a
human. This could reduce the need for human evaluation, while more directly
evaluating on a generative task. In this work, we investigate this idea by
training an RNN to discriminate a dialogue model's samples from human-generated
samples. Although we find some evidence this setup could be viable, we also
note that many issues remain in its practical application. We discuss both
aspects and conclude that future work is warranted.

Both generative adversarial networks (GANs) in unsupervised learning and
actor-critic methods in reinforcement learning (RL) have gained a reputation
for being difficult to optimize. Practitioners in both fields have amassed a
large number of strategies to mitigate these instabilities and improve
training. Here we show that GANs can be viewed as actor-critic methods in an
environment where the actor cannot affect the reward. We review the strategies
for stabilizing training for each class of models, both those that generalize
between the two and those that are particular to that model. We also review a
number of extensions to GANs and RL algorithms with even more complicated
information flow. We hope that by highlighting this formal connection we will
encourage both GAN and RL communities to develop general, scalable, and stable
algorithms for multilevel optimization with deep networks, and to draw
inspiration across communities.

We introduce a new neural architecture to learn the conditional probability
of an output sequence with elements that are discrete tokens corresponding to
positions in an input sequence. Such problems cannot be trivially addressed by
existing approaches such as sequence-to-sequence and Neural Turing Machines,
because the number of target classes in each step of the output depends on the
length of the input, which is variable. Problems such as sorting variable-sized
sequences and various combinatorial optimization problems belong to this
class. Our model solves the problem of variable size output dictionaries using
a recently proposed mechanism of neural attention. It differs from the previous
attention attempts in that, instead of using attention to blend hidden units of
an encoder to a context vector at each decoder step, it uses attention as a
pointer to select a member of the input sequence as the output. We call this
architecture a Pointer Net (Ptr-Net). We show Ptr-Nets can be used to learn
approximate solutions to three challenging geometric problems (finding planar
convex hulls, computing Delaunay triangulations, and the planar Travelling
Salesman Problem) using training examples alone. Ptr-Nets not only improve
over sequence-to-sequence with input attention, but also allow us to generalize
to variable size output dictionaries. We show that the learnt models generalize
beyond the maximum lengths they were trained on. We hope our results on these
tasks will encourage a broader exploration of neural learning for discrete
problems.
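
The idea of using attention as a pointer can be illustrated with a short sketch. It assumes an additive attention score of the form v . tanh(W1 e_j + W2 d) and uses small hand-set matrices in place of learned parameters; the function names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return a softmax distribution over input positions.

    The output has one entry per encoder state, so the size of the
    output dictionary follows the length of the input.
    """
    projected_decoder = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        hidden = [
            math.tanh(a + b)
            for a, b in zip(matvec(W1, e_j), projected_decoder)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Illustrative parameters: identity projections and a ones vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
inputs = [[-1.0, -1.0], [2.0, 2.0], [0.0, 0.0]]
probs = pointer_distribution(inputs, [0.0, 0.0], identity, identity, v)
pointed_index = probs.index(max(probs))
```

Instead of blending the encoder states into a context vector, the attention weights are themselves the output distribution, so the same parameters apply unchanged to inputs of any length.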

Neural Machine Translation (NMT) is an end-to-end learning approach for
automated translation, with the potential to overcome many of the weaknesses of
conventional phrase-based translation systems. Unfortunately, NMT systems are
known to be computationally expensive both in training and in translation
inference. Also, most NMT systems have difficulty with rare words. These issues
have hindered NMT's use in practical deployments and services, where both
accuracy and speed are essential. In this work, we present GNMT, Google's
Neural Machine Translation system, which attempts to address many of these
issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder
layers using attention and residual connections. To improve parallelism and
therefore decrease training time, our attention mechanism connects the bottom
layer of the decoder to the top layer of the encoder. To accelerate the final
translation speed, we employ low-precision arithmetic during inference
computations. To improve handling of rare words, we divide words into a limited
set of common subword units ("wordpieces") for both input and output. This
method provides a good balance between the flexibility of "character"-delimited
models and the efficiency of "word"-delimited models, naturally handles
translation of rare words, and ultimately improves the overall accuracy of the
system. Our beam search technique employs a length-normalization procedure and
uses a coverage penalty, which encourages generation of an output sentence that
is most likely to cover all the words in the source sentence. On the WMT'14
English-to-French and English-to-German benchmarks, GNMT achieves results
competitive with the state of the art. Using a human side-by-side evaluation on a set of
isolated simple sentences, it reduces translation errors by an average of 60%
compared to Google's phrase-based production system.
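
The abstract names a length-normalization procedure and a coverage penalty without giving formulas. The sketch below shows one plausible form of such a beam score: the log-probability is divided by a length penalty, and a term is added that penalises source words receiving little total attention. The exact functional form, the constants 5 and 6, the default `alpha` and `beta`, and the small floor inside the logarithm are all assumptions made for illustration.

```python
import math

def length_penalty(length, alpha=0.2):
    """Grows slowly with output length; equals 1.0 for length 1."""
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attention, beta=0.2):
    """Penalise source positions whose summed attention is below 1.

    attention[j][i] is the attention of target word j on source word i.
    """
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(row[i] for row in attention)
        # Assumed floor of 1e-9 avoids log(0) for unattended words.
        total += math.log(min(max(covered, 1e-9), 1.0))
    return beta * total

def beam_score(log_prob, attention, alpha=0.2, beta=0.2):
    """Length-normalised log-probability plus coverage penalty."""
    normalised = log_prob / length_penalty(len(attention), alpha)
    return normalised + coverage_penalty(attention, beta)

# Two hypotheses with equal log-probability and length.
full_coverage = [[1.0, 0.0], [0.0, 1.0]]
partial_coverage = [[1.0, 0.0], [0.5, 0.5]]
score_full = beam_score(-2.0, full_coverage)
score_partial = beam_score(-2.0, partial_coverage)
```

With equal log-probabilities, the hypothesis that attends to every source word is ranked higher, and dividing by the length penalty reduces the bias of raw log-probability toward short outputs.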

We propose a probabilistic video model, the Video Pixel Network (VPN), that
estimates the discrete joint distribution of the raw pixel values in a video.
The model and the neural architecture reflect the time, space and color
structure of video tensors and encode it as a four-dimensional dependency
chain. The VPN approaches the best possible performance on the Moving MNIST
benchmark, a leap over the previous state of the art, and the generated videos
show only minor deviations from the ground truth. The VPN also produces
detailed samples on the action-conditional Robotic Pushing benchmark and
generalizes to the motion of novel objects.