
Applying end-to-end learning to solve complex, interactive, pixel-driven
control tasks on a robot is an unsolved problem. Deep Reinforcement Learning
algorithms are too slow to achieve performance on a real robot, but their
potential has been demonstrated in simulated environments. We propose using
progressive networks to bridge the reality gap and transfer learned policies
from simulation to the real world. The progressive net approach is a general
framework that enables reuse of everything from low-level visual features to
high-level policies for transfer to new tasks, enabling a compositional, yet
simple, approach to building complex skills. We present an early demonstration
of this approach with a number of experiments in the domain of robot
manipulation that focus on bridging the reality gap. Unlike other proposed
approaches, our real-world experiments demonstrate successful task learning
from raw visual input on a fully actuated robot manipulator. Moreover, rather
than relying on model-based trajectory optimisation, the task learning is
accomplished using only deep reinforcement learning and sparse rewards.

Advanced optimization algorithms such as Newton's method and AdaGrad benefit
from second-order derivatives or second-order statistics to achieve better
descent directions and faster convergence rates. At their heart, such
algorithms need to compute the inverse or inverse square root of a matrix whose
size is quadratic in the dimensionality of the search space. For
high-dimensional search spaces, computing the matrix inverse or inverse square
root becomes prohibitively expensive, which in turn calls for approximate
methods. In this
work, we propose a new matrix approximation method which divides a matrix into
blocks and represents each block by one or two numbers. The method allows
efficient computation of matrix inverse and inverse square root. We apply our
method to AdaGrad in training deep neural networks. Experiments show
encouraging results compared to the diagonal approximation.
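
For orientation, the diagonal approximation that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of standard diagonal AdaGrad, not the block approximation proposed by the authors; the function name, the learning rate and the toy objective are assumptions made for this example.

```python
import math

def adagrad_diagonal_step(params, grads, accum, lr=0.5, eps=1e-8):
    """One diagonal AdaGrad update (illustrative baseline only).

    accum holds the running sum of squared gradients per coordinate,
    i.e. the diagonal of the second-order statistics matrix.
    """
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_params = [
        p - lr * g / (math.sqrt(a) + eps)
        for p, g, a in zip(params, grads, new_accum)
    ]
    return new_params, new_accum

def toy_loss(params):
    # Hypothetical badly scaled quadratic: f(x, y) = x^2 + 10 y^2
    x, y = params
    return x * x + 10.0 * y * y

def toy_grad(params):
    x, y = params
    return [2.0 * x, 20.0 * y]

params, accum = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    params, accum = adagrad_diagonal_step(params, toy_grad(params), accum)
final_loss = toy_loss(params)
```

The diagonal version only needs an element-wise square root, which is why it is cheap; a full-matrix variant would need the inverse square root of the whole accumulated matrix.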

Graphs are fundamental data structures which concisely capture the relational
structure in many important real-world domains, such as knowledge graphs,
physical and social interactions, language, and chemistry. Here we introduce a
powerful new approach for learning generative models over graphs, which can
capture both their structure and attributes. Our approach uses graph neural
networks to express probabilistic dependencies among a graph's nodes and edges,
and can, in principle, learn distributions over any arbitrary graph. In a
series of experiments our results show that once trained, our models can
generate good-quality samples of both synthetic graphs and real
molecular graphs, both unconditionally and conditioned on data. Compared to
baselines that do not use graph-structured representations, our models often
perform far better. We also explore key challenges of learning generative
models of graphs, such as how to handle symmetries and ordering of elements
during the graph generation process, and offer possible solutions. Our work is
the first and most general approach for learning generative models over
arbitrary graphs, and opens new directions for moving away from restrictions of
vector and sequence-like knowledge representations, toward more expressive and
flexible relational data structures.

Deep neural networks have excelled on a wide range of problems, from vision
to language and game playing. Neural networks very gradually incorporate
information into weights as they process data, requiring very low learning
rates. If the training distribution shifts, the network is slow to adapt, and
when it does adapt, it typically performs badly on the training distribution
before the shift. Our method, Memory-based Parameter Adaptation, stores
examples in memory and then uses a context-based lookup to directly modify the
weights of a neural network. Much higher learning rates can be used for this
local adaptation, negating the need for many iterations over similar data
before good predictions can be made. As our method is memory-based, it
alleviates several shortcomings of neural networks: it mitigates catastrophic
forgetting and supports fast, stable acquisition of new knowledge, learning
with imbalanced class labels, and fast learning during evaluation. We
demonstrate this on a range of supervised tasks: large-scale image
classification and language modelling.

Deep neural networks (DNNs) continue to make significant advances, solving
tasks from image classification to translation or reinforcement learning. One
aspect of the field receiving considerable attention is efficiently executing
deep models in resource-constrained environments, such as mobile or embedded
devices. This paper focuses on this problem, and proposes two new compression
methods, which jointly leverage weight quantization and distillation of larger
teacher networks into smaller student networks. The first method we propose is
called quantized distillation and leverages distillation during the training
process, by incorporating distillation loss, expressed with respect to the
teacher, into the training of a student network whose weights are quantized to
a limited set of levels. The second method, differentiable quantization,
optimizes the location of quantization points through stochastic gradient
descent, to better fit the behavior of the teacher model. We validate both
methods through experiments on convolutional and recurrent architectures. We
show that quantized shallow students can reach similar accuracy levels to
full-precision teacher models, while providing order-of-magnitude compression
and inference speedup that is linear in the depth reduction. In sum, our
results enable DNNs for resource-constrained environments to leverage
architecture and accuracy advances developed on more powerful devices.
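
To make "weights quantized to a limited set of levels" concrete, here is a minimal sketch of uniform quantization. It assumes evenly spaced levels between the smallest and largest weight; the function name and this particular scheme are illustrative assumptions, not the authors' exact procedure (their second method learns the quantization points rather than fixing them).

```python
def quantize_uniform(weights, num_levels):
    """Snap each weight to the nearest of num_levels evenly spaced values.

    Illustrative only: levels span [min(weights), max(weights)].
    """
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights identical: nothing to quantize.
        return list(weights)
    step = (hi - lo) / (num_levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

quantized = quantize_uniform([0.0, 0.1, 0.5, 0.9, 1.0], 3)
```

With three levels, every weight in the example is mapped to one of 0.0, 0.5 or 1.0, so each weight can be stored as a two-bit index plus a shared table of levels.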

We introduce Imagination-Augmented Agents (I2As), a novel architecture for
deep reinforcement learning combining model-free and model-based aspects. In
contrast to most existing model-based reinforcement learning and planning
methods, which prescribe how a model should be used to arrive at a policy, I2As
learn to interpret predictions from a learned environment model to construct
implicit plans in arbitrary ways, by using the predictions as additional
context in deep policy networks. I2As show improved data efficiency,
performance, and robustness to model misspecification compared to several
baselines.

At the heart of deep learning we aim to use neural networks as function
approximators, training them to produce outputs from inputs in emulation of a
ground truth function or data creation process. In many cases we only have
access to input-output pairs from the ground truth; however, it is becoming more
common to have access to derivatives of the target output with respect to the
input, for example when the ground truth function is itself a neural network
such as in network compression or distillation. Generally these target
derivatives are not computed, or are ignored. This paper introduces Sobolev
Training for neural networks, which is a method for incorporating these target
derivatives in addition to the target values while training. By optimising
neural networks to not only approximate the function's outputs but also the
function's derivatives we encode additional information about the target
function within the parameters of the neural network. Thereby we can improve
the quality of our predictors, as well as the data-efficiency and
generalization capabilities of our learned function approximation. We provide
theoretical justifications for such an approach as well as examples of
empirical evidence on three distinct domains: regression on classical
optimisation datasets, distilling policies of an agent playing Atari, and on
large-scale applications of synthetic gradients. In all three domains the use
of Sobolev Training, employing target derivatives in addition to target values,
results in models with higher accuracy and stronger generalisation.
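
The idea of matching derivatives as well as values can be sketched with a simple one-dimensional loss. This is a hedged illustration under stated assumptions: squared error for both terms, first derivatives only, and derivative functions supplied by hand; the names and the weighting parameter are hypothetical.

```python
def sobolev_loss(model, model_grad, target, target_grad, xs, weight=1.0):
    """Mean squared error on values plus weighted error on first derivatives."""
    value_term = sum((model(x) - target(x)) ** 2 for x in xs)
    deriv_term = sum((model_grad(x) - target_grad(x)) ** 2 for x in xs)
    return (value_term + weight * deriv_term) / len(xs)

# Target: f(x) = x^2 with derivative 2x.
target = lambda x: x * x
target_grad = lambda x: 2.0 * x

# Candidate model: m(x) = x with derivative 1.
model = lambda x: x
model_grad = lambda x: 1.0

xs = [0.0]
value_only = sobolev_loss(model, model_grad, target, target_grad, xs, weight=0.0)
with_derivs = sobolev_loss(model, model_grad, target, target_grad, xs, weight=1.0)
```

At the single sample point x = 0 the candidate matches the target's value exactly, so a value-only loss cannot tell them apart, while the derivative term exposes the mismatch in slope.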

Conventional wisdom holds that model-based planning is a powerful approach to
sequential decision-making. It is often very challenging in practice, however,
because while a model can be used to evaluate a plan, it does not prescribe how
to construct a plan. Here we introduce the "Imagination-based Planner", the
first model-based, sequential decision-making agent that can learn to
construct, evaluate, and execute plans. Before any action, it can perform a
variable number of imagination steps, which involve proposing an imagined
action and evaluating it with its model-based imagination. All imagined actions
and outcomes are aggregated, iteratively, into a "plan context" which
conditions future real and imagined actions. The agent can even decide how to
imagine: testing out alternative imagined actions, chaining sequences of
actions together, or building a more complex "imagination tree" by navigating
flexibly among the previously imagined states using a learned policy. And our
agent can learn to plan economically, jointly optimizing for external rewards
and computational costs associated with using its imagination. We show that our
architecture can learn to solve a challenging continuous control problem, and
also learn elaborate planning strategies in a discrete maze-solving task. Our
work opens a new direction toward learning the components of a model-based
planning system and how to use them.

Most deep reinforcement learning algorithms are data inefficient in complex
and rich environments, limiting their applicability to many scenarios. One
direction for improving data efficiency is multitask learning with shared
neural network parameters, where efficiency may be improved through transfer
across related tasks. In practice, however, this is not usually observed,
because gradients from different tasks can interfere negatively, making
learning unstable and sometimes even less data efficient. Another issue is the
different reward schemes between tasks, which can easily lead to one task
dominating the learning of a shared model. We propose a new approach for joint
training of multiple tasks, which we refer to as Distral (Distill & transfer
learning). Instead of sharing parameters between the different workers, we
propose to share a "distilled" policy that captures common behaviour across
tasks. Each worker is trained to solve its own task while constrained to stay
close to the shared policy, while the shared policy is trained by distillation
to be the centroid of all task policies. Both aspects of the learning process
are derived by optimizing a joint objective function. We show that our approach
supports efficient transfer on complex 3D environments, outperforming several
related methods. Moreover, the proposed learning process is more robust and
more stable, attributes that are critical in deep reinforcement learning.

From just a glance, humans can make rich predictions about the future state
of a wide range of physical systems. On the other hand, modern approaches from
engineering, robotics, and graphics are often restricted to narrow domains and
require direct measurements of the underlying states. We introduce the Visual
Interaction Network, a general-purpose model for learning the dynamics of a
physical system from raw visual observations. Our model consists of a
perceptual front-end based on convolutional neural networks and a dynamics
predictor based on interaction networks. Through joint training, the perceptual
front-end learns to parse a dynamic visual scene into a set of factored latent
object representations. The dynamics predictor learns to roll these states
forward in time by computing their interactions and dynamics, producing a
predicted physical trajectory of arbitrary length. We found that from just six
input video frames the Visual Interaction Network can generate accurate future
trajectories of hundreds of time steps on a wide range of physical systems. Our
model can also be applied to scenes with invisible objects, inferring their
future states from their effects on the visible objects, and can implicitly
infer the unknown mass of objects. Our results demonstrate that the perceptual
module and the object-based dynamics predictor module can induce factored
latent representations that support accurate dynamical predictions. This work
opens new opportunities for model-based decision-making and planning from raw
sensory observations in complex physical environments.

Relational reasoning is a central component of generally intelligent
behavior, but has proven difficult for neural networks to learn. In this paper
we describe how to use Relation Networks (RNs) as a simple plug-and-play module
to solve problems that fundamentally hinge on relational reasoning. We tested
RN-augmented networks on three tasks: visual question answering using a
challenging dataset called CLEVR, on which we achieve state-of-the-art,
superhuman performance; text-based question answering using the bAbI suite of
tasks; and complex reasoning about dynamic physical systems. Then, using a
curated dataset called Sort-of-CLEVR we show that powerful convolutional
networks do not have a general capacity to solve relational questions, but can
gain this capacity when augmented with RNs. Our work shows how a deep learning
architecture equipped with an RN module can implicitly discover and learn to
reason about entities and their relations.
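
A Relation Network is commonly described as applying a shared function to every pair of objects, summing the results, and passing the sum through a second function. The sketch below follows that description; the pairwise function `g` and readout `f` are hand-written placeholders standing in for learned networks, so every name here is an assumption made for illustration.

```python
def relation_network(objects, g, f):
    """Compute f(sum over all ordered pairs (i, j) of g(o_i, o_j))."""
    pair_sum = None
    for o_i in objects:
        for o_j in objects:
            relation = g(o_i, o_j)
            if pair_sum is None:
                pair_sum = list(relation)
            else:
                pair_sum = [a + b for a, b in zip(pair_sum, relation)]
    return f(pair_sum)

# Placeholder functions (in practice both would be small neural networks).
pairwise_distance = lambda a, b: [abs(a - b)]
readout = lambda summed: summed[0]

total = relation_network([1, 2, 4], pairwise_distance, readout)
shuffled = relation_network([4, 1, 2], pairwise_distance, readout)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are presented, which is one reason the module can be plugged into other architectures without imposing an object ordering.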

Despite their overwhelming capacity to overfit, deep learning architectures
tend to generalize relatively well to unseen data, allowing them to be deployed
in practice. However, explaining why this is the case is still an open area of
research. One standing hypothesis that is gaining popularity, e.g. Hochreiter &
Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the
loss function found by stochastic gradient-based methods results in good
generalization. This paper argues that most notions of flatness are problematic
for deep models and cannot be directly applied to explain generalization.
Specifically, when focusing on deep networks with rectifier units, we can
exploit the particular geometry of parameter space induced by the inherent
symmetries that these architectures exhibit to build equivalent models
corresponding to arbitrarily sharper minima. Furthermore, if we allow a
function to be reparametrized, the geometry of its parameters can change drastically
without affecting its generalization properties.
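
The symmetry in question can be illustrated with a tiny two-layer rectifier model: scaling the first-layer weight by a positive factor and dividing the second-layer weight by the same factor leaves the function unchanged. This is a minimal scalar sketch; the function names and values are chosen for illustration only.

```python
import math

def relu(z):
    return max(0.0, z)

def two_layer(x, w1, w2):
    """Scalar two-layer rectifier network: w2 * relu(w1 * x)."""
    return w2 * relu(w1 * x)

def rescale(w1, w2, alpha):
    """Positive rescaling that preserves the function (alpha > 0)."""
    return alpha * w1, w2 / alpha

w1, w2 = 0.5, 2.0
scaled_w1, scaled_w2 = rescale(w1, w2, alpha=4.0)

inputs = [-1.0, 0.5, 3.0]
original_outputs = [two_layer(x, w1, w2) for x in inputs]
rescaled_outputs = [two_layer(x, scaled_w1, scaled_w2) for x in inputs]
```

Both parameter settings compute the same outputs, yet they sit at very different points in parameter space, so any flatness measure defined directly on the parameters can assign them different values.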

Many machine learning systems are built to solve the hardest examples of a
particular task, which often makes them large and expensive to run, especially
with respect to the easier examples, which might require much less computation.
For an agent with a limited computational budget, this "one-size-fits-all"
approach may result in the agent wasting valuable computation on easy examples,
while not spending enough on hard examples. Rather than learning a single,
fixed policy for solving all instances of a task, we introduce a metacontroller
which learns to optimize a sequence of "imagined" internal simulations over
predictive models of the world in order to construct a more informed, and more
economical, solution. The metacontroller component is a model-free
reinforcement learning agent, which decides both how many iterations of the
optimization procedure to run and which model to consult on each
iteration. The models (which we call "experts") can be state transition models,
action-value functions, or any other mechanism that provides information useful
for solving the task, and can be learned on-policy or off-policy in parallel
with the metacontroller. When the metacontroller, controller, and experts were
trained with "interaction networks" (Battaglia et al., 2016) as expert models,
our approach was able to solve a challenging decision-making problem under
complex nonlinear dynamics. The metacontroller learned to adapt the amount of
computation it performed to the difficulty of the task, and learned how to
choose which experts to consult by factoring in both their reliability and
individual computational resource costs. This allowed the metacontroller to
achieve a lower overall cost (task loss plus computational cost) than more
traditional fixed-policy approaches. These results demonstrate that our
approach is a powerful framework for using...

There has been a lot of recent interest in trying to characterize the error
surface of deep models. This stems from a long-standing question. Given that
deep networks are highly nonlinear systems optimized by local gradient methods,
why do they not seem to be affected by bad local minima? It is widely believed
that training of deep models using gradient methods works so well because the
error surface either has no local minima, or if they exist they need to be
close in value to the global minimum. It is known that such results hold under
very strong assumptions which are not satisfied by real models. In this paper
we present examples showing that for such a theorem to hold, additional
assumptions on the data, initialization schemes and/or the model classes have
to be made. We look at the particular case of finite-sized datasets. We
demonstrate that in this scenario one can construct counterexamples (datasets
or initialization schemes) in which the network does become susceptible to bad
local minima over the weight space.

Our world can be succinctly and compactly described as structured scenes of
objects and relations. A typical room, for example, contains salient objects
such as tables, chairs and books, and these objects typically relate to each
other by their underlying causes and semantics. This gives rise to correlated
features, such as position, function and shape. Humans exploit knowledge of
objects and their relations for learning a wide spectrum of tasks, and more
generally when learning the structure underlying observed data. In this work,
we introduce relation networks (RNs), a general-purpose neural network
architecture for object-relation reasoning. We show that RNs are capable of
learning object relations from scene description data. Furthermore, we show
that RNs can act as a bottleneck that induces the factorization of objects from
entangled scene description inputs, and from distributed deep representations
of scene images provided by a variational autoencoder. The model can also be
used in conjunction with differentiable memory mechanisms for implicit relation
discovery in one-shot learning tasks. Our results suggest that relation
networks are a potentially powerful architecture for solving a variety of
problems that require object relation reasoning.

The ability to learn tasks in a sequential fashion is crucial to the
development of artificial intelligence. Neural networks are not, in general,
capable of this and it has been widely thought that catastrophic forgetting is
an inevitable feature of connectionist models. We show that it is possible to
overcome this limitation and train networks that can maintain expertise on
tasks which they have not experienced for a long time. Our approach remembers
old tasks by selectively slowing down learning on the weights important for
those tasks. We demonstrate our approach is scalable and effective by solving a
set of classification tasks based on the MNIST handwritten digit dataset and
by learning several Atari 2600 games sequentially.
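
One way to picture "selectively slowing down learning on the weights important for those tasks" is a quadratic penalty that anchors each weight to its old value, scaled by a per-weight importance. The sketch below assumes the importance values are already given; how they are estimated, and all names used here, are assumptions for illustration rather than details stated in the abstract.

```python
import math

def consolidation_penalty(params, anchor_params, importance, strength=1.0):
    """Quadratic penalty pulling params toward the values learned on old tasks.

    importance[i] controls how strongly weight i is held in place.
    """
    return 0.5 * strength * sum(
        imp * (p - a) ** 2
        for p, a, imp in zip(params, anchor_params, importance)
    )

def total_loss(new_task_loss, params, anchor_params, importance, strength=1.0):
    """New-task loss plus the consolidation penalty."""
    return new_task_loss + consolidation_penalty(
        params, anchor_params, importance, strength
    )

anchor = [1.0, 1.0]
importance = [10.0, 0.1]  # hypothetical: first weight matters far more
move_important = consolidation_penalty([2.0, 1.0], anchor, importance)
move_unimportant = consolidation_penalty([1.0, 2.0], anchor, importance)
```

Moving the important weight by one unit costs a hundred times more than moving the unimportant one, so gradient descent on the total loss changes the unimportant weight more freely when learning a new task.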

Learning to navigate in complex environments with dynamic elements is an
important milestone in developing AI agents. In this work we formulate the
navigation question as a reinforcement learning problem and show that data
efficiency and task performance can be dramatically improved by relying on
additional auxiliary tasks leveraging multimodal sensory inputs. In particular
we consider jointly learning the goal-driven reinforcement learning problem
with auxiliary depth prediction and loop closure classification tasks. This
approach can learn to navigate from raw sensory input in complicated 3D mazes,
approaching human-level performance even under conditions where the goal
location changes frequently. We provide detailed analysis of the agent
behaviour, its ability to localise, and its network activity dynamics, showing
that the agent implicitly learns key navigation abilities.

Reasoning about objects, relations, and physics is central to human
intelligence, and a key goal of artificial intelligence. Here we introduce the
interaction network, a model which can reason about how objects in complex
systems interact, supporting dynamical predictions, as well as inferences about
the abstract properties of the system. Our model takes graphs as input,
performs object- and relation-centric reasoning in a way that is analogous to a
simulation, and is implemented using deep neural networks. We evaluate its
ability to reason about several challenging physical domains: n-body problems,
rigid-body collision, and non-rigid dynamics. Our results show it can be
trained to accurately simulate the physical trajectories of dozens of objects
over thousands of time steps, estimate abstract quantities such as energy, and
generalize automatically to systems with different numbers and configurations
of objects and relations. Our interaction network implementation is the first
general-purpose, learnable physics engine, and a powerful general framework for
reasoning about objects and relations in a wide variety of complex real-world
domains.

Learning to solve complex sequences of tasks, while both leveraging transfer
and avoiding catastrophic forgetting, remains a key obstacle to achieving
human-level intelligence. The progressive networks approach represents a step
forward in this direction: such networks are immune to forgetting and can leverage prior
knowledge via lateral connections to previously learned features. We evaluate
this architecture extensively on a wide variety of reinforcement learning tasks
(Atari and 3D maze games), and show that it outperforms common baselines based
on pretraining and finetuning. Using a novel sensitivity measure, we
demonstrate that transfer occurs at both low-level sensory and high-level
control layers of the learned policy.

Theano is a Python library that allows users to define, optimize, and evaluate
mathematical expressions involving multidimensional arrays efficiently. Since
its introduction, it has been one of the most used CPU and GPU mathematical
compilers, especially in the machine learning community, and has shown steady
performance improvements. Theano has been actively and continuously developed
since 2008; multiple frameworks have been built on top of it, and it has been
used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview
of the Theano software and its community. Section II presents the principal
features of Theano and how to use them, and compares them with other similar
projects. Section III focuses on recently introduced functionalities and
improvements. Section IV compares the performance of Theano against Torch7 and
TensorFlow on several machine learning models. Section V discusses current
limitations of Theano and potential ways of improving it.

Policies for complex visual tasks have been successfully learned with deep
reinforcement learning, using an approach called deep Q-networks (DQN), but
relatively large (task-specific) networks and extensive training are needed to
achieve good performance. In this work, we present a novel method called policy
distillation that can be used to extract the policy of a reinforcement learning
agent and train a new network that performs at the expert level while being
dramatically smaller and more efficient. Furthermore, the same method can be
used to consolidate multiple task-specific policies into a single policy. We
demonstrate these claims using the Atari domain and show that the multi-task
distilled agent outperforms the single-task teachers as well as a
jointly-trained DQN agent.
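
A distillation objective of this kind can be sketched as a divergence between the teacher's action preferences and the student's. The example below assumes a KL divergence between a temperature-softened softmax over teacher Q-values and a softmax over student outputs; the abstract does not specify the loss, so the choice of KL, the temperature parameter and all names are assumptions.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax; temperature below 1 sharpens the result."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_q, student_logits, temperature=1.0):
    """KL(teacher || student) over actions for a single state."""
    p = softmax(teacher_q, temperature)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher_q = [1.0, 2.0, 0.5]  # hypothetical Q-values: action 1 is preferred
agreeing = distillation_kl(teacher_q, [0.0, 5.0, 0.0], temperature=0.01)
disagreeing = distillation_kl(teacher_q, [5.0, 0.0, 0.0], temperature=0.01)
```

A student that puts its mass on the teacher's preferred action incurs a much smaller loss than one that prefers a different action, which is the signal used to train the smaller network.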

We introduce Natural Neural Networks, a novel family of algorithms that speed
up convergence by adapting their internal representation during training to
improve conditioning of the Fisher matrix. In particular, we show a specific
example that employs a simple and efficient reparametrization of the neural
network weights by implicitly whitening the representation obtained at each
layer, while preserving the feedforward computation of the network. Such
networks can be trained efficiently via the proposed Projected Natural Gradient
Descent algorithm (PRONG), which amortizes the cost of these reparametrizations
over many parameter updates and is closely related to the Mirror Descent online
learning algorithm. We highlight the benefits of our method on both
unsupervised and supervised learning tasks, and showcase its scalability by
training on the large-scale ImageNet Challenge dataset.

In this paper we propose and investigate a novel nonlinear unit, called $L_p$
unit, for deep neural networks. The proposed $L_p$ unit receives signals from
several projections of a subset of units in the layer below and computes a
normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$
unit. First, the proposed unit can be understood as a generalization of a
number of conventional pooling operators such as average, root-mean-square and
max pooling widely used in, for instance, convolutional neural networks (CNN),
HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain
degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013)
which achieved state-of-the-art object recognition results on a number of
benchmark datasets. Secondly, we provide a geometrical interpretation of the
activation function based on which we argue that the $L_p$ unit is more
efficient at representing complex, nonlinear separating boundaries. Each $L_p$
unit defines a superelliptic boundary, with its exact shape defined by the
order $p$. We claim that this makes it possible to model arbitrarily shaped,
curved boundaries more efficiently by combining a few $L_p$ units of different
orders. This insight justifies the need for learning different orders for each
unit in the model. We empirically evaluate the proposed $L_p$ units on a number
of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$
units achieve state-of-the-art results on a number of benchmark datasets.
Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep
recurrent neural networks (RNN).
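
The claim that a normalized $L_p$ norm generalizes several pooling operators can be checked numerically. The sketch below assumes the normalization is a mean over the inputs, i.e. $(\frac{1}{N}\sum_i |x_i|^p)^{1/p}$, and omits the learned projections that feed the unit; the function name is illustrative.

```python
import math

def normalized_lp(inputs, p):
    """Normalized L_p norm: ((1/N) * sum(|x_i|^p)) ** (1/p), for p >= 1."""
    n = len(inputs)
    return (sum(abs(x) ** p for x in inputs) / n) ** (1.0 / p)

mean_abs = normalized_lp([1.0, -3.0], p=1)        # average of absolute values
rms = normalized_lp([3.0, 4.0], p=2)              # root-mean-square
near_max = normalized_lp([1.0, 2.0, 3.0], p=200)  # approaches max(|x_i|)
```

With p = 1 the unit averages absolute values, with p = 2 it computes the root-mean-square, and as p grows the output approaches the largest absolute input, which corresponds to max pooling.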

A central challenge to many fields of science and engineering involves
minimizing non-convex error functions over continuous, high-dimensional spaces.
Gradient descent or quasi-Newton methods are almost ubiquitously used to
perform such minimizations, and it is often thought that a main source of
difficulty for these local methods to find the global minimum is the
proliferation of local minima with much higher error than the global minimum.
Here we argue, based on results from statistical physics, random matrix theory,
neural network theory, and empirical evidence, that a deeper and more profound
difficulty originates from the proliferation of saddle points, not local
minima, especially in high-dimensional problems of practical interest. Such
saddle points are surrounded by high error plateaus that can dramatically slow
down learning, and give the illusory impression of the existence of a local
minimum. Motivated by these arguments, we propose a new approach to
second-order optimization, the saddle-free Newton method, that can rapidly
escape high-dimensional saddle points, unlike gradient descent and quasi-Newton
methods. We apply this algorithm to deep or recurrent neural network training,
and provide numerical evidence for its superior optimization performance.
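
The difference between a Newton step and a saddle-free step can be seen on the simplest saddle, f(x, y) = x^2 - y^2. The sketch below assumes a diagonal Hessian, so rescaling by the absolute value of each curvature is a per-coordinate division; the general method works with the absolute values of the Hessian's eigenvalues, and the names here are illustrative.

```python
def f(x, y):
    """Toy saddle: minimum along x, maximum along y."""
    return x * x - y * y

def gradient(x, y):
    return (2.0 * x, -2.0 * y)

# The Hessian of f is constant and diagonal: diag(2, -2).
HESSIAN_DIAGONAL = (2.0, -2.0)

def newton_step(point):
    """Plain Newton update: divide each gradient component by its curvature."""
    g = gradient(*point)
    return tuple(p - gi / hi for p, gi, hi in zip(point, g, HESSIAN_DIAGONAL))

def saddle_free_step(point):
    """Saddle-free update: divide by the absolute value of the curvature."""
    g = gradient(*point)
    return tuple(p - gi / abs(hi) for p, gi, hi in zip(point, g, HESSIAN_DIAGONAL))

start = (1.0, 0.5)
after_newton = newton_step(start)
after_saddle_free = saddle_free_step(start)
```

The Newton step jumps straight onto the saddle point at the origin, whereas the saddle-free step still minimizes along x but moves away from the saddle along the negative-curvature direction y, lowering the objective.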

We study the complexity of functions computable by deep feedforward neural
networks with piecewise linear activations in terms of the symmetries and the
number of linear regions that they have. Deep networks are able to sequentially
map portions of each layer's input-space to the same output. In this way, deep
models compute functions that react equally to complicated patterns of
different inputs. The compositional structure of these functions enables them
to reuse pieces of computation exponentially often in terms of the network's
depth. This paper investigates the complexity of such compositional maps and
contributes new theoretical results regarding the advantage of depth for neural
networks with piecewise linear activation functions. In particular, our
analysis is not specific to a single family of models, and as an example, we
employ it for rectifier and maxout networks. We improve complexity bounds from
preexisting work and investigate the behavior of units in higher layers.
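
The exponential reuse of computation with depth can be illustrated with a folding map. The sketch below uses the tent map 1 - |2x - 1| on [0, 1], which is piecewise linear and maps both halves of the interval onto the same output range; composing it with itself doubles the number of linear pieces at each layer. The tent map and the grid-based counting are illustrative choices made for this example, not the construction analysed in the paper.

```python
def tent(x):
    """Piecewise linear fold of [0, 1] onto itself (two linear pieces)."""
    return 1.0 - abs(2.0 * x - 1.0)

def compose(x, depth):
    """Apply the tent map depth times, like a stack of identical layers."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth, grid=1024):
    """Count linear pieces on [0, 1] by detecting slope sign changes.

    The grid is a power of two so that every breakpoint falls on a grid
    point; valid here for depth <= 10.
    """
    ys = [compose(i / grid, depth) for i in range(grid + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(grid)]
    pieces = 1
    for left, right in zip(slopes, slopes[1:]):
        if (left > 0) != (right > 0):
            pieces += 1
    return pieces

pieces_by_depth = {d: count_linear_pieces(d) for d in (1, 3, 5)}
```

Each additional layer uses the same two-piece map, yet the number of linear regions of the composition grows as 2 to the power of the depth, whereas a single layer would need one unit per breakpoint to produce the same function.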