
We learn recurrent neural network optimizers trained on simple synthetic
functions by gradient descent. We show that these learned optimizers exhibit a
remarkable degree of transfer in that they can be used to efficiently optimize
a broad range of derivative-free black-box functions, including Gaussian
process bandits, simple control objectives, global optimization benchmarks and
hyperparameter tuning tasks. Up to the training horizon, the learned
optimizers learn to trade off exploration and exploitation, and compare
favourably with heavily engineered Bayesian optimization packages for
hyperparameter tuning.

Deep learning has led to significant advances in artificial intelligence, in
part, by adopting strategies motivated by neurophysiology. However, it is
unclear whether deep learning could occur in the real brain. Here, we show that
a deep learning algorithm that utilizes multicompartment neurons might help us
to understand how the brain optimizes cost functions. Like neocortical
pyramidal neurons, neurons in our model receive sensory information and
higher-order feedback in electrotonically segregated compartments. Thanks to
this segregation, the neurons in different layers of the network can coordinate
synaptic weight updates. As a result, the network can learn to categorize
images better than a single-layer network. Furthermore, we show that our
algorithm takes advantage of multilayer architectures to identify useful
representations, the hallmark of deep learning. This work demonstrates that
deep learning can be achieved using segregated dendritic compartments, which
may help to explain the dendritic morphology of neocortical pyramidal neurons.

We propose a conceptually simple and lightweight framework for deep
reinforcement learning that uses asynchronous gradient descent for optimization
of deep neural network controllers. We present asynchronous variants of four
standard reinforcement learning algorithms and show that parallel
actor-learners have a stabilizing effect on training, allowing all four methods
to successfully train neural network controllers. The best performing method,
an asynchronous variant of actor-critic, surpasses the current state-of-the-art
on the Atari domain while training for half the time on a single multicore CPU
instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds
on a wide variety of continuous motor control problems as well as on a new task
of navigating random 3D mazes using a visual input.
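
The asynchronous update pattern described above can be illustrated on a toy
problem. The following is a minimal sketch, not the reinforcement learning
algorithms themselves: several worker threads apply lock-free gradient steps
to one shared parameter vector. The quadratic objective, the names `optimum`
and `worker`, and the step size are all assumptions made for illustration.

```python
import threading

import numpy as np

# Hypothetical toy objective: f(theta) = 0.5 * ||theta - optimum||^2.
optimum = np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)  # shared parameters, updated in place by every worker


def worker(n_steps, lr):
    """Apply gradient steps to the shared parameters without any locking."""
    for _ in range(n_steps):
        # Gradient of the toy objective; it may be stale if another
        # worker updates theta between this read and the write below.
        grad = theta - optimum
        np.subtract(theta, lr * grad, out=theta)


threads = [threading.Thread(target=worker, args=(500, 0.05)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to optimum:", np.linalg.norm(theta - optimum))
```

In this sketch each worker reads possibly stale parameters, yet the shared
vector still converges because every step is small. In the setting of the
abstract, each worker would presumably compute its gradient from its own
copy of the environment rather than from a fixed objective.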

We adapt the ideas underlying the success of Deep Q-Learning to the
continuous action domain. We present an actor-critic, model-free algorithm
based on the deterministic policy gradient that can operate over continuous
action spaces. Using the same learning algorithm, network architecture and
hyperparameters, our algorithm robustly solves more than 20 simulated physics
tasks, including classic problems such as cartpole swing-up, dexterous
manipulation, legged locomotion and car driving. Our algorithm is able to find
policies whose performance is competitive with those found by a planning
algorithm with full access to the dynamics of the domain and its derivatives.
We further demonstrate that for many of the tasks the algorithm can learn
policies end-to-end: directly from raw pixel inputs.
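
The core of a deterministic policy gradient update can be sketched in
isolation. The example below is a minimal illustration under strong
assumptions, not the full algorithm: the critic is a fixed, known function
Q(s, a) = -(a - 2s)^2 rather than a learned network, the actor is the linear
policy mu(s) = theta * s, and the names `dq_da`, `states` and `theta` are
hypothetical. The actor parameter is moved along the average of
dQ/da * dmu/dtheta over a batch of states.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=256)  # fixed batch of 1-D states


def dq_da(s, a):
    """Action gradient of the assumed critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


theta = 0.0  # actor parameter for the policy mu(s) = theta * s
lr = 0.1
for _ in range(300):
    actions = theta * states
    # Chain rule: dQ/dtheta = dQ/da * dmu/dtheta, where dmu/dtheta = s.
    grad_theta = np.mean(dq_da(states, actions) * states)
    theta += lr * grad_theta  # gradient ascent on the critic's value

print("learned theta:", theta)
```

Under this assumed critic the best action is a = 2s, so theta should approach
2. In a model-free setting the critic would itself have to be learned from
experience, a step this sketch deliberately omits.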

The brain processes information through many layers of neurons. This deep
architecture is representationally powerful, but it complicates learning by
making it hard to identify the responsible neurons when a mistake is made. In
machine learning, the backpropagation algorithm assigns blame to a neuron by
computing exactly how it contributed to an error. To do this, it multiplies
error signals by matrices consisting of all the synaptic weights on the
neuron's axon and farther downstream. This operation requires a precisely
choreographed transport of synaptic weight information, which is thought to be
impossible in the brain. Here we present a surprisingly simple algorithm for
deep learning, which assigns blame by multiplying error signals by random
synaptic weights. We show that a network can learn to extract useful
information from signals sent through these random feedback connections. In
essence, the network learns to learn. We demonstrate that this new mechanism
performs as quickly and accurately as backpropagation on a variety of problems
and describe the principles which underlie its function. Our demonstration
provides a plausible basis for how a neuron can be adapted using error signals
generated at distal locations in the brain, and thus dispels long-held
assumptions about the algorithmic constraints on learning in neural circuits.
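
The idea of assigning blame through random feedback weights can be sketched
on a small network. The following is a minimal illustration under assumed
settings, not the authors' experimental setup: a two-layer tanh network is
trained on a hypothetical linear regression task, and the hidden-layer error
is computed with a fixed random matrix `B` where backpropagation would use
the transpose of the forward weights `W2`. Layer sizes, learning rate and
initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: regress onto a random linear map of the input.
n_in, n_hidden, n_out, n_samples = 10, 20, 5, 200
X = rng.standard_normal((n_in, n_samples))
Y_target = (0.5 * rng.standard_normal((n_out, n_in))) @ X

W1 = 0.1 * rng.standard_normal((n_hidden, n_in))  # forward weights, layer 1
W2 = np.zeros((n_out, n_hidden))                  # forward weights, layer 2
B = rng.uniform(-0.5, 0.5, (n_hidden, n_out))     # fixed random feedback


def mean_loss(w1, w2):
    """Mean over samples of half the squared output error."""
    err = w2 @ np.tanh(w1 @ X) - Y_target
    return 0.5 * np.mean(np.sum(err ** 2, axis=0))


initial_loss = mean_loss(W1, W2)
lr = 0.02
for _ in range(1000):
    H = np.tanh(W1 @ X)    # hidden activity
    E = W2 @ H - Y_target  # output error
    # Backpropagation would compute (W2.T @ E) here; the random
    # matrix B carries the error to the hidden layer instead.
    delta_hidden = (B @ E) * (1.0 - H ** 2)
    W2 -= lr * (E @ H.T) / n_samples
    W1 -= lr * (delta_hidden @ X.T) / n_samples
final_loss = mean_loss(W1, W2)

print("loss before:", initial_loss, "loss after:", final_loss)
```

Note that `B` is never updated. Any improvement in the hidden layer comes
from the forward weights adapting so that the random feedback becomes
useful, which is the sense in which the network "learns to learn".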