
Policy gradient methods are an appealing approach in reinforcement learning
because they directly optimize the cumulative reward and can straightforwardly
be used with nonlinear function approximators such as neural networks. The two
main challenges are the large number of samples typically required, and the
difficulty of obtaining stable and steady improvement despite the
nonstationarity of the incoming data. We address the first challenge by using
value functions to substantially reduce the variance of policy gradient
estimates at the cost of some bias, with an exponentially-weighted estimator of
the advantage function that is analogous to TD(lambda). We address the second
challenge by using a trust region optimization procedure for both the policy and
the value function, which are represented by neural networks.
Our approach yields strong empirical results on highly challenging 3D
locomotion tasks, learning running gaits for bipedal and quadrupedal simulated
robots, and learning a policy for getting the biped to stand up from starting
out lying on the ground. In contrast to a body of prior work that uses
handcrafted policy representations, our neural network policies map directly
from raw kinematics to joint torques. Our algorithm is fully model-free, and
the amount of simulated experience required for the learning tasks on 3D bipeds
corresponds to 1-2 weeks of real time.
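
The exponentially-weighted advantage estimator described above can be sketched as follows. This is a minimal illustration that assumes the usual form of the estimator: a discounted sum of one-step TD residuals with decay `gamma * lam`. The function name, argument layout, and default values are illustrative assumptions, not taken from the abstract.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially-weighted advantage estimates for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    value estimate of the state reached after the final reward
    (use 0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the accumulated sum of later residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces this to the one-step TD residual (lower variance, more bias), while `lam=1` recovers the discounted return minus the value baseline (higher variance, less bias).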

Two of the leading approaches for model-free reinforcement learning are
policy gradient methods and $Q$-learning methods. $Q$-learning methods can be
effective and sample-efficient when they work; however, it is not
well understood why they work, since empirically, the $Q$-values they estimate
are very inaccurate. A partial explanation may be that $Q$-learning methods are
secretly implementing policy gradient updates: we show that there is a precise
equivalence between $Q$-learning and policy gradient methods in the setting of
entropy-regularized reinforcement learning, namely that "soft" (entropy-regularized)
$Q$-learning is exactly equivalent to a policy gradient method. We also point
out a connection between $Q$-learning methods and natural policy gradient
methods. Experimentally, we explore the entropy-regularized versions of
$Q$-learning and policy gradients, and we find them to perform as well as (or
slightly better than) the standard variants on the Atari benchmark. We also
show that the equivalence holds in practical settings by constructing a
$Q$-learning method that closely matches the learning dynamics of A3C without
using a target network or $\epsilon$-greedy exploration schedule.
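
To make the link between "soft" $Q$-values and a stochastic policy concrete, here is a small sketch. It assumes the simplest entropy-bonus setting with a temperature `tau`, in which the soft value is a log-sum-exp of the $Q$-values and the policy is the corresponding Boltzmann distribution. The function names and the choice of `tau` are illustrative assumptions; the paper's exact regularizer may differ from this form.

```python
import math

def soft_value(q_values, tau=1.0):
    """Soft state value: tau * log(sum_a exp(Q(a) / tau)), computed stably."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def boltzmann_policy(q_values, tau=1.0):
    """Action probabilities exp((Q(a) - V) / tau), with V the soft value."""
    v = soft_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]
```

Because the policy is a deterministic function of the $Q$-values in this setting, an update to the $Q$-values is also an update to the policy, which is the intuition behind the equivalence.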

Dexterous multi-fingered hands are extremely versatile and provide a generic
way to perform a multitude of tasks in human-centric environments. However,
effectively controlling them remains challenging due to their high
dimensionality and large number of potential contacts. Deep reinforcement
learning (DRL) provides a model-agnostic approach to control complex dynamical
systems, but has not been shown to scale to high-dimensional dexterous
manipulation. Furthermore, deployment of DRL on physical systems remains
challenging due to sample inefficiency. Consequently, the success of DRL in
robotics has thus far been limited to simpler manipulators and tasks. In this
work, we show that model-free DRL can effectively scale up to complex
manipulation tasks with a high-dimensional 24-DoF hand, and solve them from
scratch in simulated experiments. Furthermore, with the use of a small number
of human demonstrations, the sample complexity can be significantly reduced,
which enables learning with sample sizes equivalent to a few hours of robot
experience. The use of demonstrations results in policies that exhibit very
natural movements and, surprisingly, are also substantially more robust.

In this report, we present a new reinforcement learning (RL) benchmark based
on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended
to measure the performance of transfer learning and few-shot learning
algorithms in the RL domain. We also present and evaluate some baseline
algorithms on the new benchmark.

This paper considers meta-learning problems, where there is a distribution of
tasks, and we would like to obtain an agent that performs well (i.e., learns
quickly) when presented with a previously unseen task sampled from this
distribution. We analyze a family of algorithms for learning a parameter
initialization that can be fine-tuned quickly on a new task, using only
first-order derivatives for the meta-learning updates. This family includes and
generalizes first-order MAML, an approximation to MAML obtained by ignoring
second-order derivatives. It also includes Reptile, a new algorithm that we
introduce here, which works by repeatedly sampling a task, training on it, and
moving the initialization towards the trained weights on that task. We expand
on the results from Finn et al. showing that first-order meta-learning
algorithms perform well on some well-established benchmarks for few-shot
classification, and we provide theoretical analysis aimed at understanding why
these algorithms work.
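
The Reptile loop described above (sample a task, train on it, move the initialization towards the trained weights) can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the tasks are one-dimensional quadratic losses `(phi - target) ** 2`, the inner loop is plain gradient descent, and the names and step sizes are hypothetical.

```python
import random

def inner_sgd(theta, target, steps=5, lr=0.1):
    """Train on one toy task, starting from the shared initialization."""
    phi = theta
    for _ in range(steps):
        grad = 2.0 * (phi - target)  # gradient of (phi - target) ** 2
        phi -= lr * grad
    return phi

def reptile(task_targets, meta_steps=200, meta_lr=0.1, seed=0):
    """Repeatedly sample a task, train on it, and move theta towards the result."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(meta_steps):
        target = rng.choice(task_targets)    # sample a task
        phi = inner_sgd(theta, target)       # train on it
        theta += meta_lr * (phi - theta)     # move the initialization
    return theta
```

Note that the meta-update uses only the difference between trained and initial weights, so no second-order derivatives are needed.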

We propose a new family of policy gradient methods for reinforcement
learning, which alternate between sampling data through interaction with the
environment, and optimizing a "surrogate" objective function using stochastic
gradient ascent. Whereas standard policy gradient methods perform one gradient
update per data sample, we propose a novel objective function that enables
multiple epochs of minibatch updates. The new methods, which we call proximal
policy optimization (PPO), have some of the benefits of trust region policy
optimization (TRPO), but they are much simpler to implement, more general, and
have better sample complexity (empirically). Our experiments test PPO on a
collection of benchmark tasks, including simulated robotic locomotion and Atari
game playing, and we show that PPO outperforms other online policy gradient
methods, and overall strikes a favorable balance between sample complexity,
simplicity, and wall-time.
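
The abstract does not spell out the surrogate objective. One form widely associated with PPO is the clipped probability-ratio objective, sketched below under that assumption. Here `ratios` stands for the new-to-old policy probability ratios of the sampled actions, `advantages` for their advantage estimates, and `clip_eps` is an illustrative clipping range; all three names are hypothetical.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a minibatch."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        # Taking the minimum removes the incentive to push the ratio far from 1.
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

Because the objective stops rewarding large ratio changes, the same batch of samples can be reused for several epochs of minibatch gradient ascent.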

We propose Teacher-Student Curriculum Learning (TSCL), a framework for
automatic curriculum learning, where the Student tries to learn a complex task
and the Teacher automatically chooses subtasks from a given set for the Student
to train on. We describe a family of Teacher algorithms that rely on the
intuition that the Student should practice more those tasks on which it makes
the fastest progress, i.e. where the slope of the learning curve is highest. In
addition, the Teacher algorithms address the problem of forgetting by also
choosing tasks where the Student's performance is getting worse. We demonstrate
that TSCL matches or surpasses the results of carefully handcrafted curricula
in two tasks: addition of decimal numbers with LSTM and navigation in
Minecraft. Our automatically generated curriculum made it possible to solve a
Minecraft maze that could not be solved at all when training directly on
solving the maze, and learning was an order of magnitude faster than with
uniform sampling of subtasks.
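
The Teacher's selection rule can be illustrated with a deliberately simplified sketch. It assumes that the slope of the learning curve is estimated from the last two scores of each subtask and that the Teacher greedily picks the subtask with the largest absolute slope, so that both fast progress and forgetting attract practice. The function names and task labels are hypothetical, and the Teacher algorithms in the paper are likely more elaborate than this.

```python
def learning_slope(scores):
    """Crude slope estimate: difference between the two most recent scores."""
    if len(scores) < 2:
        return float("inf")  # not enough data yet, so try this task first
    return scores[-1] - scores[-2]

def choose_task(histories):
    """Pick the subtask whose score is changing fastest in either direction."""
    return max(histories, key=lambda task: abs(learning_slope(histories[task])))
```

For example, with `{"easy": [0.90, 0.91], "medium": [0.2, 0.5], "hard": [0.6, 0.2]}` the rule selects `"hard"`, because its performance is dropping fastest.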

We show how an ensemble of $Q^*$-functions can be leveraged for more
effective exploration in deep reinforcement learning. We build on
well-established algorithms from the bandit setting, and adapt them to the
$Q$-learning setting. First we propose an exploration strategy based on
upper-confidence bounds (UCB). Next, we define an "InfoGain" exploration bonus,
which depends on the disagreement of the $Q$-ensemble. Our experiments show
significant gains on the Atari benchmark.
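
A UCB-style rule over a $Q$-ensemble can be sketched as follows. The sketch assumes that the upper-confidence bound for an action is the ensemble mean plus a multiple of the ensemble standard deviation; `q_ensemble` (one list of action values per ensemble member, for a fixed state) and `bonus_scale` are hypothetical names introduced for illustration.

```python
from statistics import mean, pstdev

def ucb_action(q_ensemble, bonus_scale=1.0):
    """Return the action maximizing mean + bonus_scale * std across the ensemble."""
    n_actions = len(q_ensemble[0])

    def score(action):
        estimates = [member[action] for member in q_ensemble]
        return mean(estimates) + bonus_scale * pstdev(estimates)

    return max(range(n_actions), key=score)
```

With `bonus_scale=0` the rule reduces to acting greedily on the ensemble mean; larger values favor actions on which the ensemble members disagree.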

We describe an iterative procedure for optimizing policies, with guaranteed
monotonic improvement. By making several approximations to the
theoretically-justified procedure, we develop a practical algorithm, called
Trust Region Policy Optimization (TRPO). This algorithm is similar to natural
policy gradient methods and is effective for optimizing large nonlinear
policies such as neural networks. Our experiments demonstrate its robust
performance on a wide variety of tasks: learning simulated robotic swimming,
hopping, and walking gaits; and playing Atari games using images of the screen
as input. Despite its approximations that deviate from the theory, TRPO tends
to give monotonic improvement, with little tuning of hyperparameters.

Representation learning seeks to expose certain aspects of observed data in a
learned representation that's amenable to downstream tasks like classification.
For instance, a good representation for 2D images might be one that describes
only global structure and discards information about detailed texture. In this
paper, we present a simple but principled method to learn such global
representations by combining a Variational Autoencoder (VAE) with neural
autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE
model allows us to have control over what the global latent code can learn,
and by designing the architecture accordingly, we can force the global latent
code to discard irrelevant information such as texture in 2D images, and hence
the VAE only "autoencodes" data in a lossy fashion. In addition, by leveraging
autoregressive models as both prior distribution $p(z)$ and decoding
distribution $p(x|z)$, we can greatly improve generative modeling performance
of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and
Caltech-101 Silhouettes density estimation tasks.

Scalable and effective exploration remains a key challenge in reinforcement
learning (RL). While there are methods with optimality guarantees in the
setting of discrete state and action spaces, these methods cannot be applied in
high-dimensional deep RL scenarios. As such, most contemporary RL relies on
simple heuristics such as epsilon-greedy exploration or adding Gaussian noise
to the controls. This paper introduces Variational Information Maximizing
Exploration (VIME), an exploration strategy based on maximization of
information gain about the agent's belief of environment dynamics. We propose a
practical implementation, using variational inference in Bayesian neural
networks which efficiently handles continuous state and action spaces. VIME
modifies the MDP reward function, and can be applied with several different
underlying RL algorithms. We demonstrate that VIME achieves significantly
better performance compared to heuristic exploration methods across a variety
of continuous control tasks and algorithms, including tasks with very sparse
rewards.

Count-based exploration algorithms are known to perform near-optimally when
used in conjunction with tabular reinforcement learning (RL) methods for
solving small discrete Markov decision processes (MDPs). It is generally
thought that count-based methods cannot be applied in high-dimensional state
spaces, since most states will only occur once. Recent deep RL exploration
strategies are able to deal with high-dimensional continuous state spaces
through complex heuristics, often relying on optimism in the face of
uncertainty or intrinsic motivation. In this work, we describe a surprising
finding: a simple generalization of the classic count-based approach can reach
near state-of-the-art performance on various high-dimensional and/or continuous
deep RL benchmarks. States are mapped to hash codes, which allows their
occurrences to be counted with a hash table. These counts are then used to compute a
reward bonus according to the classic count-based exploration theory. We find
that simple hash functions can achieve surprisingly good results on many
challenging tasks. Furthermore, we show that a domain-dependent learned hash
code may further improve these results. Detailed analysis reveals important
aspects of a good hash function: 1) having appropriate granularity and 2)
encoding information relevant to solving the MDP. This exploration strategy
achieves near state-of-the-art performance on both continuous control tasks and
Atari 2600 games, hence providing a simple yet powerful baseline for solving
MDPs that require considerable exploration.
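
The hashing-and-counting idea can be sketched in a few lines. This sketch assumes one simple choice of hash function, the sign pattern of a fixed random projection of the state, and a bonus of the form `beta / sqrt(count)`. The class name, `n_bits`, and `beta` are hypothetical; the number of bits controls the granularity of the hash.

```python
import math
import random
from collections import Counter

class HashCountBonus:
    """Count hashed states and return a bonus that shrinks with the count."""

    def __init__(self, state_dim, n_bits=8, beta=0.1, seed=0):
        rng = random.Random(seed)
        # Fixed random projection; more bits give a finer-grained hash.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(n_bits)
        ]
        self.beta = beta
        self.counts = Counter()

    def hash_code(self, state):
        """Sign pattern of the projected state, used as the hash-table key."""
        return tuple(
            sum(w * x for w, x in zip(row, state)) >= 0.0 for row in self.projection
        )

    def bonus(self, state):
        """Record one visit to the state's hash bucket and return its bonus."""
        code = self.hash_code(state)
        self.counts[code] += 1
        return self.beta / math.sqrt(self.counts[code])
```

The returned bonus would be added to the environment reward, so that rarely visited hash buckets look more attractive to the underlying RL algorithm.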

Deep reinforcement learning (deep RL) has been successful in learning
sophisticated behaviors automatically; however, the learning process requires a
huge number of trials. In contrast, animals can learn new tasks in just a few
trials, benefiting from their prior knowledge about the world. This paper seeks
to bridge this gap. Rather than designing a "fast" reinforcement learning
algorithm, we propose to represent it as a recurrent neural network (RNN) and
learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in
the weights of the RNN, which are learned slowly through a general-purpose
("slow") RL algorithm. The RNN receives all information a typical RL algorithm
would receive, including observations, actions, rewards, and termination flags;
and it retains its state across episodes in a given Markov Decision Process
(MDP). The activations of the RNN store the state of the "fast" RL algorithm on
the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both
small-scale and large-scale problems. On the small-scale side, we train it to
solve randomly generated multi-armed bandit problems and finite MDPs. After
RL$^2$ is trained, its performance on new MDPs is close to that of human-designed
algorithms with optimality guarantees. On the large-scale side, we test RL$^2$
on a vision-based navigation task and show that it scales up to
high-dimensional problems.

Rapid progress in machine learning and artificial intelligence (AI) has
brought increasing attention to the potential impacts of AI technologies on
society. In this paper we discuss one such potential impact: the problem of
accidents in machine learning systems, defined as unintended and harmful
behavior that may emerge from poor design of real-world AI systems. We present
a list of five practical research problems related to accident risk,
categorized according to whether the problem originates from having the wrong
objective function ("avoiding side effects" and "avoiding reward hacking"), an
objective function that is too expensive to evaluate frequently ("scalable
supervision"), or undesirable behavior during the learning process ("safe
exploration" and "distributional shift"). We review previous work in these
areas and suggest research directions with a focus on relevance to
cutting-edge AI systems. Finally, we consider the high-level question of how to
think most productively about the safety of forward-looking applications of AI.

This paper describes InfoGAN, an information-theoretic extension to the
Generative Adversarial Network that is able to learn disentangled
representations in a completely unsupervised manner. InfoGAN is a generative
adversarial network that also maximizes the mutual information between a small
subset of the latent variables and the observation. We derive a lower bound to
the mutual information objective that can be optimized efficiently, and show
that our training procedure can be interpreted as a variation of the Wake-Sleep
algorithm. Specifically, InfoGAN successfully disentangles writing styles from
digit shapes on the MNIST dataset, pose from lighting of 3D rendered images,
and background digits from the central digit on the SVHN dataset. It also
discovers visual concepts that include hair styles, presence/absence of
eyeglasses, and emotions on the CelebA face dataset. Experiments show that
InfoGAN learns interpretable representations that are competitive with
representations learned by existing fully supervised methods.

OpenAI Gym is a toolkit for reinforcement learning research. It includes a
growing collection of benchmark problems that expose a common interface, and a
website where people can share their results and compare the performance of
algorithms. This whitepaper discusses the components of OpenAI Gym and the
design decisions that went into the software.

Recently, researchers have made significant progress combining the advances
in deep learning for learning feature representations with reinforcement
learning. Some notable examples include training agents to play Atari games
based on raw pixel data and to acquire advanced manipulation skills using raw
sensory inputs. However, it has been difficult to quantify progress in the
domain of continuous control due to the lack of a commonly adopted benchmark.
In this work, we present a benchmark suite of continuous control tasks,
including classic tasks like cart-pole swing-up, tasks with very high state and
action dimensionality such as 3D humanoid locomotion, tasks with partial
observations, and tasks with hierarchical structure. We report novel findings
based on the systematic evaluation of a range of implemented reinforcement
learning algorithms. Both the benchmark and reference implementations are
released at https://github.com/rllab/rllab in order to facilitate experimental
reproducibility and to encourage adoption by other researchers.

Two results regarding K\"ahler supermanifolds with potential
$K=A+C\theta\bar\theta$ are shown. First, if the supermanifold is
K\"ahler-Einstein, then its base (the supermanifold of one lower fermionic
dimension and with K\"ahler potential $A$) has constant scalar curvature. As a
corollary, every constant scalar curvature K\"ahler supermanifold has a unique
superextension to a K\"ahler-Einstein supermanifold of one higher fermionic
dimension. Second, if the supermanifold is itself scalar flat, then its base
satisfies the equation $$ \phi^{\bar ji}\phi_{i\bar j}=2\Delta_0 S_0 +
R_0^{\bar ji}R_{0i\bar j} - S_0^2, $$ where $\Delta_0$ is the Laplace operator,
$S_0$ is the scalar curvature, and $R_{0i\bar j}$ is the Ricci tensor of the
base, and $\phi$ is some harmonic section on the base. Remarkably, precisely
this equation arises in the construction of certain supergravity
compactifications. Examples of bosonic manifolds satisfying the equation above
are discussed.

Theano is a Python library that allows users to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently. Since
its introduction, it has been one of the most used CPU and GPU mathematical
compilers, especially in the machine learning community, and has shown steady
performance improvements. Theano has been actively and continuously developed
since 2008; multiple frameworks have been built on top of it, and it has been
used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview
of the Theano software and its community. Section II presents the principal
features of Theano and how to use them, and compares them with other similar
projects. Section III focuses on recently-introduced functionalities and
improvements. Section IV compares the performance of Theano against Torch7 and
TensorFlow on several machine learning models. Section V discusses current
limitations of Theano and potential ways of improving it.

In a variety of problems originating in supervised, unsupervised, and
reinforcement learning, the loss function is defined by an expectation over a
collection of random variables, which might be part of a probabilistic model or
the external world. Estimating the gradient of this loss function, using
samples, lies at the core of gradient-based learning algorithms for these
problems. We introduce the formalism of stochastic computation graphs
(directed acyclic graphs that include both deterministic functions and
conditional probability distributions) and describe how to easily and
automatically derive an unbiased estimator of the loss function's gradient. The
resulting algorithm for computing the gradient estimator is a simple
modification of the standard backpropagation algorithm. The generic scheme we
propose unifies estimators derived in a variety of prior work, along with
variance-reduction techniques therein. It could assist researchers in
developing intricate models involving a combination of stochastic and
deterministic operations, enabling, for example, attention, memory, and control
actions.
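
As a small worked instance of estimating the gradient of an expectation from samples, the sketch below uses the score function (likelihood ratio) estimator for a single stochastic node. The setup is an assumption chosen for illustration and is not the general algorithm of the paper: $x \sim \mathcal{N}(\theta, 1)$, the loss is $E[f(x)]$, and the estimator averages $f(x)\,\nabla_\theta \log p(x;\theta) = f(x)(x-\theta)$. The function name and sample count are hypothetical.

```python
import random

def score_function_gradient(f, theta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dtheta E[f(x)] for x ~ Normal(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        # d/dtheta log N(x; theta, 1) = x - theta, so f never needs differentiating.
        total += f(x) * (x - theta)
    return total / n_samples
```

For $f(x) = x^2$ the expectation is $\theta^2 + 1$, so the estimate should be close to $2\theta$; the estimator is unbiased but can have high variance, which is why variance-reduction techniques matter in practice.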