
To make AI systems broadly useful for challenging realworld tasks, we need
them to learn complex human goals and preferences. One approach to specifying
complex goals asks humans to judge during training which agent behaviors are
safe and useful, but this approach can fail if the task is too complicated for
a human to directly judge. To help address this concern, we propose training
agents via self play on a zero sum debate game. Given a question or proposed
action, two agents take turns making short statements up to a limit, then a
human judges which of the agents gave the most true, useful information. In an
analogy to complexity theory, debate with optimal play can answer any question
in PSPACE given polynomial time judges (direct judging answers only NP
questions). In practice, whether debate works involves empirical questions
about humans and the tasks we want AIs to perform, plus theoretical questions
about the meaning of AI alignment. We report results on an initial MNIST
experiment where agents compete to convince a sparse classifier, boosting the
classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to
85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of
the debate model, focusing on potential weaknesses as the model scales up, and
we propose future human and computer experiments to test these properties.

We give a protocol for producing certifiable randomness from a single
untrusted quantum device that is polynomialtime bounded. The randomness is
certified to be statistically close to uniform from the point of view of any
computationally unbounded quantum adversary, that may share entanglement with
the quantum device. The protocol relies on the existence of postquantum secure
trapdoor clawfree functions, and introduces a new primitive for constraining
the power of an untrusted quantum device. We then show how to construct this
primitive based on the hardness of the learning with errors (LWE) problem.
The randomness protocol can also be used as the basis for an efficiently
verifiable quantum supremacy proposal, thus answering an outstanding challenge
in the field.

For sophisticated reinforcement learning (RL) systems to interact usefully
with realworld environments, we need to communicate complex goals to these
systems. In this work, we explore goals defined in terms of (nonexpert) human
preferences between pairs of trajectory segments. We show that this approach
can effectively solve complex RL tasks without access to the reward function,
including Atari games and simulated robot locomotion, while providing feedback
on less than one percent of our agent's interactions with the environment. This
reduces the cost of human oversight far enough that it can be practically
applied to stateoftheart RL systems. To demonstrate the flexibility of our
approach, we show that we can successfully train complex novel behaviors with
about an hour of human time. These behaviors and environments are considerably
more complex than any that have been previously learned from human feedback.

Generative adversarial networks (GANs) are a recently proposed class of
generative models in which a generator is trained to optimize a cost function
that is being simultaneously learned by a discriminator. While the idea of
learning cost functions is relatively new to the field of generative modeling,
learning costs has long been studied in control and reinforcement learning (RL)
domains, typically for imitation learning from demonstrations. In these fields,
learning cost function underlying observed behavior is known as inverse
reinforcement learning (IRL) or inverse optimal control. While at first the
connection between cost learning in RL and cost learning in generative modeling
may appear to be a superficial one, we show in this paper that certain IRL
methods are in fact mathematically equivalent to GANs. In particular, we
demonstrate an equivalence between a samplebased algorithm for maximum entropy
IRL and a GAN in which the generator's density can be evaluated and is provided
as an additional input to the discriminator. Interestingly, maximum entropy IRL
is a special case of an energybased model. We discuss the interpretation of
GANs as an algorithm for training energybased models, and relate this
interpretation to other recent work that seeks to connect GANs and EBMs. By
formally highlighting the connection between GANs, IRL, and EBMs, we hope that
researchers in all three communities can better identify and apply transferable
ideas from one domain to another, particularly for developing more stable and
scalable algorithms: a major challenge in all three domains.

Developing control policies in simulation is often more practical and safer
than directly running experiments in the real world. This applies to policies
obtained from planning and optimization, and even more so to policies obtained
from reinforcement learning, which is often very data demanding. However, a
policy that succeeds in simulation often doesn't work when deployed on a real
robot. Nevertheless, often the overall gist of what the policy does in
simulation remains valid in the real world. In this paper we investigate such
settings, where the sequence of states traversed in simulation remains
reasonable for the real world, even if the details of the controls are not, as
could be the case when the key differences lie in detailed friction, contact,
mass and geometry properties. During execution, at each time step our approach
computes what the simulationbased control policy would do, but then, rather
than executing these controls on the real robot, our approach computes what the
simulation expects the resulting next state(s) will be, and then relies on a
learned deep inverse dynamics model to decide which realworld action is most
suitable to achieve those next states. Deep models are only as good as their
training data, and we also propose an approach for data collection to
(incrementally) learn the deep inverse dynamics model. Our experiments shows
our approach compares favorably with various baselines that have been developed
for dealing with simulation to real world model discrepancy, including output
error control and Gaussian dynamics adaptation.

Rapid progress in machine learning and artificial intelligence (AI) has
brought increasing attention to the potential impacts of AI technologies on
society. In this paper we discuss one such potential impact: the problem of
accidents in machine learning systems, defined as unintended and harmful
behavior that may emerge from poor design of realworld AI systems. We present
a list of five practical research problems related to accident risk,
categorized according to whether the problem originates from having the wrong
objective function ("avoiding side effects" and "avoiding reward hacking"), an
objective function that is too expensive to evaluate frequently ("scalable
supervision"), or undesirable behavior during the learning process ("safe
exploration" and "distributional shift"). We review previous work in these
areas as well as suggesting research directions with a focus on relevance to
cuttingedge AI systems. Finally, we consider the highlevel question of how to
think most productively about the safety of forwardlooking applications of AI.

Theano is a Python library that allows to define, optimize, and evaluate
mathematical expressions involving multidimensional arrays efficiently. Since
its introduction, it has been one of the most used CPU and GPU mathematical
compilers  especially in the machine learning community  and has shown steady
performance improvements. Theano is being actively and continuously developed
since 2008, multiple frameworks have been built on top of it and it has been
used to produce many stateoftheart machine learning models.
The present article is structured as follows. Section I provides an overview
of the Theano software and its community. Section II presents the principal
features of Theano and how to use them, and compares them with other similar
projects. Section III focuses on recentlyintroduced functionalities and
improvements. Section IV compares the performance of Theano against Torch7 and
TensorFlow on several machine learning models. Section V discusses current
limitations of Theano and potential ways of improving it.

Many practical learning systems aggregate data across many users, while
learning theory traditionally considers a single learner who trusts all of
their observations. A case in point is the foundational learning problem of
prediction with expert advice. To date, there has been no theoretical study of
the general collaborative version of prediction with expert advice, in which
many users face a similar problem and would like to share their experiences in
order to learn faster. A key issue in this collaborative framework is
robustness: generally algorithms that aggregate data are vulnerable to
manipulation by even a small number of dishonest users.
We exhibit the first robust collaborative algorithm for prediction with
expert advice. When all users are honest and have similar tastes our algorithm
matches the performance of pooling data and using a traditional algorithm. But
our algorithm also guarantees that adding users never significantly degrades
performance, even if the additional users behave adversarially. We achieve
strong guarantees even when the overwhelming majority of users behave
adversarially. As a special case, our algorithm is extremely robust to
variation amongst the users.

We consider a community of users who must make periodic decisions about
whether to interact with one another. We propose a protocol which allows honest
users to reliably interact with each other, while limiting the damage done by
each malicious or incompetent user. The worstcase cost per user is sublinear
in the average number of interactions per user and is independent of the number
of users. Our guarantee holds simultaneously for every group of honest users.
For example, multiple groups of users with incompatible tastes or preferences
can coexist.
As a motivating example, we consider a game where players have periodic
opportunities to do one another favors but minimal ability to determine when a
favor was done. In this setting, our protocol achieves nearly optimal
collective welfare while remaining resistant to exploitation.
Our results also apply to a collaborative filtering setting where users must
make periodic decisions about whether to interact with resources such as movies
or restaurants. In this setting, we guarantee that any set of honest users
achieves a payoff nearly as good as if they had identified the optimal set of
items in advance and then chosen to interact only with resources from that set.

In many online learning problems we are interested in predicting local
information about some universe of items. For example, we may want to know
whether two items are in the same cluster rather than computing an assignment
of items to clusters; we may want to know which of two teams will win a game
rather than computing a ranking of teams. Although finding the optimal
clustering or ranking is typically intractable, it may be possible to predict
the relationships between items as well as if you could solve the global
optimization problem exactly.
Formally, we consider an online learning problem in which a learner
repeatedly guesses a pair of labels (l(x), l(y)) and receives an adversarial
payoff depending on those labels. The learner's goal is to receive a payoff
nearly as good as the best fixed labeling of the items. We show that a simple
algorithm based on semidefinite programming can obtain asymptotically optimal
regret in the case where the number of possible labels is O(1), resolving an
open problem posed by Hazan, Kale, and ShalevSchwartz. Our main technical
contribution is a novel use and analysis of the log determinant regularizer,
exploiting the observation that log det(A + I) upper bounds the entropy of any
distribution with covariance matrix A.

We consider the oneshot Prisoner's Dilemma between algorithms with
readaccess to one anothers' source codes, and we use the modal logic of
provability to build agents that can achieve mutual cooperation in a manner
that is robust, in that cooperation does not require exact equality of the
agents' source code, and unexploitable, meaning that such an agent never
cooperates when its opponent defects. We construct a general framework for such
"modal agents", and study their properties.

Forty years ago, Wiesner pointed out that quantum mechanics raises the
striking possibility of money that cannot be counterfeited according to the
laws of physics. We propose the first quantum money scheme that is (1)
publickey, meaning that anyone can verify a banknote as genuine, not only the
bank that printed it, and (2) cryptographically secure, under a "classical"
hardness assumption that has nothing to do with quantum money. Our scheme is
based on hidden subspaces, encoded as the zerosets of random multivariate
polynomials. A main technical advance is to show that the "blackbox" version
of our scheme, where the polynomials are replaced by classical oracles, is
unconditionally secure. Previously, such a result had only been known relative
to a quantum oracle (and even there, the proof was never published). Even in
Wiesner's original setting  quantum money that can only be verified by the
bank  we are able to use our techniques to patch a major security hole in
Wiesner's scheme. We give the first privatekey quantum money scheme that
allows unlimited verifications and that remains unconditionally secure, even if
the counterfeiter can interact adaptively with the bank. Our money scheme is
simpler than previous publickey quantum money schemes, including a knotbased
scheme of Farhi et al. The verifier needs to perform only two tests, one in the
standard basis and one in the Hadamard basis  matching the original intuition
for quantum money, based on the existence of complementary observables. Our
security proofs use a new variant of Ambainis's quantum adversary method, and
several other tools that might be of independent interest.

We introduce a new approach to computing an approximately maximum st flow in
a capacitated, undirected graph. This flow is computed by solving a sequence of
electrical flow problems. Each electrical flow is given by the solution of a
system of linear equations in a Laplacian matrix, and thus may be approximately
computed in nearlylinear time.
Using this approach, we develop the fastest known algorithm for computing
approximately maximum st flows. For a graph having n vertices and m edges, our
algorithm computes a (1\epsilon)approximately maximum st flow in time
\tilde{O}(mn^{1/3} \epsilon^{11/3}). A dual version of our approach computes a
(1+\epsilon)approximately minimum st cut in time
\tilde{O}(m+n^{4/3}\eps^{8/3}), which is the fastest known algorithm for this
problem as well. Previously, the best dependence on m and n was achieved by the
algorithm of Goldberg and Rao (J. ACM 1998), which can be used to compute
approximately maximum st flows in time \tilde{O}(m\sqrt{n}\epsilon^{1}), and
approximately minimum st cuts in time \tilde{O}(m+n^{3/2}\epsilon^{3}).