
The key challenge in multiagent learning is learning a best response to the
behaviour of other agents, which may be nonstationary: if the other agents
adapt their strategy as well, the learning target moves. Disparate streams of
research have approached nonstationarity from several angles, which make a
variety of implicit assumptions that make it hard to keep an overview of the
state of the art and to validate the innovation and significance of new works.
This survey presents a coherent overview of work that addresses
opponentinduced nonstationarity with tools from game theory, reinforcement
learning and multiarmed bandits. Further, we reflect on the principle
approaches how algorithms model and cope with this nonstationarity, arriving
at a new framework and five categories (in increasing order of sophistication):
ignore, forget, respond to target models, learn models, and theory of mind. A
wide range of stateoftheart algorithms is classified into a taxonomy, using
these categories and key characteristics of the environment (e.g.,
observability) and adaptation behaviour of the opponents (e.g., smooth,
abrupt). To clarify even further we present illustrative variations of one
domain, contrasting the strengths and limitations of each category. Finally, we
discuss in which environments the different approaches yield most merit, and
point to promising avenues of future research.

We propose a fully distributed actorcritic algorithm approximated by deep
neural networks, named \textit{DiffDAC}, with application to singletask and
to average multitask reinforcement learning (MRL). Each agent has access to
data from its local task only, but it aims to learn a policy that performs well
on average for the whole set of tasks. During the learning process, agents
communicate their valuepolicy parameters to their neighbors, diffusing the
information across the network, so that they converge to a common policy, with
no need for a central node. The method is scalable, since the computational and
communication costs per agent grow with its number of neighbors. We derive
DiffDAC's from duality theory and provide novel insights into the standard
actorcritic framework, showing that it is actually an instance of the dual
ascent method that approximates the solution of a linear program. Experiments
suggest that DiffDAC can outperform the single previous distributed MRL
approach (i.e., DistMTLPS) and even the centralized architecture.

Although multiagent reinforcement learning can tackle systems of
strategically interacting entities, it currently fails in scalability and lacks
rigorous convergence guarantees. Crucially, learning in multiagent systems can
become intractable due to the explosion in the size of the stateaction space
as the number of agents increases. In this paper, we propose a method for
computing closedloop optimal policies in multiagent systems that scales
independently of the number of agents. This allows us to show, for the first
time, successful convergence to optimal behaviour in systems with an unbounded
number of interacting adaptive learners. Studying the asymptotic regime of
Nplayer stochastic games, we devise a learning protocol that is guaranteed to
converge to equilibrium policies even when the number of agents is extremely
large. Our method is modelfree and completely decentralised so that each agent
need only observe its local state information and its realised rewards. We
validate these theoretical results by showing convergence to Nashequilibrium
policies in applications from economics and control theory with thousands of
strategically interacting agents.

We present a polynomialtime algorithm that always finds an (approximate)
Nash equilibrium for repeated twoplayer stochastic games. The algorithm
exploits the folk theorem to derive a strategy profile that forms an
equilibrium by buttressing mutually beneficial behavior with threats, where
possible. One component of our algorithm efficiently searches for an
approximation of the egalitarian point, the fairest paretoefficient solution.
The paper concludes by applying the algorithm to a set of grid games to
illustrate typical solutions the algorithm finds. These solutions compare very
favorably to those found by competing algorithms, resulting in strategies with
higher social welfare, as well as guaranteed computational efficiency.

Game theory's prescriptive power typically relies on full rationality and/or
selfplay interactions. In contrast, this work sets aside these fundamental
premises and focuses instead on heterogeneous autonomous interactions between
two or more agents. Specifically, we introduce a new and concise representation
for repeated adversarial (constantsum) games that highlight the necessary
features that enable an automated planing agent to reason about how to score
above the game's Nash equilibrium, when facing heterogeneous adversaries. To
this end, we present TeamUP, a modelbased RL algorithm designed for learning
and planning such an abstraction. In essence, it is somewhat similar to Rmax
with a cleverly engineered reward shaping that treats exploration as an
adversarial optimization problem. In practice, it attempts to find an ally with
which to tacitly collude (in more than twoplayer games) and then collaborates
on a joint plan of actions that can consistently score a high utility in
adversarial repeated games. We use the inaugural Lemonade Stand Game Tournament
to demonstrate the effectiveness of our approach, and find that TeamUP is the
best performing agent, demoting the Tournament's actual winning strategy into
second place. In our experimental analysis, we show hat our strategy
successfully and consistently builds collaborations with many different
heterogeneous (and sometimes very sophisticated) adversaries.