
The key challenge in multiagent learning is learning a best response to the
behaviour of other agents, which may be nonstationary: if the other agents
adapt their strategy as well, the learning target moves. Disparate streams of
research have approached nonstationarity from several angles, which make a
variety of implicit assumptions that make it hard to keep an overview of the
state of the art and to validate the innovation and significance of new works.
This survey presents a coherent overview of work that addresses
opponentinduced nonstationarity with tools from game theory, reinforcement
learning and multiarmed bandits. Further, we reflect on the principle
approaches how algorithms model and cope with this nonstationarity, arriving
at a new framework and five categories (in increasing order of sophistication):
ignore, forget, respond to target models, learn models, and theory of mind. A
wide range of stateoftheart algorithms is classified into a taxonomy, using
these categories and key characteristics of the environment (e.g.,
observability) and adaptation behaviour of the opponents (e.g., smooth,
abrupt). To clarify even further we present illustrative variations of one
domain, contrasting the strengths and limitations of each category. Finally, we
discuss in which environments the different approaches yield most merit, and
point to promising avenues of future research.

Collective, especially groupbased, managerial decision making is crucial in
organizations. Using an evolutionary theoretic approach to collective decision
making, agentbased simulations were conducted to investigate how human
collective decision making would be affected by the agents' diversity in
problem understanding and/or behavior in discussion, as well as by their social
network structure. Simulation results indicated that groups with consistent
problem understanding tended to produce higher utility values of ideas and
displayed better decision convergence, but only if there was no grouplevel
bias in collective problem understanding. Simulation results also indicated the
importance of balance between selectionoriented (i.e., exploitative) and
variationoriented (i.e., explorative) behaviors in discussion to achieve
quality final decisions. Expanding the group size and introducing nontrivial
social network structure generally improved the quality of ideas at the cost of
decision convergence. Simulations with different social network topologies
revealed collective decision making on smallworld networks with high local
clustering tended to achieve highest decision quality more often than on random
or scalefree networks. Implications of this evolutionary theory and simulation
approach for future managerial research on collective, group, and multilevel
decision making are discussed.

The beer game is a widely used inclass game that is played in supply chain
management classes to demonstrate the bullwhip effect. The game is a
decentralized, multiagent, cooperative problem that can be modeled as a serial
supply chain network in which agents cooperatively attempt to minimize the
total cost of the network even though each agent can only observe its own local
information. Each agent chooses order quantities to replenish its stock. Under
some conditions, a basestock replenishment policy is known to be optimal.
However, in a decentralized supply chain in which some agents (stages) may act
irrationally (as they do in the beer game), there is no known optimal policy
for an agent wishing to act optimally.
We propose a machine learning algorithm, based on deep Qnetworks, to
optimize the replenishment decisions at a given stage. When playing alongside
agents who follow a basestock policy, our algorithm obtains nearoptimal order
quantities. It performs much better than a basestock policy when the other
agents use a more realistic model of human ordering behavior. Unlike most other
algorithms in the literature, our algorithm does not have any limits on the
beer game parameter values. Like any deep learning algorithm, training the
algorithm can be computationally intensive, but this can be performed ahead of
time; the algorithm executes in real time when the game is played. Moreover, we
propose a transfer learning approach so that the training performed for one
agent and one set of cost coefficients can be adapted quickly for other agents
and costs. Our algorithm can be extended to other decentralized multiagent
cooperative games with partially observed information, which is a common type
of situation in realworld supply chain problems.

Electric vehicles (EVs) are expected to be a major component of the smart
grid. The rapid proliferation of EVs will introduce an unprecedented load on
the existing electric grid due to the charging/discharging behavior of the EVs,
thus motivating the need for novel approaches for routing EVs across the grid.
In this paper, a novel gametheoretic framework for smart routing of EVs within
the smart grid is proposed. The goal of this framework is to balance the
electricity load across the grid while taking into account the traffic
congestion and the waiting time at charging stations. The EV routing problem is
formulated as a noncooperative game. For this game, it is shown that selfish
behavior of EVs will result in a purestrategy Nash equilibrium with the price
of anarchy upper bounded by the variance of the ground load induced by the
residential, industrial, or commercial users. Moreover, the results are
extended to capture the stochastic nature of induced ground load as well as the
subjective behavior of the owners of EVs as captured by using notions from the
behavioral framework of prospect theory. Simulation results provide new
insights on more efficient energy pricing at charging stations and under more
realistic grid conditions.

In this paper, we develop a class of decentralized algorithms for solving a
convex resource allocation problem in a network of $n$ agents, where the agent
objectives are decoupled while the resource constraints are coupled. The agents
communicate over a connected undirected graph, and they want to collaboratively
determine a solution to the overall network problem, while each agent only
communicates with its neighbors. We first study the connection between the
decentralized resource allocation problem and the decentralized consensus
optimization problem. Then, using a class of algorithms for solving consensus
optimization problems, we propose a novel class of decentralized schemes for
solving resource allocation problems in a distributed manner. Specifically, we
first propose an algorithm for solving the resource allocation problem with an
$o(1/k)$ convergence rate guarantee when the agents' objective functions are
generally convex (could be nondifferentiable) and per agent local convex
constraints are allowed; We then propose a gradientbased algorithm for solving
the resource allocation problem when per agent local constraints are absent and
show that such scheme can achieve geometric rate when the objective functions
are strongly convex and have Lipschitz continuous gradients. We have also
provided scalability/network dependency analysis. Based on these two
algorithms, we have further proposed a gradient projectionbased algorithm
which can handle smooth objective and simple constraints more efficiently.
Numerical experiments demonstrates the viability and performance of all the
proposed algorithms.

Many important stable matching problems are known to be NPhard, even when
strong restrictions are placed on the input. In this paper we seek to identify
structural properties of instances of stable matching problems which will allow
us to design efficient algorithms using elementary techniques. We focus on the
setting in which all agents involved in some matching problem can be
partitioned into k different types, where the type of an agent determines his
or her preferences, and agents have preferences over types (which may be
refined by more detailed preferences within a single type). This situation
would arise in practice if agents form preferences solely based on some small
collection of agents' attributes. We also consider a generalisation in which
each agent may consider some small collection of other agents to be
exceptional, and rank these in a way that is not consistent with their types;
this could happen in practice if agents have prior contact with a small number
of candidates. We show that (for the case without exceptions), several
wellstudied NPhard stable matching problems including Max SMTI (that of
finding the maximum cardinality stable matching in an instance of stable
marriage with ties and incomplete lists) belong to the parameterised complexity
class FPT when parameterised by the number of different types of agents needed
to describe the instance. For Max SMTI this tractability result can be extended
to the setting in which each agent promotes at most one `exceptional' candidate
to the top of his/her list (when preferences within types are not refined), but
the problem remains NPhard if preference lists can contain two or more
exceptions and the exceptional candidates can be placed anywhere in the
preference lists, even if the number of types is bounded by a constant.

To well understand crowd behavior, microscopic models have been developed in
recent decades, in which an individual's behavioral/psychological status can be
modeled and simulated. A wellknown model is the socialforce model innovated
by physical scientists (Helbing and Molnar, 1995; Helbing, Farkas and Vicsek,
2000; Helbing et al., 2002). This model has been widely accepted and mainly
used in simulation of crowd evacuation in the past decade. A problem, however,
is that the testing results of the model were not explained in consistency with
the psychological findings, resulting in misunderstanding of the model by
psychologists. This paper will bridge the gap between psychological studies and
physical explanation about this model. We reinterpret this physicsbased model
from a psychological perspective, clarifying that the model is consistent with
psychological theories on stress, including timerelated stress and
interpersonal stress. Based on the conception of stress, we renew the model at
both microandmacro level, referring to multiagent simulation in a
microscopic sense and fluidbased analysis in a macroscopic sense. The
cognition and behavior of individual agents are critically modeled as response
to environmental stimuli. Existing simulation results such as fasterisslower
effect will be reinterpreted by YerkesDodson law, and herding and grouping
effect are further discussed by integrating attraction into the social force.
In brief the socialforce model exhibits a bridge between the physics laws and
psychological principles regarding crowd motion, and this paper will renew and
reinterpret the model on the foundation of psychological studies.

Multitarget tracking is an important problem in civilian and military
applications. This paper investigates multitarget tracking in distributed
sensor networks. Data association, which arises particularly in multiobject
scenarios, can be tackled by various solutions. We consider sequential Monte
Carlo implementations of the Probability Hypothesis Density (PHD) filter based
on random finite sets. This approach circumvents the data association issue by
jointly estimating all targets in the region of interest. To this end, we
develop the Diffusion Particle PHD Filter (DPPHDF) as well as a centralized
version, called the MultiSensor Particle PHD Filter (MSPPHDF). Their
performance is evaluated in terms of the Optimal Subpattern Assignment (OSPA)
metric, benchmarked against a distributed extension of the Posterior
Cram\'erRao Lower Bound (PCRLB), and compared to the performance of an
existing distributed PHD Particle Filter. Furthermore, the robustness of the
proposed tracking algorithms against outliers and their performance with
respect to different amounts of clutter is investigated.

We introduce the problem of assigning resources to improve their utilization.
The motivation comes from settings where agents have uncertainty about their
own values for using a resource, and where it is in the interest of a group
that resources be used and not wasted. Done in the right way, improved
utilization maximizes social welfare balancing the utility of a high value
but unreliable agent with the group's preference that resources be used. We
introduce the family of contingent payment mechanisms (CP), which may charge an
agent contingent on use (a penalty). A CP mechanism is parameterized by a
maximum penalty, and has a dominantstrategy equilibrium. Under a set of
axiomatic properties, we establish welfareoptimality for the special case
CP(W), with CP instantiated for a maximum penalty equal to societal value W for
utilization. CP(W) is not dominated for expected welfare by any other
mechanism, and second, amongst mechanisms that always allocate the resource and
have a simple indirect structure, CP(W) strictly dominates every other
mechanism. The special case with no upper bound on penalty, the contingent
secondprice mechanism, maximizes utilization. We extend the mechanisms to
assign multiple, heterogeneous resources, and present a simulation study of the
welfare properties of these mechanisms.

A fundamental challenge in multiagent systems is to design local control
algorithms to ensure a desirable collective behaviour. The information
available to the agents, gathered either through communication or sensing,
naturally restricts the achievable performance. Hence, it is fundamental to
identify what piece of information is valuable and can be exploited to design
control laws with enhanced performance guarantees. This paper studies the case
when such information is uncertain or inaccessible for a class of submodular
resource allocation problems termed covering problems. In the first part of
this work we pinpoint a fundamental riskreward tradeoff faced by the system
operator when conditioning the control design on a valuable but uncertain piece
of information, which we refer to as the cardinality, that represents the
maximum number of agents that can simultaneously select any given resource.
Building on this analysis, we propose a distributed algorithm that allows
agents to learn the cardinality while adjusting their behaviour over time. This
algorithm is proved to perform on par or better to the optimal design obtained
when the exact cardinality is known a priori.

Many societal decision problems lie in highdimensional continuous spaces not
amenable to the voting techniques common for their discrete or
singledimensional counterparts. These problems are typically discretized
before running an election or decided upon through negotiation by
representatives. We propose a algorithm called {\sc Iterative Local Voting} for
collective decisionmaking in this setting. In this algorithm, voters are
sequentially sampled and asked to modify a candidate solution within some local
neighborhood of its current value, as defined by a ball in some chosen norm,
with the size of the ball shrinking at a specified rate.
We first prove the convergence of this algorithm under appropriate choices of
neighborhoods to Pareto optimal solutions with desirable fairness properties in
certain natural settings: when the voters' utilities can be expressed in terms
of some form of distance from their ideal solution, and when these utilities
are additively decomposable across dimensions. In many of these cases, we
obtain convergence to the societal welfare maximizing solution.
We then describe an experiment in which we test our algorithm for the
decision of the U.S. Federal Budget on Mechanical Turk with over 2,000 workers,
employing neighborhoods defined by $\mathcal{L}^1, \mathcal{L}^2$ and
$\mathcal{L}^\infty$ balls. We make several observations that inform future
implementations of such a procedure.

We consider social welfare functions that satisfy Arrow's classic axioms of
independence of irrelevant alternatives and Pareto optimality when the outcome
space is the convex hull of some finite set of alternatives. Individual and
collective preferences are assumed to be continuous and convex, which
guarantees the existence of maximal elements and the consistency of choice
functions that return these elements, even without insisting on transitivity.
We provide characterizations of both the domains of preferences and the social
welfare functions that allow for anonymous Arrovian aggregation. The domains
admit arbitrary preferences over alternatives, which completely determine an
agent's preferences over all mixed outcomes. On these domains, Arrow's
impossibility turns into a complete characterization of a unique social welfare
function, which can be readily applied in settings involving divisible
resources such as probability, time, or money.

We present two distributed methods for the estimation of the kinematic
parameters, the dynamic parameters, and the kinematic state of an unknown
planar body manipulated by a decentralized multiagent system. The proposed
approaches rely on the rigid body kinematics and dynamics, on nonlinear
observation theory, and on consensus algorithms. The only three requirements
are that each agent can exert a 2D wrench on the load, it can measure the
velocity of its contact point, and that the communication graph is connected.
Both theoretical nonlinear observability analysis and convergence proofs are
provided. The first method assumes constant parameters while the second one can
deal with timevarying parameters and can be applied in parallel to any
taskoriented control law. For the cases in which a control law is not
provided, we propose a distributed and safe control strategy satisfying the
observability condition. The effectiveness and robustness of the estimation
strategy is showcased by means of realistic MonteCarlo simulations.

Due to the complexity of the natural world, a programmer cannot foresee all
possible situations, a connected and autonomous vehicle (CAV) will face during
its operation, and hence, CAVs will need to learn to make decisions
autonomously. Due to the sensing of its surroundings and information exchanged
with other vehicles and road infrastructure, a CAV will have access to large
amounts of useful data. While different control algorithms have been proposed
for CAVs, the benefits brought about by connectedness of autonomous vehicles to
other vehicles and to the infrastructure, and its implications on policy
learning has not been investigated in literature. This paper investigates a
data driven driving policy learning framework through an agentbased modelling
approaches. The contributions of the paper are twofold. A dynamic programming
framework is proposed for invehicle policy learning with and without
connectivity to neighboring vehicles. The simulation results indicate that
while a CAV can learn to make autonomous decisions, vehicletovehicle (V2V)
communication of information improves this capability. Furthermore, to overcome
the limitations of sensing in a CAV, the paper proposes a novel concept for
infrastructureled policy learning and communication with autonomous vehicles.
In infrastructureled policy learning, roadside infrastructure senses and
captures successful vehicle maneuvers and learns an optimal policy from those
temporal sequences, and when a vehicle approaches the roadside unit, the
policy is communicated to the CAV. Deepimitation learning methodology is
proposed to develop such an infrastructureled policy learning framework.

Two fundamental algorithmdesign paradigms are Tree Search and Dynamic
Programming. The techniques used therein have been shown to complement one
another when solving the complete set partitioning problem, also known as the
coalition structure generation problem [5]. Inspired by this observation, we
develop in this paper an algorithm to solve the coalition structure generation
problem on graphs, where the goal is to identifying an optimal partition of a
graph into connected subgraphs. More specifically, we develop a new depthfirst
search algorithm, and combine it with an existing dynamic programming algorithm
due to Vinyals et al. [9]. The resulting hybrid algorithm is empirically shown
to significantly outperform both its constituent parts when the
subsetevaluation function happens to have certain intuitive properties.

This paper presents a methodology for simulating the Internet of Things (IoT)
using multilevel simulation models. With respect to conventional simulators,
this approach allows us to tune the level of detail of different parts of the
model without compromising the scalability of the simulation. As a use case, we
have developed a twolevel simulator to study the deployment of smart services
over rural territories. The higher level is base on a coarse grained,
agentbased adaptive parallel and distributed simulator. When needed, this
simulator spawns OMNeT++ model instances to evaluate in more detail the issues
concerned with wireless communications in restricted areas of the simulated
world. The performance evaluation confirms the viability of multilevel
simulations for IoT environments.

In this paper we consider the problem of identifying intersections between
two sets of ddimensional axisparallel rectangles. This is a common problem
that arises in many agentbased simulation studies, and is of central
importance in the context of High Level Architecture (HLA), where it is at the
core of the Data Distribution Management (DDM) service. Several realizations of
the DDM service have been proposed; however, many of them are either
inefficient or inherently sequential. These are serious limitations since
multicore processors are now ubiquitous, and DDM algorithms  being
CPUintensive  could benefit from additional computing power. We propose a
parallel version of the SortBased Matching algorithm for sharedmemory
multiprocessors. SortBased Matching is one of the most efficient serial
algorithms for the DDM problem, but is quite difficult to parallelize due to
data dependencies. We describe the algorithm and compute its asymptotic running
time; we complete the analysis by assessing its performance and scalability
through extensive experiments on two commodity multicore systems based on a
dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.

The price of anarchy and price of stability are three wellstudied
performance metrics that seek to characterize the inefficiency of equilibria in
distributed systems. The distinction between these two performance metrics
centers on the equilibria that they focus on: the price of anarchy
characterizes the quality of the worstperforming equilibria, while the price
of stability characterizes the quality of the bestperforming equilibria. While
much of the literature focuses on these metrics from an analysis perspective,
in this work we consider these performance metrics from a design perspective.
Specifically, we focus on the setting where a system operator is tasked with
designing local utility functions to optimize these performance metrics in a
class of games termed covering games. Our main result characterizes a
fundamental tradeoff between the price of anarchy and price of stability in
the form of a fully explicit Pareto frontier. Within this setup, optimizing the
price of anarchy comes directly at the expense of the price of stability (and
vice versa). Our second results demonstrates how a systemoperator could
incorporate an additional piece of systemlevel information into the design of
the agents' utility functions to breach these limitations and improve the
system's performance. This valuable piece of systemlevel information pertains
to the performance of worst performing agent in the system.

Swarm systems constitute a challenging problem for reinforcement learning
(RL) as the algorithm needs to learn decentralized control policies that can
cope with limited local sensing and communication abilities of the agents.
While it is often difficult to directly define the behavior of the agents,
simple communication protocols can be defined more easily using prior knowledge
about the given task. In this paper, we propose a number of simple
communication protocols that can be exploited by deep reinforcement learning to
find decentralized control policies in a multirobot swarm environment. The
protocols are based on histograms that encode the local neighborhood relations
of the agents and can also transmit taskspecific information, such as the
shortest distance and direction to a desired target. In our framework, we use
an adaptation of Trust Region Policy Optimization to learn complex
collaborative tasks, such as formation building and building a communication
link. We evaluate our findings in a simulated 2Dphysics environment, and
compare the implications of different communication protocols.

We study the computations that Bayesian agents undertake when exchanging
opinions over a network. The agents act repeatedly on their private information
and take myopic actions that maximize their expected utility according to a
fully rational posterior belief. We show that such computations are NPhard for
two natural utility functions: one with binary actions, and another where
agents reveal their posterior beliefs. In fact, we show that distinguishing
between posteriors that are concentrated on different states of the world is
NPhard. Therefore, even approximating the Bayesian posterior beliefs is hard.
We also describe a natural search algorithm to compute agents' actions, which
we call elimination of impossible signals, and show that if the network is
transitive, the algorithm can be modified to run in polynomial time.

Two fundamental problems in computational game theory are computing a Nash
equilibrium and learning to exploit opponents given observations of their play
(opponent exploitation). The latter is perhaps even more important than the
former: Nash equilibrium does not have a compelling theoretical justification
in game classes other than twoplayer zerosum, and for all games one can
potentially do better by exploiting perceived weaknesses of the opponent than
by following a static equilibrium strategy throughout the match. The natural
setting for opponent exploitation is the Bayesian setting where we have a prior
model that is integrated with observations to create a posterior opponent model
that we respond to. The most natural, and a wellstudied prior distribution is
the Dirichlet distribution. An exact polynomialtime algorithm is known for
bestresponding to the posterior distribution for an opponent assuming a
Dirichlet prior with multinomial sampling in normalform games; however, for
imperfectinformation games the best known algorithm is based on approximating
an infinite integral without theoretical guarantees. We present the first exact
algorithm for a natural class of imperfectinformation games. We demonstrate
that our algorithm runs quickly in practice and outperforms the best prior
approaches. We also present an algorithm for the uniform prior setting.

We study the interaction between a fleet of electric, selfdriving vehicles
servicing ondemand transportation requests (referred to as Autonomous
MobilityonDemand, or AMoD, system) and the electric power network. We propose
a model that captures the coupling between the two systems stemming from the
vehicles' charging requirements and captures timevarying customer demand and
power generation costs, road congestion, battery depreciation, and power
transmission and distribution constraints. We then leverage the model to
jointly optimize the operation of both systems. We devise an algorithmic
procedure to losslessly reduce the problem size by bundling customer requests,
allowing it to be efficiently solved by offtheshelf linear programming
solvers. Next, we show that the socially optimal solution to the joint problem
can be enforced as a general equilibrium, and we provide a dual decomposition
algorithm that allows selfinterested agents to compute the market clearing
prices without sharing private information. We assess the performance of the
mode by studying a hypothetical AMoD system in DallasFort Worth and its impact
on the Texas power network. Lack of coordination between the AMoD system and
the power network can cause a 4.4% increase in the price of electricity in
DallasFort Worth; conversely, coordination between the AMoD system and the
power network could reduce electricity expenditure compared to the case where
no cars are present (despite the increased demand for electricity) and yield
savings of up $147M/year. Finally, we provide a recedinghorizon implementation
and assess its performance with agentbased simulations. Collectively, the
results of this paper provide a firstofakind characterization of the
interaction between electricpowered AMoD systems and the power network, and
shed additional light on the economic and societal value of AMoD.

In this paper, we develop a distributed intermittent communication and task
planning framework for mobile robot teams. The goal of the robots is to
accomplish complex tasks, captured by local Linear Temporal Logic formulas, and
share the collected information with all other robots and possibly also with a
user. Specifically, we consider situations where the robot communication
capabilities are not sufficient to form reliable and connected networks while
the robots move to accomplish their tasks. In this case, intermittent
communication protocols are necessary that allow the robots to temporarily
disconnect from the network in order to accomplish their tasks free of
communication constraints. We assume that the robots can only communicate with
each other when they meet at common locations in space. Our distributed control
framework jointly determines local plans that allow all robots fulfill their
assigned temporal tasks, sequences of communication events that guarantee
information exchange infinitely often, and optimal communication locations that
minimize a desired distance metric. Simulation results verify the efficacy of
the proposed controllers.

This article introduces new tools to study selforganisation in a family of
simple cellular automata which contain some particlelike objects with good
collision properties (coalescence) in their time evolution. We draw an initial
configuration at random according to some initial $\sigma$ergodic measure
$\mu$, and use the limit measure to descrbe the asymptotic behaviour of the
automata. We first take a qualitative approach, i.e. we obtain information on
the limit measure(s). We prove that only particles moving in one particular
direction can persist asymptotically. This provides some previously unknown
information on the limit measures of various deterministic and probabilistic
cellular automata: 3 and 4cyclic cellular automata (introduced in [Fis90b]),
onesided captive cellular automata (introduced in [The04]), N. Fat{\`e}s'
candidate to solve the density classification problem [Fat13], self
stabilization process toward a discrete line [RR15]... In a second time we
restrict our study to to a subclass, the gliders cellular automata. For this
class we show quantitative results, consisting in the asymptotic law of some
parameters: the entry times (generalising [KFD11]), the density of particles
and the rate of convergence to the limit measure.

Intelligent systems for the annotation of media content are increasingly
being used for the automation of parts of social science research. In this
domain the problem of integrating various Artificial Intelligence (AI)
algorithms into a single intelligent system arises spontaneously. As part of
our ongoing effort in automating media content analysis for the social
sciences, we have built a modular system by combining multiple AI modules into
a flexible framework in which they can cooperate in complex tasks. Our system
combines data gathering, machine translation, topic classification, extraction
and annotation of entities and social networks, as well as many other tasks
that have been perfected over the past years of AI research. Over the last few
years, it has allowed us to realise a series of scientific studies over a vast
range of applications including comparative studies between news outlets and
media content in different countries, modelling of user preferences, and
monitoring public mood. The framework is flexible and allows the design and
implementation of modular agents, where simple modules cooperate in the
annotation of a large dataset without central coordination.