
We study the use of randomized value functions to guide deep exploration in
reinforcement learning. This offers an elegant means for synthesizing
statistically and computationally efficient exploration with common practical
approaches to value function learning. We present several reinforcement
learning algorithms that leverage randomized value functions and demonstrate
their efficacy through computational studies. We also prove a regret bound that
establishes statistical efficiency with a tabular representation.

This paper considers the optimal adaptive allocation of measurement effort
for identifying the best among a finite set of options or designs. An
experimenter sequentially chooses designs to measure and observes noisy signals
of their quality with the goal of confidently identifying the best design after
a small number of measurements. This paper proposes three simple and intuitive
Bayesian algorithms for adaptively allocating measurement effort, and
formalizes a sense in which these seemingly naive rules are the best possible.
One proposal is toptwo probability sampling, which computes the two designs
with the highest posterior probability of being optimal, and then randomizes to
select among these two. One is a variant of toptwo sampling which considers
not only the probability a design is optimal, but the expected amount by which
its quality exceeds that of other designs. The final algorithm is a modified
version of Thompson sampling that is tailored for identifying the best design.
We prove that these simple algorithms satisfy a sharp optimality property. In a
frequentist setting where the true quality of the designs is fixed, one hopes
the posterior definitively identifies the optimal design, in the sense that
that the posterior probability assigned to the event that some other design is
optimal converges to zero as measurements are collected. We show that under the
proposed algorithms this convergence occurs at an exponential rate, and the
corresponding exponent is the best possible among all allocation

Much of the recent literature on bandit learning focuses on algorithms that
aim to converge on an optimal action. One shortcoming is that this orientation
does not account for time sensitivity, which can play a crucial role when
learning an optimal action requires much more information than nearoptimal
ones. Indeed, popular approaches such as upperconfidencebound methods and
Thompson sampling can fare poorly in such situations. We consider instead
learning a satisficing action, which is nearoptimal while requiring less
information, and propose satisficing Thompson sampling, an algorithm that
serves this purpose. We establish a general bound on expected discounted regret
and study the application of satisficing Thompson sampling to linear and
infinitearmed bandits, demonstrating arbitrarily large benefits over Thompson
sampling. We also discuss the relation between the notion of satisficing and
the theory of rate distortion, which offers guidance on the selection of
satisficing actions.

Thompson sampling is an algorithm for online decision problems where actions
are taken sequentially in a manner that must balance between exploiting what is
known to maximize immediate performance and investing to accumulate new
information that may improve future performance. The algorithm addresses a
broad range of problems in a computationally efficient manner and is therefore
enjoying wide use. This tutorial covers the algorithm and its application,
illustrating concepts through a range of examples, including Bernoulli bandit
problems, shortest path problems, dynamic pricing, recommendation, active
learning with neural networks, and reinforcement learning in Markov decision
processes. Most of these problems involve complex information structures, where
information revealed by taking an action informs beliefs about other actions.
We will also discuss when and why Thompson sampling is or is not effective and
relations to alternative algorithms.

We propose informationdirected sampling  a new approach to online
optimization problems in which a decisionmaker must balance between
exploration and exploitation while learning from partial feedback. Each action
is sampled in a manner that minimizes the ratio between squared expected
singleperiod regret and a measure of information gain: the mutual information
between the optimal action and the next observation. We establish an expected
regret bound for informationdirected sampling that applies across a very
general class of models and scales with the entropy of the optimal action
distribution. We illustrate through simple analytic examples how
informationdirected sampling accounts for kinds of information that
alternative approaches do not adequately address and that this can lead to
dramatic performance gains. For the widely studied Bernoulli, Gaussian, and
linear bandit problems, we demonstrate stateoftheart simulation performance.

The expected improvement (EI) algorithm is a popular strategy for information
collection in optimization under uncertainty. The algorithm is widely known to
be too greedy, but nevertheless enjoys wide use due to its simplicity and
ability to handle uncertainty and noise in a coherent decision theoretic
framework. To provide rigorous insight into EI, we study its properties in a
simple setting of Bayesian optimization where the domain consists of a finite
grid of points. This is the socalled bestarm identification problem, where
the goal is to allocate measurement effort wisely to confidently identify the
best arm using a small number of measurements. In this framework, one can show
formally that EI is far from optimal. To overcome this shortcoming, we
introduce a simple modification of the expected improvement algorithm.
Surprisingly, this simple change results in an algorithm that is asymptotically
optimal for Gaussian bestarm identification problems, and provably outperforms
standard EI by an order of magnitude.

The literature on bandit learning and regret analysis has focused on contexts
where the goal is to converge on an optimal action in a manner that limits
exploration costs. One shortcoming imposed by this orientation is that it does
not treat time preference in a coherent manner. Time preference plays an
important role when the optimal action is costly to learn relative to
nearoptimal actions. This limitation has not only restricted the relevance of
theoretical results but has also influenced the design of algorithms. Indeed,
popular approaches such as Thompson sampling and UCB can fare poorly in such
situations. In this paper, we consider discounted rather than cumulative
regret, where a discount factor encodes time preference. We propose satisficing
Thompson sampling  a variation of Thompson sampling  and establish a strong
discounted regret bound for this new algorithm.

Modern data is messy and highdimensional, and it is often not clear a priori
what are the right questions to ask. Instead, the analyst typically needs to
use the data to search for interesting analyses to perform and hypotheses to
test. This is an adaptive process, where the choice of analysis to be performed
next depends on the results of the previous analyses on the same data.
Ultimately, which results are reported can be heavily influenced by the data.
It is widely recognized that this process, even if wellintentioned, can lead
to biases and false discoveries, contributing to the crisis of reproducibility
in science. But while %the adaptive nature of exploration any dataexploration
renders standard statistical theory invalid, experience suggests that different
types of exploratory analysis can lead to disparate levels of bias, and the
degree of bias also depends on the particulars of the data set. In this paper,
we propose a general information usage framework to quantify and provably bound
the bias and other error metrics of an arbitrary exploratory analysis. We prove
that our mutual information based bound is tight in natural settings, and then
use it to give rigorous insights into when commonly used procedures do or do
not lead to substantially biased estimation. Through the lens of information
usage, we analyze the bias of specific exploration procedures such as
filtering, rank selection and clustering. Our general framework also naturally
motivates randomization techniques that provably reduces exploration bias while
preserving the utility of the data analysis. We discuss the connections between
our approach and related ideas from differential privacy and blinded data
analysis, and supplement our results with illustrative simulations.

We provide an informationtheoretic analysis of Thompson sampling that
applies across a broad range of online optimization problems in which a
decisionmaker must learn from partial feedback. This analysis inherits the
simplicity and elegance of information theory and leads to regret bounds that
scale with the entropy of the optimalaction distribution. This strengthens
preexisting results and yields new insight into how information improves
performance.

This paper considers the use of a simple posterior sampling algorithm to
balance between exploration and exploitation when learning to optimize actions
such as in multiarmed bandit problems. The algorithm, also known as Thompson
Sampling, offers significant advantages over the popular upper confidence bound
(UCB) approach, and can be applied to problems with finite or infinite action
spaces and complicated relationships among action rewards. We make two
theoretical contributions. The first establishes a connection between posterior
sampling and UCB algorithms. This result lets us convert regret bounds
developed for UCB algorithms into Bayesian regret bounds for posterior
sampling. Our second theoretical contribution is a Bayesian regret bound for
posterior sampling that applies broadly and can be specialized to many model
classes. This bound depends on a new notion we refer to as the eluder
dimension, which measures the degree of dependence among action rewards.
Compared to UCB algorithm Bayesian regret bounds for specific model classes,
our general bound matches the best available for linear models and is stronger
than the best available for generalized linear models. Further, our analysis
provides insight into performance advantages of posterior sampling, which are
highlighted through simulation results that demonstrate performance surpassing
recently proposed UCB algorithms.

Most provablyefficient learning algorithms introduce optimism about
poorlyunderstood states and actions to encourage exploration. We study an
alternative approach for efficient exploration, posterior sampling for
reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of
known duration. At the start of each episode, PSRL updates a prior distribution
over Markov decision processes and takes one sample from this posterior. PSRL
then follows the policy that is optimal for this sample during the episode. The
algorithm is conceptually simple, computationally efficient and allows an agent
to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S
\sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the
episode length and $S$ and $A$ are the cardinalities of the state and action
spaces. This bound is one of the first for an algorithm not based on optimism,
and close to the state of the art for any reinforcement learning algorithm. We
show through simulation that PSRL significantly outperforms existing algorithms
with similar regret bounds.