
We consider the exploration/exploitation problem in reinforcement learning.
For exploitation, it is well known that the Bellman equation connects the value
at any timestep to the expected value at subsequent timesteps. In this paper
we consider a similar \textit{uncertainty} Bellman equation (UBE), which
connects the uncertainty at any timestep to the expected uncertainties at
subsequent timesteps, thereby extending the potential exploratory benefit of a
policy beyond individual timesteps. We prove that the unique fixed point of
the UBE yields an upper bound on the variance of the posterior distribution of
the Qvalues induced by any policy. This bound can be much tighter than
traditional countbased bonuses that compound standard deviation rather than
variance. Importantly, and unlike several existing approaches to optimism, this
method scales naturally to large systems with complex generalization.
Substituting our UBEexploration strategy for $\epsilon$greedy improves DQN
performance on 51 out of 57 games in the Atari suite.

This paper investigates recently proposed approaches for defending against
adversarial examples and evaluating adversarial robustness. The existence of
adversarial examples in trained neural networks reflects the fact that expected
risk alone does not capture the model's performance against worstcase inputs.
We motivate the use of adversarial risk as an objective, although it cannot
easily be computed exactly. We then frame commonly used attacks and evaluation
metrics as defining a tractable surrogate objective to the true adversarial
risk. This suggests that models may be obscured to adversaries, by optimizing
this surrogate rather than the true adversarial risk. We demonstrate that this
is a significant problem in practice by repurposing gradientfree optimization
techniques into adversarial attacks, which we use to decrease the accuracy of
several recently proposed defenses to near zero. Our hope is that our
formulations and results will help researchers to develop more powerful
defenses.

Policy gradient is an efficient technique for improving a policy in a
reinforcement learning setting. However, vanilla online variants are onpolicy
only and not able to take advantage of offpolicy data. In this paper we
describe a new technique that combines policy gradient with offpolicy
Qlearning, drawing experience from a replay buffer. This is motivated by
making a connection between the fixed points of the regularized policy gradient
algorithm and the Qvalues. This connection allows us to estimate the Qvalues
from the action preferences of the policy, to which we apply Qlearning
updates. We refer to the new technique as 'PGQL', for policy gradient and
Qlearning. We also establish an equivalency between actionvalue fitting
techniques and actorcritic algorithms, showing that regularized policy
gradient techniques can be interpreted as advantage function learning
algorithms. We conclude with some numerical examples that demonstrate improved
data efficiency and stability of PGQL. In particular, we tested PGQL on the
full suite of Atari games and achieved performance exceeding that of both
asynchronous advantage actorcritic (A3C) and Qlearning.

We introduce a first order method for solving very large convex cone
programs. The method uses an operator splitting method, the alternating
directions method of multipliers, to solve the homogeneous selfdual embedding,
an equivalent feasibility problem involving finding a nonzero point in the
intersection of a subspace and a cone. This approach has several favorable
properties. Compared to interiorpoint methods, firstorder methods scale to
very large problems, at the cost of requiring more time to reach very high
accuracy. Compared to other firstorder methods for cone programs, our approach
finds both primal and dual solutions when available or a certificate of
infeasibility or unboundedness otherwise, is parameterfree, and the
periteration cost of the method is the same as applying a splitting method to
the primal or dual alone. We discuss efficient implementation of the method in
detail, including direct and indirect methods for computing projection onto the
subspace, scaling the original problem data, and stopping criteria. We describe
an opensource implementation, which handles the usual (symmetric)
nonnegative, secondorder, and semidefinite cones as well as the
(nonselfdual) exponential and power cones and their duals. We report
numerical results that show speedups over interiorpoint cone solvers for large
problems, and scaling to very large general cone programs.

Convex optimization is a powerful tool for resource allocation and signal
processing in wireless networks. As the network density is expected to
drastically increase in order to accommodate the exponentially growing mobile
data traffic, performance optimization problems are entering a new era
characterized by a high dimension and/or a large number of constraints, which
poses significant design and computational challenges. In this paper, we
present a novel twostage approach to solve largescale convex optimization
problems for dense wireless cooperative networks, which can effectively detect
infeasibility and enjoy modeling flexibility. In the proposed approach, the
original largescale convex problem is transformed into a standard cone
programming form in the first stage via matrix stuffing, which only needs to
copy the problem parameters such as channel state information (CSI) and
qualityofservice (QoS) requirements to the prestored structure of the
standard form. The capability of yielding infeasibility certificates and
enabling parallel computing is achieved by solving the homogeneous selfdual
embedding of the primaldual pair of the standard form. In the solving stage,
the operator splitting method, namely, the alternating direction method of
multipliers (ADMM), is adopted to solve the largescale homogeneous selfdual
embedding. Compared with secondorder methods, ADMM can solve largescale
problems in parallel with modest accuracy within a reasonable amount of time.
Simulation results will demonstrate the speedup, scalability, and reliability
of the proposed framework compared with the stateoftheart modeling
frameworks and solvers.

In this paper we demonstrate a simple heuristic adaptive restart technique
that can dramatically improve the convergence rate of accelerated gradient
schemes. The analysis of the technique relies on the observation that these
schemes exhibit two modes of behavior depending on how much momentum is
applied. In what we refer to as the 'high momentum' regime the iterates
generated by an accelerated gradient scheme exhibit a periodic behavior, where
the period is proportional to the square root of the local condition number of
the objective function. This suggests a restart technique whereby we reset the
momentum whenever we observe periodic behavior. We provide analysis to show
that in many cases adaptively restarting allows us to recover the optimal rate
of convergence with no prior knowledge of function parameters.