
Bandit structured prediction describes a stochastic optimization framework
where learning is performed from partial feedback. This feedback is received in
the form of a task loss evaluation to a predicted output structure, without
having access to gold standard structures. We advance this framework by lifting
linear bandit learning to neural sequencetosequence learning problems using
attentionbased recurrent neural networks. Furthermore, we show how to
incorporate control variates into our learning algorithms for variance
reduction and improved generalization. We present an evaluation on a neural
machine translation task that shows improvements of up to 5.89 BLEU points for
domain adaptation from simulated bandit feedback.

The goal of counterfactual learning for statistical machine translation (SMT)
is to optimize a target SMT system from logged data that consist of user
feedback to translations that were predicted by another, historic SMT system. A
challenge arises by the fact that riskaverse commercial SMT systems
deterministically log the most probable translation. The lack of sufficient
exploration of the SMT output space seemingly contradicts the theoretical
requirements for counterfactual learning. We show that counterfactual learning
from deterministic bandit logs is possible nevertheless by smoothing out
deterministic components in learning. This can be achieved by additive and
multiplicative control variates that avoid degenerate behavior in empirical
risk minimization. Our simulation experiments show improvements of up to 2 BLEU
points by counterfactual learning from deterministic bandit feedback.

We introduce and describe the results of a novel shared task on bandit
learning for machine translation. The task was organized jointly by Amazon and
Heidelberg University for the first time at the Second Conference on Machine
Translation (WMT 2017). The goal of the task is to encourage research on
learning machine translation from weak user feedback instead of human
references or postedits. On each of a sequence of rounds, a machine
translation system is required to propose a translation for an input, and
receives a realvalued estimate of the quality of the proposed translation for
learning. This paper describes the shared task's learning and evaluation setup,
using services hosted on Amazon Web Services (AWS), the data and evaluation
metrics, and the results of various machine translation architectures and
learning protocols.

Stochastic structured prediction under bandit feedback follows a learning
protocol where on each of a sequence of iterations, the learner receives an
input, predicts an output structure, and receives partial feedback in form of a
task loss evaluation of the predicted structure. We present applications of
this learning scenario to convex and nonconvex objectives for structured
prediction and analyze them as stochastic firstorder methods. We present an
experimental evaluation on problems of natural language processing over
exponential output spaces, and compare convergence speed across different
objectives under the practical criterion of optimal task performance on
development data and the optimizationtheoretic criterion of minimal squared
gradient norm. Best results under both criteria are obtained for a nonconvex
objective for pairwise preference learning under bandit feedback.

We present an approach to structured prediction from bandit feedback, called
Bandit Structured Prediction, where only the value of a task loss function at a
single predicted point, instead of a correct structure, is observed in
learning. We present an application to discriminative reranking in Statistical
Machine Translation (SMT) where the learning algorithm only has access to a
1BLEU loss evaluation of a predicted translation instead of obtaining a gold
standard reference translation. In our experiment bandit feedback is obtained
by evaluating BLEU on reference translations without revealing them to the
algorithm. This can be thought of as a simulation of interactive machine
translation where an SMT system is personalized by a user who provides single
point feedback to predicted translations. Our experiments show that our
approach improves translation quality and is comparable to approaches that
employ more informative feedback in learning.