
Bandit structured prediction describes a stochastic optimization framework
where learning is performed from partial feedback. This feedback is received in
the form of a task loss evaluation to a predicted output structure, without
having access to gold standard structures. We advance this framework by lifting
linear bandit learning to neural sequencetosequence learning problems using
attentionbased recurrent neural networks. Furthermore, we show how to
incorporate control variates into our learning algorithms for variance
reduction and improved generalization. We present an evaluation on a neural
machine translation task that shows improvements of up to 5.89 BLEU points for
domain adaptation from simulated bandit feedback.

We present an approach to interactivepredictive neural machine translation
that attempts to reduce human effort from three directions: Firstly, instead of
requiring humans to select, correct, or delete segments, we employ the idea of
learning from human reinforcements in form of judgments on the quality of
partial translations. Secondly, human effort is further reduced by using the
entropy of word predictions as uncertainty criterion to trigger feedback
requests. Lastly, online updates of the model parameters after every
interaction allow the model to adapt quickly. We show in simulation experiments
that reward signals on partial translations significantly improve character
Fscore and BLEU compared to feedback on full translations only, while human
effort can be reduced to an average number of $5$ feedback requests for every
input.

Counterfactual learning from human bandit feedback describes a scenario where
user feedback on the quality of outputs of a historic system is logged and used
to improve a target system. We show how to apply this learning framework to
neural semantic parsing. From a machine learning perspective, the key challenge
lies in a proper reweighting of the estimator so as to avoid known degeneracies
in counterfactual learning, while still being applicable to stochastic gradient
optimization. To conduct experiments with human users, we devise an easytouse
interface to collect human feedback on semantic parses. Our work is the first
to show that semantic parsers can be improved significantly by counterfactual
learning from logged human feedback data.

We present the first realworld application of methods for improving neural
machine translation (NMT) with human reinforcement, based on explicit and
implicit user feedback collected on the eBay ecommerce platform. Previous work
has been confined to simulation experiments, whereas in this paper we work with
real logged feedback for offline bandit learning of NMT parameters. We conduct
a thorough analysis of the available explicit user judgmentsfivestar
ratings of translation qualityand show that they are not reliable enough to
yield significant improvements in bandit learning. In contrast, we successfully
utilize implicit taskbased feedback collected in a crosslingual search task
to improve taskspecific and machine translation quality metrics.

The goal of counterfactual learning for statistical machine translation (SMT)
is to optimize a target SMT system from logged data that consist of user
feedback to translations that were predicted by another, historic SMT system. A
challenge arises by the fact that riskaverse commercial SMT systems
deterministically log the most probable translation. The lack of sufficient
exploration of the SMT output space seemingly contradicts the theoretical
requirements for counterfactual learning. We show that counterfactual learning
from deterministic bandit logs is possible nevertheless by smoothing out
deterministic components in learning. This can be achieved by additive and
multiplicative control variates that avoid degenerate behavior in empirical
risk minimization. Our simulation experiments show improvements of up to 2 BLEU
points by counterfactual learning from deterministic bandit feedback.

We introduce and describe the results of a novel shared task on bandit
learning for machine translation. The task was organized jointly by Amazon and
Heidelberg University for the first time at the Second Conference on Machine
Translation (WMT 2017). The goal of the task is to encourage research on
learning machine translation from weak user feedback instead of human
references or postedits. On each of a sequence of rounds, a machine
translation system is required to propose a translation for an input, and
receives a realvalued estimate of the quality of the proposed translation for
learning. This paper describes the shared task's learning and evaluation setup,
using services hosted on Amazon Web Services (AWS), the data and evaluation
metrics, and the results of various machine translation architectures and
learning protocols.

Stochastic structured prediction under bandit feedback follows a learning
protocol where on each of a sequence of iterations, the learner receives an
input, predicts an output structure, and receives partial feedback in form of a
task loss evaluation of the predicted structure. We present applications of
this learning scenario to convex and nonconvex objectives for structured
prediction and analyze them as stochastic firstorder methods. We present an
experimental evaluation on problems of natural language processing over
exponential output spaces, and compare convergence speed across different
objectives under the practical criterion of optimal task performance on
development data and the optimizationtheoretic criterion of minimal squared
gradient norm. Best results under both criteria are obtained for a nonconvex
objective for pairwise preference learning under bandit feedback.

We present an approach to improve statistical machine translation of image
descriptions by multimodal pivots defined in visual space. The key idea is to
perform image retrieval over a database of images that are captioned in the
target language, and use the captions of the most similar images for
crosslingual reranking of translation outputs. Our approach does not depend on
the availability of large amounts of indomain parallel data, but only relies
on available large datasets of monolingually captioned images, and on
stateoftheart convolutional neural networks to compute image similarities.
Our experimental evaluation shows improvements of 1 BLEU point over strong
baselines.

We present an approach to structured prediction from bandit feedback, called
Bandit Structured Prediction, where only the value of a task loss function at a
single predicted point, instead of a correct structure, is observed in
learning. We present an application to discriminative reranking in Statistical
Machine Translation (SMT) where the learning algorithm only has access to a
1BLEU loss evaluation of a predicted translation instead of obtaining a gold
standard reference translation. In our experiment bandit feedback is obtained
by evaluating BLEU on reference translations without revealing them to the
algorithm. This can be thought of as a simulation of interactive machine
translation where an SMT system is personalized by a user who provides single
point feedback to predicted translations. Our experiments show that our
approach improves translation quality and is comparable to approaches that
employ more informative feedback in learning.

We present a new approach to stochastic modeling of constraintbased grammars
that is based on loglinear models and uses EM for estimation from unannotated
data. The techniques are applied to an LFG grammar for German. Evaluation on an
exact match task yields 86% precision for an ambiguity rate of 5.4, and 90%
precision on a subcat frame match for an ambiguity rate of 25. Experimental
comparison to training from a parsebank shows a 10% gain from EM training.
Also, a new classbased grammar lexicalization is presented, showing a 10% gain
over unlexicalized models.

This paper presents the use of probabilistic classbased lexica for
disambiguation in targetword selection. Our method employs minimal but precise
contextual information for disambiguation. That is, only information provided
by the targetverb, enriched by the condensed information of a probabilistic
classbased lexicon, is used. Induction of classes and finetuning to verbal
arguments is done in an unsupervised manner by EMbased clustering techniques.
The method shows promising results in an evaluation on realworld translations.

In this thesis, we present two approaches to a rigorous mathematical and
algorithmic foundation of quantitative and statistical inference in
constraintbased natural language processing. The first approach, called
quantitative constraint logic programming, is conceptualized in a clear logical
framework, and presents a sound and complete system of quantitative inference
for definite clauses annotated with subjective weights. This approach combines
a rigorous formal semantics for quantitative inference based on subjective
weights with efficient weightbased pruning for constraintbased systems. The
second approach, called probabilistic constraint logic programming, introduces
a loglinear probability distribution on the proof trees of a constraint logic
program and an algorithm for statistical inference of the parameters and
properties of such probability models from incomplete, i.e., unparsed data. The
possibility of defining arbitrary properties of proof trees as properties of
the loglinear probability model and efficiently estimating appropriate
parameter values for them permits the probabilistic modeling of arbitrary
contextdependencies in constraint logic programs. The usefulness of these
ideas is evaluated empirically in a smallscale experiment on finding the
correct parses of a constraintbased grammar. In addition, we address the
problem of computational intractability of the calculation of expectations in
the inference task and present various techniques to approximately solve this
task. Moreover, we present an approximate heuristic technique for searching for
the most probable analysis in probabilistic constraint logic programs.

Loglinear models provide a statistically sound framework for Stochastic
``UnificationBased'' Grammars (SUBGs) and stochastic versions of other kinds
of grammars. We describe two computationallytractable ways of estimating the
parameters of such grammars from a training corpus of syntactic analyses, and
apply these to estimate a stochastic version of LexicalFunctional Grammar.

This paper describes a method for estimating conditional probability
distributions over the parses of ``unificationbased'' grammars which can
utilize auxiliary distributions that are estimated by other means. We show how
this can be used to incorporate information about lexical selectional
preferences gathered from other sources into Stochastic ``Unificationbased''
Grammars (SUBGs). While we apply this estimator to a Stochastic
LexicalFunctional Grammar, the method is general, and should be applicable to
stochastic versions of HPSGs, categorial grammars and transformational
grammars.

We present a technique for automatic induction of slot annotations for
subcategorization frames, based on induction of hidden classes in the EM
framework of statistical estimation. The models are empirically evalutated by a
general decision test. Induction of slot labeling for subcategorization frames
is accomplished by a further application of EM, and applied experimentally on
frame observations derived from parsing large corpora. We outline an
interpretation of the learned representations as theoreticallinguistic
decompositional lexical entries.

The paper describes an extensive experiment in insideoutside estimation of a
lexicalized probabilistic context free grammar for German verbfinal clauses.
Grammar and formalism features which make the experiment feasible are
described. Successive models are evaluated on precision and recall of phrase
markup.

We present a probabilistic model for constraintbased grammars and a method
for estimating the parameters of such models from incomplete, i.e., unparsed
data. Whereas methods exist to estimate the parameters of probabilistic
contextfree grammars from incomplete data (Baum 1970), so far for
probabilistic grammars involving contextdependencies only parameter estimation
techniques from complete, i.e., fully parsed data have been presented (Abney
1997). However, completedata estimation requires laborintensive, errorprone,
and grammarspecific handannotating of large language corpora. We present a
loglinear probability model for constraint logic programming, and a general
algorithm to estimate the parameters of such models from incomplete data by
extending the estimation algorithm of DellaPietra, DellaPietra, and Lafferty
(1997) to incomplete data settings.

This paper addresses two central problems for probabilistic processing
models: parameter estimation from incomplete data and efficient retrieval of
most probable analyses. These questions have been answered satisfactorily only
for probabilistic regular and contextfree models. We address these problems
for a more expressive probabilistic constraint logic programming model. We
present a loglinear probability model for probabilistic constraint logic
programming. On top of this model we define an algorithm to estimate the
parameters and to select the properties of loglinear models from incomplete
data. This algorithm is an extension of the improved iterative scaling
algorithm of DellaPietra, DellaPietra, and Lafferty (1995). Our algorithm
applies to loglinear models in general and is accompanied with suitable
approximation methods when applied to large data spaces. Furthermore, we
present an approach for searching for most probable analyses of the
probabilistic constraint logic programming model. This method can be applied to
the ambiguity resolution problem in natural language processing applications.

Constraint logic grammars provide a powerful formalism for expressing complex
logical descriptions of natural language phenomena in exact terms. Describing
some of these phenomena may, however, require some form of graded distinctions
which are not provided by such grammars. Recent approaches to weighted
constraint logic grammars attempt to address this issue by adding numerical
calculation schemata to the deduction scheme of the underlying CLP framework.
Currently, these extralogical extensions are not related to the modeltheoretic
counterpart of the operational semantics of CLP, i.e., they do not come with a
formal semantics at all. The aim of this paper is to present a clear formal
semantics for weighted constraint logic grammars, which abstracts away from
specific interpretations of weights, but nevertheless gives insights into the
parsing problem for such weighted grammars. Building on the formalization of
constraint logic grammars in the CLP scheme of Hoehfeld and Smolka 1988, this
formal semantics will be given by a quantitative version of CLP. Such a
quantitative CLP scheme can also be valuable for CLP tasks independent of
grammars.