• Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss evaluation to a predicted output structure, without having access to gold standard structures. We advance this framework by lifting linear bandit learning to neural sequence-to-sequence learning problems using attention-based recurrent neural networks. Furthermore, we show how to incorporate control variates into our learning algorithms for variance reduction and improved generalization. We present an evaluation on a neural machine translation task that shows improvements of up to 5.89 BLEU points for domain adaptation from simulated bandit feedback.
  • We present an approach to interactive-predictive neural machine translation that attempts to reduce human effort from three directions: Firstly, instead of requiring humans to select, correct, or delete segments, we employ the idea of learning from human reinforcements in form of judgments on the quality of partial translations. Secondly, human effort is further reduced by using the entropy of word predictions as uncertainty criterion to trigger feedback requests. Lastly, online updates of the model parameters after every interaction allow the model to adapt quickly. We show in simulation experiments that reward signals on partial translations significantly improve character F-score and BLEU compared to feedback on full translations only, while human effort can be reduced to an average number of $5$ feedback requests for every input.
  • Counterfactual learning from human bandit feedback describes a scenario where user feedback on the quality of outputs of a historic system is logged and used to improve a target system. We show how to apply this learning framework to neural semantic parsing. From a machine learning perspective, the key challenge lies in a proper reweighting of the estimator so as to avoid known degeneracies in counterfactual learning, while still being applicable to stochastic gradient optimization. To conduct experiments with human users, we devise an easy-to-use interface to collect human feedback on semantic parses. Our work is the first to show that semantic parsers can be improved significantly by counterfactual learning from logged human feedback data.
  • We present the first real-world application of methods for improving neural machine translation (NMT) with human reinforcement, based on explicit and implicit user feedback collected on the eBay e-commerce platform. Previous work has been confined to simulation experiments, whereas in this paper we work with real logged feedback for offline bandit learning of NMT parameters. We conduct a thorough analysis of the available explicit user judgments---five-star ratings of translation quality---and show that they are not reliable enough to yield significant improvements in bandit learning. In contrast, we successfully utilize implicit task-based feedback collected in a cross-lingual search task to improve task-specific and machine translation quality metrics.
  • The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback.
  • We introduce and describe the results of a novel shared task on bandit learning for machine translation. The task was organized jointly by Amazon and Heidelberg University for the first time at the Second Conference on Machine Translation (WMT 2017). The goal of the task is to encourage research on learning machine translation from weak user feedback instead of human references or post-edits. On each of a sequence of rounds, a machine translation system is required to propose a translation for an input, and receives a real-valued estimate of the quality of the proposed translation for learning. This paper describes the shared task's learning and evaluation setup, using services hosted on Amazon Web Services (AWS), the data and evaluation metrics, and the results of various machine translation architectures and learning protocols.
  • Stochastic structured prediction under bandit feedback follows a learning protocol where on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analyze them as stochastic first-order methods. We present an experimental evaluation on problems of natural language processing over exponential output spaces, and compare convergence speed across different objectives under the practical criterion of optimal task performance on development data and the optimization-theoretic criterion of minimal squared gradient norm. Best results under both criteria are obtained for a non-convex objective for pairwise preference learning under bandit feedback.
  • We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities. Our experimental evaluation shows improvements of 1 BLEU point over strong baselines.
  • We present an approach to structured prediction from bandit feedback, called Bandit Structured Prediction, where only the value of a task loss function at a single predicted point, instead of a correct structure, is observed in learning. We present an application to discriminative reranking in Statistical Machine Translation (SMT) where the learning algorithm only has access to a 1-BLEU loss evaluation of a predicted translation instead of obtaining a gold standard reference translation. In our experiment bandit feedback is obtained by evaluating BLEU on reference translations without revealing them to the algorithm. This can be thought of as a simulation of interactive machine translation where an SMT system is personalized by a user who provides single point feedback to predicted translations. Our experiments show that our approach improves translation quality and is comparable to approaches that employ more informative feedback in learning.
  • We present a new approach to stochastic modeling of constraint-based grammars that is based on log-linear models and uses EM for estimation from unannotated data. The techniques are applied to an LFG grammar for German. Evaluation on an exact match task yields 86% precision for an ambiguity rate of 5.4, and 90% precision on a subcat frame match for an ambiguity rate of 25. Experimental comparison to training from a parsebank shows a 10% gain from EM training. Also, a new class-based grammar lexicalization is presented, showing a 10% gain over unlexicalized models.
  • This paper presents the use of probabilistic class-based lexica for disambiguation in target-word selection. Our method employs minimal but precise contextual information for disambiguation. That is, only information provided by the target-verb, enriched by the condensed information of a probabilistic class-based lexicon, is used. Induction of classes and fine-tuning to verbal arguments is done in an unsupervised manner by EM-based clustering techniques. The method shows promising results in an evaluation on real-world translations.
  • In this thesis, we present two approaches to a rigorous mathematical and algorithmic foundation of quantitative and statistical inference in constraint-based natural language processing. The first approach, called quantitative constraint logic programming, is conceptualized in a clear logical framework, and presents a sound and complete system of quantitative inference for definite clauses annotated with subjective weights. This approach combines a rigorous formal semantics for quantitative inference based on subjective weights with efficient weight-based pruning for constraint-based systems. The second approach, called probabilistic constraint logic programming, introduces a log-linear probability distribution on the proof trees of a constraint logic program and an algorithm for statistical inference of the parameters and properties of such probability models from incomplete, i.e., unparsed data. The possibility of defining arbitrary properties of proof trees as properties of the log-linear probability model and efficiently estimating appropriate parameter values for them permits the probabilistic modeling of arbitrary context-dependencies in constraint logic programs. The usefulness of these ideas is evaluated empirically in a small-scale experiment on finding the correct parses of a constraint-based grammar. In addition, we address the problem of computational intractability of the calculation of expectations in the inference task and present various techniques to approximately solve this task. Moreover, we present an approximate heuristic technique for searching for the most probable analysis in probabilistic constraint logic programs.
  • Log-linear models provide a statistically sound framework for Stochastic ``Unification-Based'' Grammars (SUBGs) and stochastic versions of other kinds of grammars. We describe two computationally-tractable ways of estimating the parameters of such grammars from a training corpus of syntactic analyses, and apply these to estimate a stochastic version of Lexical-Functional Grammar.
  • This paper describes a method for estimating conditional probability distributions over the parses of ``unification-based'' grammars which can utilize auxiliary distributions that are estimated by other means. We show how this can be used to incorporate information about lexical selectional preferences gathered from other sources into Stochastic ``Unification-based'' Grammars (SUBGs). While we apply this estimator to a Stochastic Lexical-Functional Grammar, the method is general, and should be applicable to stochastic versions of HPSGs, categorial grammars and transformational grammars.
  • We present a technique for automatic induction of slot annotations for subcategorization frames, based on induction of hidden classes in the EM framework of statistical estimation. The models are empirically evalutated by a general decision test. Induction of slot labeling for subcategorization frames is accomplished by a further application of EM, and applied experimentally on frame observations derived from parsing large corpora. We outline an interpretation of the learned representations as theoretical-linguistic decompositional lexical entries.
  • The paper describes an extensive experiment in inside-outside estimation of a lexicalized probabilistic context free grammar for German verb-final clauses. Grammar and formalism features which make the experiment feasible are described. Successive models are evaluated on precision and recall of phrase markup.
  • We present a probabilistic model for constraint-based grammars and a method for estimating the parameters of such models from incomplete, i.e., unparsed data. Whereas methods exist to estimate the parameters of probabilistic context-free grammars from incomplete data (Baum 1970), so far for probabilistic grammars involving context-dependencies only parameter estimation techniques from complete, i.e., fully parsed data have been presented (Abney 1997). However, complete-data estimation requires labor-intensive, error-prone, and grammar-specific hand-annotating of large language corpora. We present a log-linear probability model for constraint logic programming, and a general algorithm to estimate the parameters of such models from incomplete data by extending the estimation algorithm of Della-Pietra, Della-Pietra, and Lafferty (1997) to incomplete data settings.
  • This paper addresses two central problems for probabilistic processing models: parameter estimation from incomplete data and efficient retrieval of most probable analyses. These questions have been answered satisfactorily only for probabilistic regular and context-free models. We address these problems for a more expressive probabilistic constraint logic programming model. We present a log-linear probability model for probabilistic constraint logic programming. On top of this model we define an algorithm to estimate the parameters and to select the properties of log-linear models from incomplete data. This algorithm is an extension of the improved iterative scaling algorithm of Della-Pietra, Della-Pietra, and Lafferty (1995). Our algorithm applies to log-linear models in general and is accompanied with suitable approximation methods when applied to large data spaces. Furthermore, we present an approach for searching for most probable analyses of the probabilistic constraint logic programming model. This method can be applied to the ambiguity resolution problem in natural language processing applications.
  • Constraint logic grammars provide a powerful formalism for expressing complex logical descriptions of natural language phenomena in exact terms. Describing some of these phenomena may, however, require some form of graded distinctions which are not provided by such grammars. Recent approaches to weighted constraint logic grammars attempt to address this issue by adding numerical calculation schemata to the deduction scheme of the underlying CLP framework. Currently, these extralogical extensions are not related to the model-theoretic counterpart of the operational semantics of CLP, i.e., they do not come with a formal semantics at all. The aim of this paper is to present a clear formal semantics for weighted constraint logic grammars, which abstracts away from specific interpretations of weights, but nevertheless gives insights into the parsing problem for such weighted grammars. Building on the formalization of constraint logic grammars in the CLP scheme of Hoehfeld and Smolka 1988, this formal semantics will be given by a quantitative version of CLP. Such a quantitative CLP scheme can also be valuable for CLP tasks independent of grammars.