
Topic models have been widely explored as probabilistic generative models of
documents. Traditional inference methods have sought closed-form derivations
for updating the models; however, as the expressiveness of these models grows,
so does the difficulty of performing fast and accurate inference over their
parameters. This paper presents alternative neural approaches to topic
modelling by providing parameterisable distributions over topics which permit
training by backpropagation in the framework of neural variational inference.
In addition, with the help of a stick-breaking construction, we propose a
recurrent network that is able to discover a notionally unbounded number of
topics, analogous to Bayesian nonparametric topic models. Experimental results
on the MXM Song Lyrics, 20NewsGroups and Reuters News datasets demonstrate the
effectiveness and efficiency of these neural topic models.
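
As an illustration of the stick-breaking construction mentioned above, the
following minimal sketch turns a sequence of break fractions into topic
proportions. It is not the recurrent network proposed in the paper: the
function name `stick_breaking` and the fixed example fractions are
assumptions made for illustration, and in the paper's setting the fractions
would come from a trained network.

```python
import numpy as np

def stick_breaking(fractions):
    """Map break fractions in (0, 1) to proportions that sum to 1."""
    fractions = np.asarray(fractions, dtype=float)
    # Stick length left before each break: 1, (1-f1), (1-f1)(1-f2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions)))
    # Each topic takes its fraction of what is left at that point.
    pieces = fractions * remaining[:-1]
    # The final entry keeps whatever stick is still unbroken.
    return np.concatenate((pieces, remaining[-1:]))

# Hypothetical fractions: three breaks give four topic proportions.
proportions = stick_breaking([0.5, 0.5, 0.5])
print(proportions)
```

Because each further break only subdivides the remaining stick, more topics
can be added without renormalising the earlier ones, which is what makes the
number of topics notionally unbounded.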

Words in natural language follow a Zipfian distribution whereby some words
are frequent but most are rare. Learning representations for words in the "long
tail" of this distribution requires enormous amounts of data. Representations
of rare words trained directly on end tasks are usually poor, requiring us to
pretrain embeddings on external data, or treat all rare words as
out-of-vocabulary words with a unique representation. We provide a method for
predicting embeddings of rare words on the fly from small amounts of auxiliary
data with a network trained end-to-end for the downstream task. We show that
this improves results against baselines where embeddings are trained on the end
task for reading comprehension, recognizing textual entailment and language
modeling.

We introduce a new dataset of logical entailments for the purpose of
measuring models' ability to capture and exploit the structure of logical
expressions against an entailment prediction task. We use this task to compare
a series of architectures which are ubiquitous in the sequence-processing
literature, in addition to a new model class, PossibleWorldNets, which
computes entailment as a "convolution over possible worlds". Results show that
convolutional networks present the wrong inductive bias for this class of
problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM
RNNs due to their enhanced ability to exploit the syntax of logic, and
PossibleWorldNets outperform all benchmarks.

Artificial Neural Networks are powerful function approximators capable of
modelling solutions to a wide variety of problems, both supervised and
unsupervised. As their size and expressivity increase, so too does the
variance of the model, yielding a nearly ubiquitous overfitting problem.
Although mitigated by a variety of model regularisation methods, the common
cure is to seek large amounts of training data (which is not necessarily
easily obtained) that sufficiently approximates the data distribution of the
domain we wish to test on. In contrast, logic programming methods such as
Inductive Logic Programming offer an extremely data-efficient process by which
models can be trained to reason on symbolic domains. However, these methods are
unable to deal with the variety of domains neural networks can be applied to:
they are not robust to noise in or mislabelling of inputs, and perhaps more
importantly, cannot be applied to non-symbolic domains where the data is
ambiguous, such as operating on raw pixels. In this paper, we propose a
Differentiable Inductive Logic framework, which not only solves tasks that
traditional ILP systems are suited for, but also shows a robustness to noise and
error in the training data which ILP cannot cope with. Furthermore, as it is
trained by backpropagation against a likelihood objective, it can be hybridised
by connecting it with neural networks over ambiguous data in order to be
applied to domains which ILP cannot address, while providing data efficiency
and generalisation beyond what neural networks on their own can achieve.

We formulate sequence to sequence transduction as a noisy channel decoding
problem and use recurrent neural networks to parameterise the source and
channel models. Unlike direct models which can suffer from explaining-away
effects during training, noisy channel models must produce outputs that explain
their inputs, and their component models can be trained with not only paired
training samples but also unpaired samples from the marginal output
distribution. Using a latent variable to control how much of the conditioning
sequence the channel model needs to read in order to generate a subsequent
symbol, we obtain a tractable and effective beam search decoder. Experimental
results on abstractive sentence summarisation, morphological inflection, and
machine translation show that noisy channel models outperform direct models,
and that they significantly benefit from increased amounts of unpaired output
data that direct models cannot easily use.
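
The noisy channel formulation can be illustrated with a small reranking
sketch: by Bayes' rule, the best output y for an input x maximises
log p(y) + log p(x | y), the sum of the source and channel model scores.
This is only the underlying decision rule, not the paper's beam search
decoder; the candidate strings and probabilities below are hypothetical
values chosen for illustration.

```python
import math

def noisy_channel_score(log_p_source, log_p_channel):
    """Combine source model log p(y) and channel model log p(x | y)."""
    return log_p_source + log_p_channel

def rerank(candidates):
    """Pick the output with the highest combined score.

    candidates: list of (output, log p(y), log p(x | y)) tuples.
    """
    best = max(candidates, key=lambda c: noisy_channel_score(c[1], c[2]))
    return best[0]

# Hypothetical scores: the second candidate explains the input slightly
# better, but the source model finds it far less plausible as output text.
candidates = [
    ("a cat sat", math.log(0.2), math.log(0.1)),
    ("cat a sat", math.log(0.01), math.log(0.3)),
]
best = rerank(candidates)
print(best)
```

The source model term depends only on y, which is why it can be trained on
unpaired output data.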

We use reinforcement learning to learn tree-structured neural networks for
computing representations of natural language sentences. In contrast with prior
work on tree-structured models in which the trees are either provided as input
or predicted using supervision from explicit treebank annotations, the tree
structures in this work are optimized to improve performance on a downstream
task. Experiments demonstrate the benefit of learning task-specific composition
orders, outperforming both sequential encoders and recursive encoders based on
treebank annotations. We analyze the induced trees and show that while they
discover some linguistically intuitive structures (e.g., noun phrases, simple
verb phrases), they are different from conventional English syntactic
structures.

We present a novel semi-supervised approach for sequence transduction and
apply it to semantic parsing. The unsupervised component is based on a
generative model in which latent sentences generate the unpaired logical forms.
We apply this method to a number of semantic parsing tasks focusing on domains
with limited access to labelled training data and extend those datasets with
synthetically generated logical forms.

Many language generation tasks require the production of text conditioned on
both structured and unstructured inputs. We present a novel neural network
architecture which generates an output sequence conditioned on an arbitrary
number of input functions. Crucially, our approach allows both the choice of
conditioning context and the granularity of generation, for example characters
or tokens, to be marginalised, thus permitting scalable and effective training.
Using this framework, we address the problem of generating programming code
from a mixed natural language and structured specification. We create two new
data sets for this paradigm derived from the collectible trading card games
Magic: The Gathering and Hearthstone. On these, and a third pre-existing corpus,
we demonstrate that marginalising multiple predictors allows our model to
outperform strong benchmarks.

While most approaches to automatically recognizing entailment relations have
used classifiers employing hand-engineered features derived from complex
natural language processing pipelines, in practice their performance has been
only slightly better than bag-of-word pair classifiers using only lexical
similarity. The only attempt so far to build an end-to-end differentiable
neural network for entailment failed to outperform such a simple similarity
classifier. In this paper, we propose a neural model that reads two sentences
to determine entailment using long short-term memory units. We extend this
model with a word-by-word neural attention mechanism that encourages reasoning
over entailments of pairs of words and phrases. Furthermore, we present a
qualitative analysis of attention weights produced by this model, demonstrating
such reasoning capabilities. On a large entailment dataset this model
outperforms the previous best neural model and a classifier with engineered
features by a substantial margin. It is the first generic end-to-end
differentiable system that achieves state-of-the-art accuracy on a textual
entailment dataset.

Teaching machines to read natural language documents remains an elusive
challenge. Machine reading systems can be tested on their ability to answer
questions posed on the contents of documents that they have seen, but until now
large-scale training and test datasets have been missing for this type of
evaluation. In this work we define a new methodology that resolves this
bottleneck and provides large-scale supervised reading comprehension data. This
allows us to develop a class of attention-based deep neural networks that learn
to read real documents and answer complex questions with minimal prior
knowledge of language structure.

Recently, strong results have been demonstrated by Deep Recurrent Neural
Networks on natural language transduction problems. In this paper we explore
the representational power of these models using synthetic grammars designed to
exhibit phenomena similar to those found in real transduction problems such as
machine translation. These experiments lead us to propose new memory-based
recurrent networks that implement continuously differentiable analogues of
traditional data structures such as Stacks, Queues, and DeQues. We show that
these architectures exhibit superior generalisation performance to Deep RNNs
and are often able to learn the underlying generating algorithms in our
transduction experiments.

This paper aims to explore the effect of prior disambiguation on neural
network based compositional models, with the hope that better semantic
representations for text compounds can be produced. We disambiguate the input
word vectors before they are fed into a compositional deep net. A series of
evaluations shows the positive effect of prior disambiguation for such deep
models.

Many successful approaches to semantic parsing build on top of the syntactic
analysis of text, and make use of distributional representations or statistical
models to match parses to ontology-specific queries. This paper presents a
novel deep learning architecture which provides a semantic parsing system
through the union of two neural models of language semantics. It allows for the
generation of ontology-specific queries from natural language statements and
questions without the need for parsing, which makes it especially suitable for
grammatically malformed or syntactically atypical text, such as tweets, as well
as permitting the development of semantic parsers for resource-poor languages.

The ability to accurately represent sentences is central to language
understanding. We describe a convolutional architecture dubbed the Dynamic
Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of
sentences. The network uses Dynamic k-Max Pooling, a global pooling operation
over linear sequences. The network handles input sentences of varying length
and induces a feature graph over the sentence that is capable of explicitly
capturing short- and long-range relations. The network does not rely on a parse
tree and is easily applicable to any language. We test the DCNN in four
experiments: small-scale binary and multi-class sentiment prediction, six-way
question classification and Twitter sentiment prediction by distant
supervision. The network achieves excellent performance in the first three
tasks and a greater than 25% error reduction in the last task with respect to
the strongest baseline.
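
To make the pooling operation concrete, here is a minimal sketch of k-max
pooling over a sequence: for each feature row it keeps the k largest
activations in their original left-to-right order. The function name
`k_max_pooling` and the toy feature map are assumptions for illustration;
the dynamic part, in which k is chosen as a function of sentence length and
network depth, is not shown.

```python
import numpy as np

def k_max_pooling(features, k):
    """Keep the k largest values in each row, preserving their order.

    features: array of shape (num_features, sequence_length), 1 <= k <= length.
    """
    features = np.asarray(features, dtype=float)
    # Column indices of the k largest entries per row (in arbitrary order).
    top = np.argpartition(features, -k, axis=1)[:, -k:]
    # Sorting the indices restores the original sequence order.
    top.sort(axis=1)
    return np.take_along_axis(features, top, axis=1)

# Hypothetical feature map: 2 features over a sentence of 5 positions.
feature_map = [[1, 5, 2, 8, 3],
               [9, 0, 4, 7, 6]]
pooled = k_max_pooling(feature_map, k=3)
print(pooled)
```

Unlike plain max pooling, the output retains the relative order of the
selected activations, and its width depends only on k, which is how inputs
of varying length can be mapped to a fixed-size representation.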

This thesis is about the problem of compositionality in distributional
semantics. Distributional semantics presupposes that the meanings of words are
a function of their occurrences in textual contexts. It models words as
distributions over these contexts and represents them as vectors in high
dimensional spaces. The problem of compositionality for such models concerns
itself with how to produce representations for larger units of text by
composing the representations of smaller units of text.
This thesis focuses on a particular approach to this compositionality
problem, namely using the categorical framework developed by Coecke, Sadrzadeh,
and Clark, which combines syntactic analysis formalisms with distributional
semantic representations of meaning to produce syntactically motivated
composition operations. This thesis shows how this approach can be
theoretically extended and practically implemented to produce concrete
compositional distributional models of natural language semantics. It
furthermore demonstrates that such models can perform on par with, or better
than, other competing approaches in the field of natural language processing.
There are three principal contributions to computational linguistics in this
thesis. The first is to extend the DisCoCat framework on the syntactic front
and semantic front, incorporating a number of syntactic analysis formalisms and
providing learning procedures allowing for the generation of concrete
compositional distributional models. The second contribution is to evaluate the
models developed from the procedures presented here, showing that they
outperform other compositional distributional models present in the literature.
The third contribution is to show how using category theory to solve linguistic
problems forms a sound basis for research, illustrated by examples of work on
this topic, which also suggest directions for future research.

We discuss an algorithm which produces the meaning of a sentence given
meanings of its words, and its resemblance to quantum teleportation. In fact,
this protocol was the main source of inspiration for this algorithm which has
many applications in the area of Natural Language Processing.

With the increasing empirical success of distributional models of
compositional semantics, it is timely to consider the types of textual logic
that such models are capable of capturing. In this paper, we address
shortcomings in the ability of current models to capture logical operations
such as negation. As a solution we propose a tripartite formulation for a
continuous vector space representation of semantics and subsequently use this
representation to develop a formal compositional notion of negation within such
models.

The development of compositional distributional models of semantics
reconciling the empirical aspects of distributional semantics with the
compositional aspects of formal semantics is a popular topic in the
contemporary literature. This paper seeks to bring this reconciliation one step
further by showing how the mathematical constructs commonly used in
compositional distributional models, such as tensors and matrices, can be used
to simulate different aspects of predicate logic.
This paper discusses how the canonical isomorphism between tensors and
multilinear maps can be exploited to simulate a full-blown quantifier-free
predicate calculus using tensors. It provides tensor interpretations of the set
of logical connectives required to model propositional calculi. It suggests a
variant of these tensor calculi capable of modelling quantifiers, using few
nonlinear operations. It finally discusses the relation between these
variants, and how this relation should constitute the subject of future work.
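
A small sketch may help show how tensors can play the role of logical
connectives. It assumes one possible encoding, chosen here for
illustration: truth values are the two basis vectors of a two-dimensional
space, negation is the matrix that swaps them, and a binary connective is an
order-3 tensor read off from its truth table. The names `connective` and
`apply_binary` are hypothetical helpers.

```python
import numpy as np

# Assumed encoding: truth values as basis vectors.
TRUE = np.array([1.0, 0.0])
FALSE = np.array([0.0, 1.0])

# Negation as a linear map: the matrix that swaps the two basis vectors.
NOT = np.array([[0.0, 1.0],
                [1.0, 0.0]])

def connective(truth_table):
    """Build an order-3 tensor T[out, left, right] from a Boolean function."""
    tensor = np.zeros((2, 2, 2))
    for i, left in enumerate((True, False)):
        for j, right in enumerate((True, False)):
            out = 0 if truth_table(left, right) else 1
            tensor[out, i, j] = 1.0
    return tensor

def apply_binary(tensor, left, right):
    """Contract the tensor with both argument vectors (a bilinear map)."""
    return np.einsum("oij,i,j->o", tensor, left, right)

AND = connective(lambda a, b: a and b)
OR = connective(lambda a, b: a or b)

print(NOT @ TRUE)                      # negation of true
print(apply_binary(AND, TRUE, FALSE))  # true and false
print(apply_binary(OR, TRUE, FALSE))   # true or false
```

Applying a connective is then ordinary tensor contraction, which is the
sense in which a matrix corresponds to a linear map and an order-3 tensor to
a bilinear map.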

The Distributional Compositional Categorical (DisCoCat) model is a
mathematical framework that provides compositional semantics for meanings of
natural language sentences. It consists of a computational procedure for
constructing meanings of sentences, given their grammatical structure in terms
of compositional type-logic, and given the empirically derived meanings of
their words. For the particular case that the meaning of words is modelled
within a distributional vector space model, its experimental predictions,
derived from real large-scale data, have outperformed other empirically
validated methods that could build vectors for a full sentence. This success
can be attributed to a conceptually motivated mathematical underpinning, by
integrating qualitative compositional type-logic and quantitative modelling of
meaning within a category-theoretic mathematical framework.
The type-logic used in the DisCoCat model is Lambek's pregroup grammar.
Pregroup types form a posetal compact closed category, which can be passed, in
a functorial manner, on to the compact closed structure of vector spaces,
linear maps and tensor product. The diagrammatic versions of the equational
reasoning in compact closed categories can be interpreted as the flow of word
meanings within sentences. Pregroups simplify Lambek's previous type-logic, the
Lambek calculus, which has been extensively used to formalise and reason about
various linguistic phenomena. The apparent reliance of the DisCoCat on
pregroups has been seen as a shortcoming. This paper addresses this concern, by
pointing out that one may as well realise a functorial passage from the
original type-logic of Lambek, a monoidal biclosed category, to vector spaces,
or to any other model of meaning organised within a monoidal biclosed
category. The corresponding string diagram calculus, due to Baez and Stay, now
depicts the flow of word meanings.

We present a model for compositional distributional semantics related to the
framework of Coecke et al. (2010), and emulating formal semantics by
representing functions as tensors and arguments as vectors. We introduce a new
learning method for tensors, generalising the approach of Baroni and Zamparelli
(2010). We evaluate it on two benchmark data sets, and find it to outperform
existing leading methods. We argue in our analysis that the nature of this
learning method also renders it suitable for solving more subtle problems
compositional distributional models might face.

Formal and distributional semantic models offer complementary benefits in
modeling meaning. The categorical compositional distributional (DisCoCat) model
of meaning of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) combines aspects of
both to provide a general framework in which meanings of words, obtained
distributionally, are composed using methods from the logical setting to form
sentence meaning. Concrete consequences of this general abstract setting and
applications to empirical data are under active study (Grefenstette et al.,
arXiv:1101.0309; Grefenstette and Sadrzadeh, arXiv:1106.4058v1 [cs.CL]). In
this paper, we extend this study by examining transitive verbs, represented as
matrices in a DisCoCat. We discuss three ways of constructing such matrices,
and evaluate each method in a disambiguation task developed by Grefenstette and
Sadrzadeh (arXiv:1106.4058v1 [cs.CL]).

Modelling compositional meaning for sentences using empirical distributional
methods has been a challenge for computational linguists. We implement the
abstract categorical model of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) using
data from the BNC and evaluate it. The implementation is based on unsupervised
learning of matrices for relational words and applying them to the vectors of
their arguments. The evaluation is based on the word disambiguation task
developed by Mitchell and Lapata (2008) for intransitive sentences, and on a
similar new experiment designed for transitive sentences. Our model matches the
results of its competitors in the first experiment, and betters them in the
second. The general improvement in results with increase in syntactic
complexity showcases the compositional power of our model.

We provide an overview of the hybrid compositional distributional model of
meaning, developed in Coecke et al. (arXiv:1003.4394v1 [cs.CL]), which is based
on the categorical methods also applied to the analysis of information flow in
quantum protocols. The mathematical setting stipulates that the meaning of a
sentence is a linear function of the tensor products of the meanings of its
words. We provide concrete constructions for this definition and present
techniques to build vector spaces for meaning vectors of words, as well as those
of sentences. The applicability of these methods is demonstrated via a toy
vector space as well as real data from the British National Corpus and two
disambiguation experiments.

Coecke, Sadrzadeh, and Clark (arXiv:1003.4394v1 [cs.CL]) developed a
compositional model of meaning for distributional semantics, in which each word
in a sentence has a meaning vector and the distributional meaning of the
sentence is a function of the tensor products of the word vectors. Abstractly
speaking, this function is the morphism corresponding to the grammatical
structure of the sentence in the category of finite dimensional vector spaces.
In this paper, we provide a concrete method for implementing this linear
meaning map, by constructing a corpus-based vector space for the type of
sentence. Our construction method is based on structured vector spaces whereby
meaning vectors of all sentences, regardless of their grammatical structure,
live in the same vector space. Our proposed sentence space is the tensor
product of two noun spaces, in which the basis vectors are pairs of words each
augmented with a grammatical role. This enables us to compare meanings of
sentences by simply taking the inner product of their vectors.
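
The idea of placing transitive sentences in the tensor product of two noun
spaces and comparing them by inner product can be sketched as follows. The
details are assumptions made for illustration rather than the paper's exact
procedure: noun vectors are toy two-dimensional vectors, the verb is built
by summing outer products of the subject and object vectors it was observed
with, and the sentence is the elementwise product of the verb with the outer
product of its actual subject and object.

```python
import numpy as np

def verb_matrix(observed_pairs):
    """Assumed construction: sum of subject (x) object outer products."""
    return sum(np.outer(subj, obj) for subj, obj in observed_pairs)

def sentence(subj, verb, obj):
    """Sentence meaning as an element of the noun (x) noun space."""
    return verb * np.outer(subj, obj)

def similarity(a, b):
    """Inner product of two sentence meanings in the same space."""
    return float(np.sum(a * b))

# Hypothetical noun vectors over two context dimensions.
dog = np.array([1.0, 0.0])
cat = np.array([0.8, 0.2])
man = np.array([0.0, 1.0])

chase = verb_matrix([(dog, cat)])   # toy corpus: "dog chases cat" only
s1 = sentence(dog, chase, cat)
s2 = sentence(man, chase, cat)
print(similarity(s1, s1), similarity(s1, s2))
```

Because every sentence lands in the same space regardless of its words, any
two sentence meanings can be compared directly with `similarity`.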