-
We study Bayesian hypernetworks: a framework for approximate Bayesian
inference in neural networks. A Bayesian hypernetwork $\h$ is a neural network
which learns to transform a simple noise distribution, $p(\vec\epsilon) =
\N(\vec 0,\mat I)$, to a distribution $q(\pp) := q(h(\vec\epsilon))$ over the
parameters $\pp$ of another neural network (the "primary network")\@. We train
$q$ with variational inference, using an invertible $\h$ to enable efficient
estimation of the variational lower bound on the posterior $p(\pp | \D)$ via
sampling. In contrast to most methods for Bayesian deep learning, Bayesian
hypernets can represent a complex multimodal approximate posterior with
correlations between parameters, while enabling cheap iid sampling of~$q(\pp)$.
In practice, Bayesian hypernets can provide a better defense against
adversarial examples than dropout, and also exhibit competitive performance on
a suite of tasks which evaluate model uncertainty, including regularization,
active learning, and anomaly detection.
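
As a minimal sketch of the sampling step described above, assume a toy invertible hypernetwork that is only an elementwise affine map of the noise (an illustrative stand-in, not the architecture used in the paper). Because the map is invertible, $\log q(\pp)$ follows from the change-of-variables formula. The names `sample_params` and `log_q` are hypothetical.

```python
import math
import random

def sample_params(mu, sigma, rng):
    """Draw eps ~ N(0, I) and push it through the toy invertible map h(eps) = mu + sigma * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    theta = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return theta, eps

def log_q(eps, sigma):
    """Change of variables: log q(theta) = log p(eps) - log|det dh/deps|."""
    log_p_eps = sum(-0.5 * e * e - 0.5 * math.log(2.0 * math.pi) for e in eps)
    log_det = sum(math.log(s) for s in sigma)  # Jacobian of the affine map is diag(sigma)
    return log_p_eps - log_det

# Hypothetical two-parameter "primary network".
rng = random.Random(0)
mu, sigma = [0.5, -1.0], [2.0, 0.1]
theta, eps = sample_params(mu, sigma, rng)
print(theta, log_q(eps, sigma))
```

Each call to `sample_params` yields an independent draw from $q(\pp)$ together with the noise needed to evaluate its log-density, which is what a sampled estimate of the variational lower bound requires.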
-
Normalizing flows and autoregressive models have been successfully combined
to produce state-of-the-art results in density estimation, via Masked
Autoregressive Flows (MAF), and to accelerate state-of-the-art WaveNet-based
speech synthesis to 20x faster than real-time, via Inverse Autoregressive Flows
(IAF). We unify and generalize these approaches, replacing the (conditionally)
affine univariate transformations of MAF/IAF with a more general class of
invertible univariate transformations expressed as monotonic neural networks.
We demonstrate that the proposed neural autoregressive flows (NAF) are
universal approximators for continuous probability distributions, and their
greater expressivity allows them to better capture multimodal target
distributions. Experimentally, NAF yields state-of-the-art performance on a
suite of density estimation tasks and outperforms IAF in variational
autoencoders trained on binarized MNIST.
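
One simple way to build a monotonic univariate transformation of the kind described above is a positively weighted mixture of sigmoids with positive slopes. The sketch below assumes fixed parameters; in an autoregressive flow they would be produced by a conditioner network from the preceding variables. The function names and the bisection-based inverse are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def monotonic_transform(x, slopes, biases, weights):
    """Strictly increasing in x whenever all slopes and weights are positive."""
    assert all(a > 0 for a in slopes) and all(w > 0 for w in weights)
    total = sum(weights)
    return sum(w / total * sigmoid(a * x + b)
               for a, b, w in zip(slopes, biases, weights))

def invert(y, slopes, biases, weights, lo=-50.0, hi=50.0, iters=100):
    """Invert by bisection, which is valid because the map is monotonic."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if monotonic_transform(mid, slopes, biases, weights) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical parameters for a two-unit transform.
params = ([1.0, 3.0], [0.0, -2.0], [0.4, 0.6])
y = monotonic_transform(0.7, *params)
print(y, invert(y, *params))
```

With a single sigmoid unit the map is a squashed affine transformation; adding units lets the same construction bend in several places while remaining invertible.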
-
Learning distributed sentence representations remains an interesting problem
in the field of Natural Language Processing (NLP). We want to learn a model
that approximates the conditional latent space over the representations of a
logical antecedent of the given statement. In our paper, we propose an approach
to generating sentences, conditioned on an input sentence and a logical
inference label. We do this by modeling the different possibilities for the
output sentence as a distribution over the latent representation, which we
train using an adversarial objective. We evaluate the model using two
state-of-the-art models for the Recognizing Textual Entailment (RTE) task, and
measure the BLEU scores against the actual sentences as a probe for the
diversity of sentences produced by our model. The experimental results show that,
given our framework, we have clear ways to improve the quality and diversity of
generated sentences.
-
We argue that the estimation of mutual information between high dimensional
continuous random variables can be achieved by gradient descent over neural
networks. We present a Mutual Information Neural Estimator (MINE) that is
linearly scalable in dimensionality as well as in sample size, trainable
through back-prop, and strongly consistent. We present a handful of
applications on which MINE can be used to minimize or maximize mutual
information. We apply MINE to improve adversarially trained generative models.
We also use MINE to implement Information Bottleneck, applying it in tasks
related to supervised classification; our results demonstrate substantial
improvement in flexibility and performance in these settings.
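
To make the idea of estimating mutual information by optimization concrete, the sketch below evaluates a Donsker-Varadhan style lower bound for a fixed statistic $T(x, z)$. The choice of this bound, the shuffling used to approximate the product of marginals, and the name `dv_lower_bound` are assumptions not stated in the text above; in practice $T$ would be a neural network trained by gradient ascent on this quantity.

```python
import math
import random

def dv_lower_bound(xs, zs, statistic, rng):
    """Mean of T over joint pairs minus log-mean-exp of T over pairs
    whose z values were shuffled to break the dependence on x."""
    shuffled = zs[:]
    rng.shuffle(shuffled)
    n = len(xs)
    joint_term = sum(statistic(x, z) for x, z in zip(xs, zs)) / n
    marginal_term = math.log(
        sum(math.exp(statistic(x, z)) for x, z in zip(xs, shuffled)) / n
    )
    return joint_term - marginal_term

# Hypothetical data: z is a copy of x, so the two are strongly dependent.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
print(dv_lower_bound(xs, xs, lambda x, z: 0.3 * x * z, rng))
```

A constant statistic gives a bound of zero, so any positive value reflects dependence that the statistic has picked up.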
-
Learning inter-domain mappings from unpaired data can improve performance in
structured prediction tasks, such as image segmentation, by reducing the need
for paired data. CycleGAN was recently proposed for this problem, but
critically assumes the underlying inter-domain mapping is approximately
deterministic and one-to-one. This assumption renders the model ineffective for
tasks requiring flexible, many-to-many mappings. We propose a new model, called
Augmented CycleGAN, which learns many-to-many mappings between domains. We
examine Augmented CycleGAN qualitatively and quantitatively on several image
datasets.
-
We propose a neural language model capable of unsupervised syntactic
structure induction. The model leverages the structure information to form
better semantic representations and better language modeling. Standard
recurrent neural networks are limited by their structure and fail to
efficiently use syntactic information. On the other hand, tree-structured
recursive networks usually require additional structural supervision at the
cost of human expert annotation. In this paper, we propose a novel neural
language model, called the Parsing-Reading-Predict Networks (PRPN), that can
simultaneously induce the syntactic structure from unannotated sentences and
leverage the inferred structure to learn a better language model. In our model,
the gradient can be directly back-propagated from the language model loss into
the neural parsing network. Experiments show that the proposed model can
discover the underlying syntactic structure and achieve state-of-the-art
performance on word/character-level language model tasks.
-
We propose a novel hierarchical generative model with a simple Markovian
structure and a corresponding inference model. Both the generative and
inference models are trained using the adversarial learning paradigm. We
demonstrate that the hierarchical structure supports the learning of
progressively more abstract representations as well as providing semantically
meaningful reconstructions with different levels of fidelity. Furthermore, we
show that minimizing the Jensen-Shannon divergence between the generative and
inference network is enough to minimize the reconstruction error. The resulting
semantically meaningful hierarchical latent structure discovery is exemplified
on the CelebA dataset. There, we show that the features learned by our model in
an unsupervised way outperform the best handcrafted features. Furthermore, the
extracted features remain competitive when compared to several recent deep
supervised approaches on an attribute prediction task on CelebA. Finally, we
leverage the model's inference network to achieve state-of-the-art performance
on a semi-supervised variant of the MNIST digit classification task.
-
In data-mining applications, we are frequently faced with a large fraction of
missing entries in the data matrix, which is problematic for most discriminant
machine learning algorithms. A solution that we explore in this paper is the
use of a generative model (a mixture of Gaussians) to compute the conditional
expectation of the missing variables given the observed variables. Since
training a Gaussian mixture with many different patterns of missing values can
be computationally very expensive, we introduce a spanning-tree based algorithm
that significantly speeds up training in these conditions. We also observe that
good results can be obtained by using the generative model to fill in the
missing values for a separate discriminant learning algorithm.
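
The conditional expectation used for imputation can be sketched for the smallest interesting case: a mixture of bivariate Gaussians in which the first coordinate is observed and the second is missing. The function names and the component encoding are hypothetical, and this sketch does not include the spanning-tree speed-up.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def impute_second(x1, components):
    """E[x2 | x1] under a mixture of bivariate Gaussians.

    Each component is (weight, (mu1, mu2), (var1, cov12, var2)).
    """
    resp, cond_means = [], []
    for weight, (mu1, mu2), (var1, cov12, _var2) in components:
        # Responsibility of this component given only the observed coordinate.
        resp.append(weight * normal_pdf(x1, mu1, var1))
        # Gaussian conditional mean: mu2 + cov12 / var1 * (x1 - mu1).
        cond_means.append(mu2 + cov12 / var1 * (x1 - mu1))
    total = sum(resp)
    return sum(r / total * m for r, m in zip(resp, cond_means))

# Hypothetical two-component mixture.
mixture = [
    (0.5, (-1.0, -5.0), (1.0, 0.5, 2.0)),
    (0.5, (1.0, 5.0), (1.0, 0.5, 2.0)),
]
print(impute_second(0.8, mixture))
```

The imputed value can then be passed, together with the observed entries, to a separate discriminant learner.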
-
Generative Adversarial Networks (GANs) are powerful generative models, but
suffer from training instability. The recently proposed Wasserstein GAN (WGAN)
makes progress toward stable training of GANs, but sometimes can still generate
only low-quality samples or fail to converge. We find that these problems are
often due to the use of weight clipping in WGAN to enforce a Lipschitz
constraint on the critic, which can lead to undesired behavior. We propose an
alternative to clipping weights: penalize the norm of the gradient of the critic
with respect to its input. Our proposed method performs better than standard
WGAN and enables stable training of a wide variety of GAN architectures with
almost no hyperparameter tuning, including 101-layer ResNets and language
models over discrete data. We also achieve high quality generations on CIFAR-10
and LSUN bedrooms.
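
The penalty on the critic's input gradient can be sketched with a toy critic whose gradient is available in closed form. The two-sided form $(\|\nabla_x f(x)\|_2 - 1)^2$, the coefficient `lam`, and the evaluation at random interpolates between real and generated points are assumptions not stated in the text above; the critic $f(x) = \tanh(w \cdot x + b)$ is purely illustrative.

```python
import math
import random

def critic_grad(x, w, b):
    """Gradient with respect to the input x of the toy critic f(x) = tanh(w . x + b)."""
    pre = sum(wi * xi for wi, xi in zip(w, x)) + b
    scale = 1.0 - math.tanh(pre) ** 2
    return [scale * wi for wi in w]

def gradient_penalty(real, fake, w, b, lam, rng):
    """Average of lam * (||grad_x f(x_hat)||_2 - 1)^2 over interpolated points."""
    total = 0.0
    for xr, xf in zip(real, fake):
        t = rng.random()
        x_hat = [t * r + (1.0 - t) * f for r, f in zip(xr, xf)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w, b)))
        total += lam * (norm - 1.0) ** 2
    return total / len(real)

# Hypothetical batch of two real and two generated points.
rng = random.Random(0)
real = [[1.0, 2.0], [0.5, -0.5]]
fake = [[0.0, 0.0], [2.0, 1.0]]
print(gradient_penalty(real, fake, [0.3, -0.2], 0.1, 10.0, rng))
```

In a real training loop this term would be added to the critic loss and the gradient computed by automatic differentiation rather than by hand.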
-
In this paper, we study two aspects of the variational autoencoder (VAE): the
prior distribution over the latent variables and its corresponding posterior.
First, we decompose the learning of VAEs into layerwise density estimation, and
argue that having a flexible prior is beneficial to both sample generation and
inference. Second, we analyze the family of inverse autoregressive flows
(inverse AF) and show that with further improvement, inverse AF could be used
as a universal approximator of any complicated posterior. Our analysis results
in a unified approach to parameterizing a VAE, without the need to restrict
ourselves to use factorial Gaussians in the latent real space.
-
Achieving artificial visual reasoning - the ability to answer image-related
questions which require a multi-step, high-level process - is an important step
towards artificial general intelligence. This multi-modal task requires
learning a question-dependent, structured reasoning process over images from
language. Standard deep learning approaches tend to exploit biases in the data
rather than learn this underlying structure, while leading methods learn to
visually reason successfully but are hand-crafted for reasoning. We show that a
general-purpose, Conditional Batch Normalization approach achieves
state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4%
error rate. We outperform the next best end-to-end method (4.5%) and even
methods that use extra supervision (3.1%). We probe our model to shed light on
how it reasons, showing it has learned a question-dependent, multi-step
process. Previous work has operated under the assumption that visual reasoning
calls for a specialized architecture, but we show that a general architecture
with proper conditioning can learn to visually reason effectively.
-
Advances in neural variational inference have facilitated the learning of
powerful directed graphical models with continuous latent variables, such as
variational autoencoders. The hope is that such models will learn to represent
rich, multi-modal latent factors in real-world data, such as natural language
text. However, current models often assume simplistic priors on the latent
variables - such as the uni-modal Gaussian distribution - which are incapable
of representing complex latent factors efficiently. To overcome this
restriction, we propose the simple, but highly flexible, piecewise constant
distribution. This distribution has the capacity to represent an exponential
number of modes of a latent target distribution, while remaining mathematically
tractable. Our results demonstrate that incorporating this new latent
distribution into different models yields substantial improvements in natural
language processing tasks such as document modeling and natural language
generation for dialogue.
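
A piecewise constant density is easy to sample by inverting its CDF. The sketch below assumes the support is $[0, 1)$ split into equal-width pieces with positive, unnormalized weights; this parameterization and the name `sample_piecewise_constant` are illustrative assumptions rather than details given above.

```python
import random

def sample_piecewise_constant(weights, u):
    """Map a uniform draw u in [0, 1) to a sample from a density that is
    constant on len(weights) equal-width pieces of [0, 1), with the mass
    of each piece proportional to its (positive) weight."""
    n = len(weights)
    total = sum(weights)
    cumulative = 0.0
    for i, a in enumerate(weights):
        mass = a / total
        if u < cumulative + mass or i == n - 1:
            # Within piece i the density is flat, so the position is linear in u.
            frac = (u - cumulative) / mass
            return (i + min(max(frac, 0.0), 1.0)) / n
        cumulative += mass

# Hypothetical four-piece density with two tall pieces (two modes).
rng = random.Random(0)
samples = [sample_piecewise_constant([5.0, 1.0, 5.0, 1.0], rng.random())
           for _ in range(5)]
print(samples)
```

Because the sample is a piecewise linear function of `u`, it remains differentiable with respect to the weights almost everywhere, which is what makes such a latent variable usable with reparameterized gradients.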
-
We introduce a general-purpose conditioning method for neural networks called
FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network
computation via a simple, feature-wise affine transformation based on
conditioning information. We show that FiLM layers are highly effective for
visual reasoning - answering image-related questions which require a
multi-step, high-level process - a task which has proven difficult for standard
deep learning methods that do not explicitly model reasoning. Specifically, we
show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error
for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are
robust to ablations and architectural modifications, and 4) generalize well to
challenging, new data from few examples or even zero-shot.
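
The feature-wise affine transformation can be written in a few lines. In this sketch each feature map is a flat list of activations, and the linear `film_generator` that maps conditioning information to the scale and shift is a hypothetical stand-in for whatever network produces them.

```python
def film(features, gamma, beta):
    """Feature-wise affine modulation: out[c] = gamma[c] * features[c] + beta[c]."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

def film_generator(cond, w_gamma, w_beta):
    """Hypothetical linear map from a conditioning vector to (gamma, beta)."""
    gamma = [sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    return gamma, beta

# Two feature maps with two activations each, conditioned on a 2-d vector.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = film_generator([1.0, 0.5],
                             [[2.0, 0.0], [0.0, 0.0]],
                             [[1.0, 0.0], [-1.0, 0.0]])
print(film(features, gamma, beta))
```

Setting a scale to zero switches a feature map off entirely, while a negative scale inverts it, so a single layer can select or suppress features depending on the conditioning input.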
-
We propose zoneout, a novel method for regularizing RNNs. At each timestep,
zoneout stochastically forces some hidden units to maintain their previous
values. Like dropout, zoneout uses random noise to train a pseudo-ensemble,
improving generalization. But by preserving instead of dropping hidden units,
gradient information and state information are more readily propagated through
time, as in feedforward stochastic depth networks. We perform an empirical
investigation of various RNN regularizers, and find that zoneout gives
significant performance improvements across tasks. We achieve competitive
results with relatively simple models in character- and word-level language
modelling on the Penn Treebank and Text8 datasets, and combining with recurrent
batch normalization yields state-of-the-art results on permuted sequential
MNIST.
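
A single zoneout update can be sketched as follows. The per-unit Bernoulli mask follows the description above; using the expected update at test time is an assumption made here by analogy with dropout, and `zoneout_step` is a hypothetical name.

```python
import random

def zoneout_step(h_prev, h_new, zoneout_prob, rng, training=True):
    """Keep each hidden unit's previous value with probability zoneout_prob."""
    if training:
        return [hp if rng.random() < zoneout_prob else hn
                for hp, hn in zip(h_prev, h_new)]
    # Assumed test-time behaviour: the expectation of the random mask.
    return [zoneout_prob * hp + (1.0 - zoneout_prob) * hn
            for hp, hn in zip(h_prev, h_new)]

# Hypothetical hidden states before and after an RNN transition.
rng = random.Random(0)
h_prev = [0.1, 0.2, 0.3, 0.4]
h_new = [0.9, 0.8, 0.7, 0.6]
print(zoneout_step(h_prev, h_new, 0.5, rng))
```

Units that are zoned out pass their value forward unchanged, which gives an identity path through time for both the state and its gradient.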
-
We propose a new self-organizing hierarchical softmax formulation for
neural-network-based language models over large vocabularies. Instead of using
a predefined hierarchical structure, our approach is capable of learning word
clusters with clear syntactical and semantic meaning during the language model
training process. We provide experiments on standard benchmarks for language
modeling and sentence compression tasks. We find that this approach is as fast
as other efficient softmax approximations, while achieving comparable or even
better performance relative to similar full softmax models.
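
For reference, a two-level hierarchical softmax factorizes the probability of a word into a cluster term and a within-cluster term. The sketch below assumes a fixed word-to-cluster assignment given as a dictionary; learning that assignment during training, which is the contribution described above, is not shown, and all names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def word_probability(word, cluster_of, cluster_logits, word_logits):
    """p(word) = p(cluster) * p(word | cluster) for a two-level hierarchy."""
    c = cluster_of[word]
    members = sorted(w for w, k in cluster_of.items() if k == c)
    p_cluster = softmax(cluster_logits)[c]
    # The inner softmax only normalizes over words in the same cluster.
    p_within = softmax([word_logits[w] for w in members])[members.index(word)]
    return p_cluster * p_within

# Hypothetical three-word vocabulary split into two clusters.
cluster_of = {"cat": 0, "dog": 0, "run": 1}
cluster_logits = [0.2, -0.4]
word_logits = {"cat": 1.0, "dog": 0.0, "run": 3.0}
print(word_probability("cat", cluster_of, cluster_logits, word_logits))
```

The saving comes from normalizing over the clusters and then over one cluster's members, instead of over the whole vocabulary.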
-
It is commonly assumed that language refers to high-level visual concepts
while leaving low-level visual processing unaffected. This view dominates the
current literature in computational models for language-vision tasks, where
visual and linguistic input are mostly processed independently before being
fused into a single representation. In this paper, we deviate from this classic
pipeline and propose to modulate the \emph{entire visual processing} by
linguistic input. Specifically, we condition the batch normalization parameters
of a pretrained residual network (ResNet) on a language embedding. This
approach, which we call MOdulated RESnet (\MRN), significantly improves strong
baselines on two visual question answering tasks. Our ablation study shows that
modulating from the early stages of the visual processing is beneficial.
-
We examine the role of memorization in deep learning, drawing connections to
capacity, generalization, and adversarial robustness. While deep networks are
capable of memorizing noise data, our results suggest that they tend to
prioritize learning simple patterns first. In our experiments, we expose
qualitative differences in gradient-based optimization of deep neural networks
(DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned
explicit regularization (e.g., dropout) we can degrade DNN training performance
on noise datasets without compromising generalization on real data. Our
analysis suggests that the notions of effective capacity which are dataset
independent are unlikely to explain the generalization performance of deep
networks when trained with gradient-based methods because training data itself
plays an important role in determining the degree of memorization.
-
Generative Adversarial Networks (GANs) have gathered a lot of attention from
the computer vision community, yielding impressive results for image
generation. Advances in the adversarial generation of natural language from
noise, however, are not commensurate with the progress made in generating images,
and still lag far behind likelihood-based methods. In this paper, we take a
step towards generating natural language with a GAN objective alone. We
introduce a simple baseline that addresses the discrete output space problem
without relying on gradient estimators and show that it is able to achieve
state-of-the-art results on a Chinese poem generation dataset. We present
quantitative results on generating sentences from context-free and
probabilistic context-free grammars, and qualitative language modeling results.
A conditional version is also described that can generate sequences conditioned
on sentence characteristics.
-
End-to-end design of dialogue systems has recently become a popular research
topic thanks to powerful tools such as encoder-decoder architectures for
sequence-to-sequence learning. Yet, most current approaches cast human-machine
dialogue management as a supervised learning problem, aiming at predicting the
next utterance of a participant given the full history of the dialogue. This
vision is too simplistic to render the intrinsic planning problem inherent to
dialogue as well as its grounded nature, making the context of a dialogue
larger than the sole history. This is why only chit-chat and question answering
tasks have been addressed so far using end-to-end architectures. In this paper,
we introduce a Deep Reinforcement Learning method to optimize visually grounded
task-oriented dialogues, based on the policy gradient algorithm. This approach
is tested on a dataset of 120k dialogues collected through Mechanical Turk and
provides encouraging results at solving both the problem of generating natural
dialogues and the task of discovering a specific object in a complex picture.
-
We present an approach to training neural networks to generate sequences
using actor-critic methods from reinforcement learning (RL). Current
log-likelihood training methods are limited by the discrepancy between their
training and testing modes, as models must generate tokens conditioned on their
previous guesses rather than the ground-truth tokens. We address this problem
by introducing a \textit{critic} network that is trained to predict the value
of an output token, given the policy of an \textit{actor} network. This results
in a training procedure that is much closer to the test phase, and allows us to
directly optimize for a task-specific score such as BLEU. Crucially, since we
leverage these techniques in the supervised learning setting rather than the
traditional RL setting, we condition the critic network on the ground-truth
output. We show that our method leads to improved performance on both a
synthetic task and German-English machine translation. Our analysis paves
the way for such methods to be applied in natural language generation tasks,
such as machine translation, caption generation, and dialogue modelling.
-
We propose a reparameterization of LSTM that brings the benefits of batch
normalization to recurrent neural networks. Whereas previous works only apply
batch normalization to the input-to-hidden transformation of RNNs, we
demonstrate that it is both possible and beneficial to batch-normalize the
hidden-to-hidden transition, thereby reducing internal covariate shift between
time steps. We evaluate our proposal on various sequential problems such as
sequence classification, language modeling and question answering. Our
empirical results show that our batch-normalized LSTM consistently leads to
faster convergence and improved generalization.
-
In this paper, we propose to equip Generative Adversarial Networks with the
ability to produce direct energy estimates for samples. Specifically, we propose
a flexible adversarial training framework, and prove this framework not only
ensures the generator converges to the true data distribution, but also enables
the discriminator to retain the density information at the global optimum. We
derive the analytic form of the induced solution, and analyze the properties.
In order to make the proposed framework trainable in practice, we introduce two
effective approximation techniques. Empirically, the experimental results closely
match our theoretical analysis, verifying that the discriminator is able to
recover the energy of the data distribution.
-
We introduce the adversarially learned inference (ALI) model, which jointly
learns a generation network and an inference network using an adversarial
process. The generation network maps samples from stochastic latent variables
to the data space while the inference network maps training examples in data
space to the space of latent variables. An adversarial game is cast between
these two networks and a discriminative network is trained to distinguish
between joint latent/data-space samples from the generative network and joint
samples from the inference network. We illustrate the ability of the model to
learn mutually coherent inference and generation networks through the
inspections of model samples and reconstructions and confirm the usefulness of
the learned representations by obtaining a performance competitive with
state-of-the-art on the semi-supervised SVHN and CIFAR10 tasks.
-
In this paper we propose a novel model for unconditional audio generation
based on generating one audio sample at a time. We show that our model, which
profits from combining memory-less modules, namely autoregressive multilayer
perceptrons, and stateful recurrent neural networks in a hierarchical structure,
is able to capture underlying sources of variations in the temporal sequences
over very long time spans, on three datasets of different nature. Human
evaluation on the generated samples indicates that our model is preferred over
competing models. We also show how each component of the model contributes to
the exhibited performance.
-
We introduce GuessWhat?!, a two-player guessing game as a testbed for
research on the interplay of computer vision and dialogue systems. The goal of
the game is to locate an unknown object in a rich image scene by asking a
sequence of questions. Higher-level image understanding, like spatial reasoning
and language grounding, is required to solve the proposed task. Our key
contribution is the collection of a large-scale dataset consisting of 150K
human-played games with a total of 800K visual question-answer pairs on 66K
images. We explain our design decisions in collecting the dataset and introduce
the oracle and questioner tasks that are associated with the two players of the
game. We prototyped deep learning models to establish initial baselines for the
introduced tasks.