
We present a novel recurrent neural network (RNN) architecture that combines
the remembering ability of unitary RNNs with the ability of gated RNNs to
effectively forget redundant information in the input sequence. We achieve this
by extending Unitary RNNs with a gating mechanism. Our model is able to
outperform LSTMs, GRUs and Unitary RNNs on different benchmark tasks, as the
ability to simultaneously remember long-term dependencies and forget irrelevant
information in the input sequence helps with many natural long-term sequential
tasks such as language modeling and question answering. We provide competitive
results along with an analysis of our model on the bAbI Question Answering
task, Penn Treebank, as well as synthetic tasks that involve long-term
dependencies such as parenthesis, denoising and copying tasks.

We investigate the integration of a planning mechanism into an
encoder-decoder architecture with an explicit alignment for character-level
machine translation. We develop a model that plans ahead when it computes
alignments between the source and target sequences, constructing a matrix of
proposed future alignments and a commitment vector that governs whether to
follow or recompute the plan. This mechanism is inspired by the strategic
attentive reader and writer (STRAW) model. Our proposed model is end-to-end
trainable with fully differentiable operations. We show that it outperforms a
strong baseline on three character-level decoder neural machine translation
tasks on the WMT'15 corpus. Our analysis demonstrates that our model can compute
qualitatively intuitive alignments and achieves superior performance with fewer
parameters.

We propose a recurrent neural model that generates natural-language questions
from documents, conditioned on answers. We show how to train the model using a
combination of supervised and reinforcement learning. After teacher forcing for
standard maximum likelihood training, we fine-tune the model using policy
gradient techniques to maximize several rewards that measure question quality.
Most notably, one of these rewards is the performance of a question-answering
system. We motivate question generation as a means to improve the performance
of question-answering systems. Our model is trained and evaluated on the recent
question-answering dataset SQuAD.

We extend the neural Turing machine (NTM) model into a dynamic neural Turing
machine (DNTM) by introducing a trainable memory addressing scheme. This
addressing scheme maintains two separate vectors for each memory cell: a content
vector and an address vector. This allows the DNTM to learn a wide variety of
location-based addressing strategies, including both linear and nonlinear ones.
We implement the DNTM with both continuous, differentiable and discrete,
non-differentiable read/write mechanisms. We investigate the mechanisms and
effects of learning to read and write into a memory through experiments on
Facebook bAbI tasks using both a feedforward and a GRU controller. The DNTM is
evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM
baselines. We provide an extensive analysis of our model and different
variations of the NTM on the bAbI tasks. We also provide further experimental results on
sequential pMNIST, Stanford Natural Language Inference, associative recall and
copy tasks.
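
As a rough illustration of the addressing idea, the sketch below reads from a memory in which every cell holds an address vector and a content vector. The cosine similarity, the sharpness parameter `beta`, and the use of `argmax` for the discrete variant are assumptions of this sketch, not details given in the abstract.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def dntm_read(address, content, key, beta=1.0, discrete=False):
    """Read from a memory whose cells are [address ; content] pairs.

    address: (N, d_a) address vectors, content: (N, d_c) content vectors,
    key: (d_a + d_c,) query. `beta` and the cosine similarity are
    assumptions of this sketch.
    """
    memory = np.concatenate([address, content], axis=1)
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    weights = softmax(beta * (memory @ key) / norms)
    if discrete:
        # Hard, non-differentiable addressing: select a single cell.
        one_hot = np.zeros_like(weights)
        one_hot[np.argmax(weights)] = 1.0
        weights = one_hot
    return weights @ memory, weights
```

With `discrete=False` the read vector is a soft mixture of all cells; with `discrete=True` it is exactly one row of the memory.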

Stochastic gradient algorithms are the main focus of large-scale optimization
problems and have led to important successes in the recent advancement of deep
learning algorithms. The convergence of SGD depends on the careful choice of
learning rate and the amount of the noise in stochastic estimates of the
gradients. In this paper, we propose an adaptive learning rate algorithm, which
utilizes stochastic curvature information of the loss function for
automatically tuning the learning rates. The information about the element-wise
curvature of the loss function is estimated from the local statistics of the
stochastic first-order gradients. We further propose a new variance reduction
technique to speed up the convergence. In our experiments with deep neural
networks, we obtained better performance compared to the popular stochastic
gradient algorithms.

We propose a reparameterization of LSTM that brings the benefits of batch
normalization to recurrent neural networks. Whereas previous works only apply
batch normalization to the input-to-hidden transformation of RNNs, we
demonstrate that it is both possible and beneficial to batch-normalize the
hidden-to-hidden transition, thereby reducing internal covariate shift between
time steps. We evaluate our proposal on various sequential problems such as
sequence classification, language modeling and question answering. Our
empirical results show that our batch-normalized LSTM consistently leads to
faster convergence and improved generalization.
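
To make the idea concrete, the following minimal sketch normalizes the input-to-hidden and hidden-to-hidden terms separately, using batch statistics only. The gate ordering, the fixed scale `gamma=0.1`, and the omission of learned shifts and of any normalization of the cell state are simplifying assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_norm(a, gamma=0.1, eps=1e-5):
    # Normalize each feature over the batch axis (training-time statistics).
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return gamma * (a - mean) / np.sqrt(var + eps)

def bn_lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step with batch normalization on both transformations.

    x: (B, d), h_prev and c_prev: (B, n), W_x: (d, 4n), W_h: (n, 4n), b: (4n,).
    """
    pre = batch_norm(x @ W_x) + batch_norm(h_prev @ W_h) + b
    i, f, o, g = np.split(pre, 4, axis=1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c
```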

Recent empirical results on long-term dependency tasks have shown that neural
networks augmented with an external memory can learn the long-term dependency
tasks more easily and achieve better generalization than vanilla recurrent
neural networks (RNN). We suggest that memory-augmented neural networks (MANNs)
can reduce the effects of vanishing gradients by creating shortcut (or wormhole)
connections. Based on this observation, we propose a novel memory-augmented
neural network model called TARDIS (Temporal Automatic Relation Discovery in
Sequences). The controller of TARDIS can store a selective set of embeddings of
its own previous hidden states into an external memory and revisit them as and
when needed. For TARDIS, memory acts as a storage for wormhole connections to
the past to propagate the gradients more effectively and it helps to learn the
temporal dependencies. The memory structure of TARDIS has similarities to both
Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (DNTM), but
both read and write operations of TARDIS are simpler and more efficient. We use
discrete addressing for read/write operations, which helps to substantially
reduce the vanishing gradient problem with very long sequences. Read and write
operations in TARDIS are tied with a heuristic once the memory becomes full,
and this makes the learning problem simpler when compared to NTM- or DNTM-type
architectures. We provide a detailed analysis of gradient propagation in
MANNs in general. We evaluate our models on different long-term dependency
tasks and report competitive results in all of them.

In this work, we model abstractive text summarization using Attentional
Encoder-Decoder Recurrent Neural Networks, and show that they achieve
state-of-the-art performance on two different corpora. We propose several novel
models that address critical problems in summarization that are not adequately
modeled by the basic architecture, such as modeling keywords, capturing the
hierarchy of sentence-to-word structure, and emitting words that are rare or
unseen at training time. Our work shows that many of our proposed models
contribute to further improvement in performance. We also propose a new dataset
consisting of multi-sentence summaries, and establish performance benchmarks
for further research.

The problem of rare and unknown words is an important issue that can
potentially influence the performance of many NLP systems, including both the
traditional count-based and the deep learning models. We propose a novel way to
deal with the rare and unseen words for the neural network models using
attention. Our model uses two softmax layers in order to predict the next word
in conditional language models: one predicts the location of a word in the
source sentence, and the other predicts a word in the shortlist vocabulary. At
each timestep, the decision of which softmax layer to use is made adaptively
by an MLP that is conditioned on the context. We motivate our work with
psychological evidence that humans naturally have a tendency to point towards
objects in the context or the environment when the name of an object is not
known. We observe improvements on two tasks, neural machine translation on the
Europarl English to French parallel corpora and text summarization on the
Gigaword dataset using our proposed model.
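
The two-softmax idea can be sketched as follows. This is only an illustration under stated assumptions: the switch is a single logistic unit standing in for the MLP, the names `W_short`, `w_switch` and `b_switch` are hypothetical, and mixing the two distributions into one joint output is one possible way to combine them.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def pointer_softmax_step(context, attn_scores, W_short, w_switch, b_switch):
    """Combine a location softmax and a shortlist softmax for one timestep.

    context: (d,) decoder context, attn_scores: (T,) unnormalized scores
    over source positions, W_short: (V, d) shortlist output weights.
    """
    p_location = softmax(attn_scores)         # where to point in the source
    p_shortlist = softmax(W_short @ context)  # which shortlist word to emit
    # Switch probability of using the shortlist softmax (stand-in for the MLP).
    z = 1.0 / (1.0 + np.exp(-(w_switch @ context + b_switch)))
    # Joint distribution over [shortlist words ; source positions].
    return np.concatenate([z * p_shortlist, (1.0 - z) * p_location]), z
```

The first `V` entries of the returned vector carry total mass `z` and the remaining `T` entries carry `1 - z`, so the output is a valid distribution over both kinds of prediction.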

The optimization of deep neural networks can be more challenging than
traditional convex optimization problems due to the highly nonconvex nature of
the loss function, e.g. it can involve pathological landscapes such as
saddle-surfaces that can be difficult to escape for algorithms based on simple
gradient descent. In this paper, we attack the problem of optimization of
highly nonconvex neural networks by starting with a smoothed (or
\textit{mollified}) objective function that gradually has a more nonconvex
energy landscape during the training. Our proposition is inspired by the recent
studies in continuation methods: similar to curriculum methods, we begin by
learning an easier (possibly convex) objective function and let it evolve
during the training, until it eventually goes back to being the original,
difficult-to-optimize objective function. The complexity of the mollified
networks is controlled by a single hyperparameter which is annealed during the
training. We show improvements on various difficult optimization tasks and
establish a relationship with recent works on continuation methods for neural
networks and mollifiers.

Over the past decade, large-scale supervised learning corpora have enabled
machine learning researchers to make substantial advances. However, to date,
there are no large-scale question-answer corpora available. In this paper
we present the 30M Factoid Question-Answer Corpus, an enormous question-answer
pair corpus produced by applying a novel neural network architecture on the
knowledge base Freebase to transduce facts into natural language questions. The
produced question-answer pairs are evaluated both by human evaluators and using
automatic evaluation metrics, including well-established machine translation
and sentence similarity metrics. Across all evaluation criteria the
question-generation model outperforms the competing template-based baseline.
Furthermore, when presented to human evaluators, the generated questions appear
comparable in quality to real human-generated questions.

Theano is a Python library that allows one to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently. Since
its introduction, it has been one of the most used CPU and GPU mathematical
compilers, especially in the machine learning community, and has shown steady
performance improvements. Theano has been actively and continuously developed
since 2008; multiple frameworks have been built on top of it, and it has been
used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview
of the Theano software and its community. Section II presents the principal
features of Theano and how to use them, and compares them with other similar
projects. Section III focuses on recently-introduced functionalities and
improvements. Section IV compares the performance of Theano against Torch7 and
TensorFlow on several machine learning models. Section V discusses current
limitations of Theano and potential ways of improving it.

Common nonlinear activation functions used in neural networks can cause
training difficulties due to the saturation behavior of the activation
function, which may hide dependencies that are not visible to vanilla-SGD
(using first-order gradients only). Gating mechanisms that use softly
saturating activation functions to emulate the discrete switching of digital
logic circuits are good examples of this. We propose to exploit the injection
of appropriate noise so that the gradients may flow easily, even if the
noiseless application of the activation function would yield zero gradient.
Large noise will dominate the noise-free gradient and allow stochastic gradient
descent to explore more. By adding noise only to the problematic parts of the
activation function, we allow the optimization procedure to explore the
boundary between the degenerate (saturating) and the well-behaved parts of the
activation function. We also establish connections to simulated annealing, when
the amount of noise is annealed down, making it easier to optimize hard
objective functions. We find experimentally that replacing such saturating
activation functions by noisy variants helps training in many contexts,
yielding state-of-the-art or competitive results on different datasets and
tasks, especially when training seems to be the most difficult, e.g., when
curriculum learning is necessary to obtain good results.
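
The following sketch illustrates the general idea of injecting noise only where the activation saturates. It assumes a hard-tanh nonlinearity and a noise scale proportional to how far the input lies beyond the linear regime; both the scaling rule and the constant `c` are assumptions of this sketch, and the exact noise model used in the experiments may differ.

```python
import numpy as np

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def noisy_hard_tanh(x, rng, c=0.5):
    """Hard tanh with noise added only in the saturated regime."""
    h = hard_tanh(x)
    delta = h - x                # exactly zero wherever the unit is linear
    sigma = c * np.abs(delta)    # noise scale grows with the saturation depth
    return h + sigma * rng.standard_normal(np.shape(x))

rng = np.random.default_rng(0)
x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
y = noisy_hard_tanh(x, rng)
```

Here the three middle entries of `y` equal the corresponding entries of `x`, while the two saturated entries are perturbed around -1 and 1.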

Policies for complex visual tasks have been successfully learned with deep
reinforcement learning, using an approach called deep Q-networks (DQN), but
relatively large (task-specific) networks and extensive training are needed to
achieve good performance. In this work, we present a novel method called policy
distillation that can be used to extract the policy of a reinforcement learning
agent and train a new network that performs at the expert level while being
dramatically smaller and more efficient. Furthermore, the same method can be
used to consolidate multiple task-specific policies into a single policy. We
demonstrate these claims using the Atari domain and show that the multi-task
distilled agent outperforms the single-task teachers as well as a
jointly-trained DQN agent.
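
The abstract does not state the training objective, so the sketch below shows one plausible choice: a KL divergence between the temperature-sharpened teacher Q-values and the student outputs. The function name and the default `temperature` are assumptions of this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(teacher_q, student_logits, temperature=0.01):
    """Mean KL(teacher || student) over a batch of states.

    teacher_q, student_logits: (B, A) arrays with one row per state
    and one column per action.
    """
    p = softmax(teacher_q / temperature)   # sharpened teacher policy
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return np.mean(np.sum(p * (log_p - log_q), axis=-1))
```

A low temperature makes the teacher distribution close to its greedy action, so the student is pushed mainly toward the action the teacher would take.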

Stochastic gradient algorithms have been the main focus of large-scale
learning problems and have led to important successes in machine learning. The
convergence of SGD depends on the careful choice of learning rate and the
amount of the noise in stochastic estimates of the gradients. In this paper, we
propose a new adaptive learning rate algorithm, which utilizes curvature
information for automatically tuning the learning rates. The information about
the element-wise curvature of the loss function is estimated from the local
statistics of the stochastic first-order gradients. We further propose a new
variance reduction technique to speed up the convergence. In our preliminary
experiments with deep neural networks, we obtained better performance compared
to the popular stochastic gradient algorithms.

In this work, we propose a novel recurrent neural network (RNN) architecture.
The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of
stacking multiple recurrent layers by allowing and controlling signals flowing
from upper recurrent layers to lower layers using a global gating unit for each
pair of layers. The recurrent signals exchanged between layers are gated
adaptively based on the previous hidden states and the current input. We
evaluated the proposed GF-RNN with different types of recurrent units, such as
tanh, long short-term memory and gated recurrent units, on the tasks of
character-level language modeling and Python program evaluation. Our empirical
evaluation of different RNN units revealed that in both tasks, the GF-RNN
outperforms the conventional approaches to build deep stacked RNNs. We suggest
that the improvement arises because the GF-RNN can adaptively assign different
layers to different timescales and layer-to-layer interactions (including the
top-down ones which are not usually present in a stacked RNN) by learning to
gate these interactions.
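
A minimal sketch of one time step of a gated-feedback RNN with tanh units is given below. It assumes a scalar gate for every ordered pair of layers, computed from the current input to the target layer and the concatenation of all previous hidden states; the parameter names `W_in`, `U`, `w_g` and `u_g` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gf_rnn_step(x, h_prev, W_in, U, w_g, u_g):
    """One step of a stacked tanh RNN with gated layer-to-layer recurrence.

    h_prev: list of L hidden vectors from the previous time step.
    W_in[j]: maps the input of layer j (x for j = 0, else the new state of
             layer j - 1) into layer j.
    U[i][j]: recurrent weights from layer i (previous step) to layer j.
    w_g[i][j], u_g[i][j]: gate parameter vectors for the pair (i, j).
    """
    h_all = np.concatenate(h_prev)
    h_new = []
    layer_input = x
    for j in range(len(h_prev)):
        pre = W_in[j] @ layer_input
        for i in range(len(h_prev)):
            # Global scalar gate controlling the signal from layer i to layer j.
            g = sigmoid(w_g[i][j] @ layer_input + u_g[i][j] @ h_all)
            pre = pre + g * (U[i][j] @ h_prev[i])
        h_j = np.tanh(pre)
        h_new.append(h_j)
        layer_input = h_j
    return h_new
```

Setting every gate with `i != j` to zero recovers an ordinary stacked RNN, which is one way to see how the gates let the model choose which cross-layer connections to use.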

Recent work on end-to-end neural network-based architectures for machine
translation has shown promising results for En-Fr and En-De translation.
Arguably, one of the major factors behind this success has been the
availability of high-quality parallel corpora. In this work, we investigate how
to leverage abundant monolingual corpora for neural machine translation.
Compared to a phrase-based and hierarchical baseline, we obtain up to $1.96$
BLEU improvement on the low-resource language pair Turkish-English, and $1.59$
BLEU on the focused domain task of Chinese-English chat messages. While our
method was initially targeted toward such tasks with less parallel data, we
show that it also extends to high-resource languages such as Cs-En and De-En
where we obtain improvements of $0.39$ and $0.47$ BLEU over the neural
machine translation baselines, respectively.

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to
assign one of seven emotions to short video clips extracted from
Hollywood-style movies. The videos depict acted-out emotions under realistic conditions
with a large degree of variation in attributes such as pose and illumination,
making it worthwhile to explore approaches which consider combinations of
features from multiple modalities for label assignment. In this paper we
present our approach to learning several specialist models using deep learning
techniques, each focusing on one modality. Among these are a convolutional
neural network, focusing on capturing visual information in detected faces, a
deep belief net focusing on the representation of the audio stream, a
K-Means-based "bag-of-mouths" model, which extracts visual features around the
mouth region, and a relational autoencoder, which addresses spatio-temporal
aspects of videos. We explore multiple methods for the combination of cues from these
modalities into one common classifier. This achieves a considerably greater
accuracy than predictions from our strongest single-modality classifier. Our
method was the winning submission in the 2013 EmotiW challenge and achieved a
test set accuracy of 47.67% on the 2014 dataset.

In this paper we compare different types of recurrent units in recurrent
neural networks (RNNs). In particular, we focus on more sophisticated units that
implement a gating mechanism, such as a long short-term memory (LSTM) unit and
a recently proposed gated recurrent unit (GRU). We evaluate these recurrent
units on the tasks of polyphonic music modeling and speech signal modeling. Our
experiments revealed that these advanced recurrent units are indeed better than
more traditional recurrent units such as tanh units. Also, we found GRU to be
comparable to LSTM.

In this paper, we propose a novel neural network model called RNN
Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN
encodes a sequence of symbols into a fixed-length vector representation, and
the other decodes the representation into another sequence of symbols. The
encoder and decoder of the proposed model are jointly trained to maximize the
conditional probability of a target sequence given a source sequence. The
performance of a statistical machine translation system is empirically found to
improve by using the conditional probabilities of phrase pairs computed by the
RNN Encoder-Decoder as an additional feature in the existing log-linear model.
Qualitatively, we show that the proposed model learns a semantically and
syntactically meaningful representation of linguistic phrases.

In this paper we propose and investigate a novel nonlinear unit, called $L_p$
unit, for deep neural networks. The proposed $L_p$ unit receives signals from
several projections of a subset of units in the layer below and computes a
normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$
unit. First, the proposed unit can be understood as a generalization of a
number of conventional pooling operators such as average, root-mean-square and
max pooling widely used in, for instance, convolutional neural networks (CNN),
HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain
degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013)
which achieved the state-of-the-art object recognition results on a number of
benchmark datasets. Secondly, we provide a geometrical interpretation of the
activation function based on which we argue that the $L_p$ unit is more
efficient at representing complex, nonlinear separating boundaries. Each $L_p$
unit defines a superelliptic boundary, with its exact shape defined by the
order $p$. We claim that this makes it possible to model arbitrarily shaped,
curved boundaries more efficiently by combining a few $L_p$ units of different
orders. This insight justifies the need for learning different orders for each
unit in the model. We empirically evaluate the proposed $L_p$ units on a number
of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$
units achieve the state-of-the-art results on a number of benchmark datasets.
Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep
recurrent neural networks (RNN).
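
The pooling interpretation can be checked numerically with a small sketch. It assumes the unit computes the normalized $L_p$ norm of a set of linear projections `W @ x`; the function name and the omission of any learned offsets are assumptions of this sketch.

```python
import numpy as np

def lp_unit(x, W, p):
    """Normalized L_p norm of the projections W @ x: (mean |z_i|^p)^(1/p)."""
    z = np.abs(W @ x)
    return np.mean(z ** p) ** (1.0 / p)

x = np.array([1.0, -2.0, 3.0])
W = np.eye(3)  # identity projections, so the unit pools |x| directly

average_like = lp_unit(x, W, p=1)    # mean of absolute values
rms_like = lp_unit(x, W, p=2)        # root-mean-square
max_like = lp_unit(x, W, p=200)      # approaches max |x_i| as p grows
```

Varying only the order `p` moves the same unit between average-like, root-mean-square and max-like behaviour.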

A central challenge to many fields of science and engineering involves
minimizing nonconvex error functions over continuous, high-dimensional spaces.
Gradient descent or quasi-Newton methods are almost ubiquitously used to
perform such minimizations, and it is often thought that a main source of
difficulty for these local methods to find the global minimum is the
proliferation of local minima with much higher error than the global minimum.
Here we argue, based on results from statistical physics, random matrix theory,
neural network theory, and empirical evidence, that a deeper and more profound
difficulty originates from the proliferation of saddle points, not local
minima, especially in high-dimensional problems of practical interest. Such
saddle points are surrounded by high-error plateaus that can dramatically slow
down learning, and give the illusory impression of the existence of a local
minimum. Motivated by these arguments, we propose a new approach to
second-order optimization, the saddle-free Newton method, that can rapidly
escape high-dimensional saddle points, unlike gradient descent and quasi-Newton
methods. We apply this algorithm to deep or recurrent neural network training,
and provide numerical evidence for its superior optimization performance.
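
The abstract does not spell out the update rule. The sketch below shows one way to realize a saddle-free Newton step, by rescaling the gradient with the inverse of the Hessian whose eigenvalues are replaced by their absolute values. The exact eigendecomposition and the `damping` term are assumptions of this sketch that only suit small problems.

```python
import numpy as np

def saddle_free_newton_step(grad, hessian, damping=1e-3):
    """Return -|H|^{-1} g, where |H| uses absolute eigenvalues of H."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    abs_vals = np.abs(eigvals) + damping
    return -eigvecs @ ((eigvecs.T @ grad) / abs_vals)

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
theta = np.array([1.0, 0.1])
grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
hessian = np.diag([2.0, -2.0])
step = saddle_free_newton_step(grad, hessian)
```

In this example the step decreases `x` toward zero but increases `y`, moving away from the saddle along the negative-curvature direction, whereas a plain Newton step would jump onto the saddle.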

In this paper, we explore different ways to extend a recurrent neural network
(RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in
an RNN is not as clear as it is in feedforward neural networks. By carefully
analyzing and understanding the architecture of an RNN, however, we find three
points of an RNN which may be made deeper: (1) input-to-hidden function, (2)
hidden-to-hidden transition and (3) hidden-to-output function. Based on this
observation, we propose two novel architectures of a deep RNN which are
orthogonal to an earlier attempt of stacking multiple recurrent layers to build
a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an
alternative interpretation of these deep RNNs using a novel framework based on
neural operators. The proposed deep RNNs are empirically evaluated on the tasks
of polyphonic music prediction and language modeling. The experimental results
support our claim that the proposed deep RNNs benefit from the depth and
outperform the conventional, shallow RNNs.

We explore the effect of introducing prior information into the intermediate
level of neural networks for a learning task on which all the state-of-the-art
machine learning algorithms tested failed to learn. We motivate our work from
the hypothesis that humans learn such intermediate concepts from other
individuals via a form of supervision or guidance using a curriculum. The
experiments we have conducted provide positive evidence in favor of this
hypothesis. In our experiments, a two-tiered MLP architecture is trained on a
dataset with 64x64 binary input images, each image with three sprites. The
final task is to decide whether all the sprites are the same or one of them is
different. Sprites are pentomino tetris shapes and they are placed in an image
at different locations using scaling and rotation transformations. The first
part of the two-tiered MLP is pre-trained with intermediate-level targets being
the presence of sprites at each location, while the second part takes the
output of the first part as input and predicts the final task's target binary
event. The two-tiered MLP architecture, with a few tens of thousands of examples,
was able to learn the task perfectly, whereas all other algorithms (including
unsupervised pre-training, but also traditional algorithms like SVMs, decision
trees and boosting) perform no better than chance. We hypothesize that the
optimization difficulty involved when the intermediate pre-training is not
performed is due to the {\em composition} of two highly nonlinear tasks. Our
findings are also consistent with hypotheses on cultural learning inspired by
the observations of optimization problems with deep learning, presumably
because of effective local minima.