-
Sequential models achieve state-of-the-art results in audio, visual and
textual domains with respect to both estimating the data distribution and
generating high-quality samples. Efficient sampling for this class of models
has however remained an elusive problem. With a focus on text-to-speech
synthesis, we describe a set of general techniques for reducing sampling time
while maintaining high output quality. We first describe a single-layer
recurrent neural network, the WaveRNN, with a dual softmax layer that matches
the quality of the state-of-the-art WaveNet model. The compact form of the
network makes it possible to generate 24kHz 16-bit audio 4x faster than real
time on a GPU. Second, we apply a weight pruning technique to reduce the number
of weights in the WaveRNN. We find that, for a constant number of parameters,
large sparse networks perform better than small dense networks and this
relationship holds for sparsity levels beyond 96%. The small number of weights
in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile
CPU in real time. Finally, we propose a new generation scheme based on
subscaling that folds a long sequence into a batch of shorter sequences and
allows one to generate multiple samples at once. The Subscale WaveRNN produces
16 samples per step without loss of quality and offers an orthogonal method for
increasing sampling efficiency.
-
Generative models in vision have seen rapid progress due to algorithmic
improvements and the availability of high-quality image datasets. In this
paper, we offer contributions in both these areas to enable similar progress in
audio modeling. First, we detail a powerful new WaveNet-style autoencoder model
that conditions an autoregressive decoder on temporal codes learned from the
raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality
dataset of musical notes that is an order of magnitude larger than comparable
public datasets. Using NSynth, we demonstrate improved qualitative and
quantitative performance of the WaveNet autoencoder over a well-tuned spectral
autoencoder baseline. Finally, we show that the model learns a manifold of
embeddings that allows for morphing between instruments, meaningfully
interpolating in timbre to create new types of sounds that are realistic and
expressive.
-
This paper introduces WaveNet, a deep neural network for generating raw audio
waveforms. The model is fully probabilistic and autoregressive, with the
predictive distribution for each audio sample conditioned on all previous ones;
nonetheless we show that it can be efficiently trained on data with tens of
thousands of samples per second of audio. When applied to text-to-speech, it
yields state-of-the-art performance, with human listeners rating it as
significantly more natural sounding than the best parametric and concatenative
systems for both English and Mandarin. A single WaveNet can capture the
characteristics of many different speakers with equal fidelity, and can switch
between them by conditioning on the speaker identity. When trained to model
music, we find that it generates novel and often highly realistic musical
fragments. We also show that it can be employed as a discriminative model,
returning promising results for phoneme recognition.
-
Many classes of images exhibit rotational symmetry. Convolutional neural
networks are sometimes trained using data augmentation to exploit this, but
they are still required to learn the rotation equivariance properties from the
data. Encoding these properties into the network architecture, as we are
already used to doing for translation equivariance by using convolutional
layers, could result in a more efficient use of the parameter budget by
relieving the model from learning them. We introduce four operations which can
be inserted into neural network models as layers, and which can be combined to
make these models partially equivariant to rotations. They also enable
parameter sharing across different orientations. We evaluate the effect of
these architectural modifications on three datasets which exhibit rotational
symmetry and demonstrate improved performance with smaller models.
-
Theano is a Python library that allows to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently. Since
its introduction, it has been one of the most used CPU and GPU mathematical
compilers - especially in the machine learning community - and has shown steady
performance improvements. Theano is being actively and continuously developed
since 2008, multiple frameworks have been built on top of it and it has been
used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview
of the Theano software and its community. Section II presents the principal
features of Theano and how to use them, and compares them with other similar
projects. Section III focuses on recently-introduced functionalities and
improvements. Section IV compares the performance of Theano against Torch7 and
TensorFlow on several machine learning models. Section V discusses current
limitations of Theano and potential ways of improving it.
-
Recent studies have demonstrated the power of recurrent neural networks for
machine translation, image captioning and speech recognition. For the task of
capturing temporal structure in video, however, there still remain numerous
open research questions. Current research suggests using a simple temporal
feature pooling strategy to take into account the temporal aspect of video. We
demonstrate that this method is not sufficient for gesture recognition, where
temporal information is more discriminative compared to general video
classification tasks. We explore deep architectures for gesture recognition in
video and propose a new end-to-end trainable neural network architecture
incorporating temporal convolutions and bidirectional recurrence. Our main
contributions are twofold; first, we show that recurrence is crucial for this
task; second, we show that adding temporal convolutions leads to significant
improvements. We evaluate the different approaches on the Montalbano gesture
recognition dataset, where we achieve state-of-the-art results.
-
Measuring the morphological parameters of galaxies is a key requirement for
studying their formation and evolution. Surveys such as the Sloan Digital Sky
Survey (SDSS) have resulted in the availability of very large collections of
images, which have permitted population-wide analyses of galaxy morphology.
Morphological analysis has traditionally been carried out mostly via visual
inspection by trained experts, which is time-consuming and does not scale to
large ($\gtrsim10^4$) numbers of images.
Although attempts have been made to build automated classification systems,
these have not been able to achieve the desired level of accuracy. The Galaxy
Zoo project successfully applied a crowdsourcing strategy, inviting online
users to classify images by answering a series of questions. Unfortunately,
even this approach does not scale well enough to keep up with the increasing
availability of galaxy images.
We present a deep neural network model for galaxy morphology classification
which exploits translational and rotational symmetry. It was developed in the
context of the Galaxy Challenge, an international competition to build the best
model for morphology classification based on annotated images from the Galaxy
Zoo project.
For images with high agreement among the Galaxy Zoo participants, our model
is able to reproduce their consensus with near-perfect accuracy ($> 99\%$) for
most questions. Confident model predictions are highly accurate, which makes
the model suitable for filtering large collections of images and forwarding
challenging images to experts for manual annotation. This approach greatly
reduces the experts' workload without affecting accuracy. The application of
these algorithms to larger sets of training data will be critical for analysing
results from future surveys such as the LSST.