-
This paper introduces a new open source platform for end-to-end speech
processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech
recognition (ASR) and adopts two widely used dynamic neural network toolkits,
Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the
Kaldi ASR toolkit style for data processing, feature extraction/format, and
recipes to provide a complete setup for speech recognition and other speech
processing experiments. This paper explains the main architecture of this
software platform, several important functionalities that differentiate
ESPnet from other open source ASR toolkits, and experimental results on major
ASR benchmarks.
-
We present a new end-to-end architecture for automatic speech recognition
(ASR) that can be trained using \emph{symbolic} input in addition to the
traditional acoustic input. This architecture utilizes two separate encoders:
one for acoustic input and another for symbolic input, both sharing the
attention and decoder parameters. We call this architecture a multi-modal data
augmentation network (MMDA), as it can support multi-modal (acoustic and
symbolic) input. The MMDA architecture attempts to eliminate the need for an
external language model (LM) by enabling seamless mixing of large text
datasets with significantly smaller transcribed speech corpora during
training. We study
different ways of transforming large text corpora into a symbolic form suitable
for training our MMDA network. Our best MMDA setup obtains small improvements
in character error rate (CER) and an 8-10\% relative improvement in word error
rate (WER) on the WSJ data set.
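
The parameter sharing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the layer choices (a single linear layer, an embedding lookup, one dot-product attention step), and all dimensions are assumptions made for illustration, and the sketch omits training entirely. It only shows two modality-specific encoders feeding one shared attention and decoder module.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (list of rows)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class AcousticEncoder:
    """Hypothetical encoder: maps feature frames to hidden vectors."""

    def __init__(self, feat_dim, hidden_dim, rng):
        self.weights = make_matrix(hidden_dim, feat_dim, rng)

    def __call__(self, frames):
        return [[math.tanh(v) for v in matvec(self.weights, f)] for f in frames]


class SymbolicEncoder:
    """Hypothetical encoder: maps symbol ids to hidden vectors by lookup."""

    def __init__(self, num_symbols, hidden_dim, rng):
        self.embeddings = make_matrix(num_symbols, hidden_dim, rng)

    def __call__(self, symbol_ids):
        return [list(self.embeddings[i]) for i in symbol_ids]


class SharedAttentionDecoder:
    """One attention step plus output layer, reused for both modalities."""

    def __init__(self, hidden_dim, vocab_size, rng):
        self.query = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.output = make_matrix(vocab_size, hidden_dim, rng)

    def __call__(self, encoded):
        # Dot-product attention scores, normalized with a softmax.
        scores = [sum(q * h for q, h in zip(self.query, vec)) for vec in encoded]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: attention-weighted sum of the encoder outputs.
        context = [
            sum(w * vec[d] for w, vec in zip(weights, encoded))
            for d in range(len(self.query))
        ]
        # Unnormalized scores over the output vocabulary.
        return matvec(self.output, context)


class MMDASketch:
    """Two encoders, one shared attention/decoder (illustrative only)."""

    def __init__(self, feat_dim, num_symbols, hidden_dim, vocab_size, seed=0):
        rng = random.Random(seed)
        self.encoders = {
            "acoustic": AcousticEncoder(feat_dim, hidden_dim, rng),
            "symbolic": SymbolicEncoder(num_symbols, hidden_dim, rng),
        }
        self.decoder = SharedAttentionDecoder(hidden_dim, vocab_size, rng)

    def forward(self, inputs, modality):
        # Only the encoder depends on the modality; the decoder is shared.
        encoder = self.encoders[modality]
        return self.decoder(encoder(inputs))


model = MMDASketch(feat_dim=3, num_symbols=5, hidden_dim=4, vocab_size=6)
speech_scores = model.forward([[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], "acoustic")
text_scores = model.forward([1, 4, 2], "symbolic")
```

Because both calls pass through the same `decoder` object, a batch of text-derived symbolic input would update the same attention and decoder parameters as a batch of transcribed speech, which is the property the abstract relies on for mixing the two data sources.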
-
We describe the system our team used during NIST's LoReHLT (Low Resource
Human Language Technologies) 2017 Evaluations, which evaluated document topic
classification. We present a language-agnostic approach combining universal
acoustic modeling, evaluation-language-to-English machine translation (MT) and
an English-language topic classifier. This combination requires no transcribed
speech in the given evaluation language, nor even in a related language. We
also examine the benefits of system adaptation from various collected
resources. The two evaluation languages (incident languages in LORELEI
terminology) were Tigrinya (IL5) and Oromo (IL6), and our system performed
well on both.
-
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions for obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
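
For reference, the bag-of-words baseline mentioned above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the unit labels (`u1`, `u3`, ...) and the two toy documents are hypothetical stand-ins for automatically discovered unit tokenizations, and the function names are introduced here only for the example.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Collect the sorted set of unit labels seen across all documents."""
    return sorted({unit for doc in tokenized_docs for unit in doc})


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary unit occurs in one document."""
    counts = Counter(doc)
    return [counts[unit] for unit in vocabulary]


# Hypothetical tokenizations; "u3", "u7", ... stand in for discovered units.
docs = [
    ["u3", "u7", "u3", "u1"],
    ["u7", "u7", "u2"],
]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
```

Each spoken document becomes a fixed-length count vector that ignores unit order, which is the information a convolutional model over the unit sequence can additionally exploit.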
-
We report the discovery of seven new, very bright gravitational lens systems
from our ongoing gravitational lens search, the Sloan Bright Arcs Survey
(SBAS). Two of the systems are confirmed to have high source redshifts, z=2.19
and z=2.94. Three other systems lie at intermediate redshift, with z=1.33,
1.82, and 1.93, and two systems are at low redshift, z=0.66 and 0.86. The
lensed source galaxies in all of these systems are bright, with i-band
magnitudes ranging from 19.73 to 22.06. We present the spectrum of each of the
source galaxies in these systems along with estimates of the Einstein radius
for each
system. The foreground lens in most systems is identified by a
red-sequence-based cluster finder as a galaxy group; one system is identified as a
moderately rich cluster. In total, the SBAS has now discovered 19 strong lens
systems in the SDSS imaging data, 8 of which are among the highest surface
brightness z\simeq2-3 galaxies known.