• This paper introduces ESPnet, a new open source platform for end-to-end speech processing. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR) and adopts two widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper describes the main architecture of this software platform, several important functionalities that differentiate ESPnet from other open source ASR toolkits, and experimental results on major ASR benchmarks.
  • We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture uses two separate encoders, one for acoustic input and another for symbolic input, which share the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it supports multi-modal (acoustic and symbolic) input. The MMDA architecture attempts to eliminate the need for an external language model (LM) by enabling seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER) and achieves an 8-10% relative word error rate (WER) improvement on the WSJ data set.
  • We describe the system our team used during NIST's LoReHLT (Low Resource Human Language Technologies) 2017 Evaluations, which evaluated document topic classification. We present a language-agnostic approach combining universal acoustic modeling, evaluation-language-to-English machine translation (MT), and an English-language topic classifier. This combination requires no transcribed speech in the given evaluation language, nor even in a related language. We also examine the benefits of system adaptation from various collected resources. The two evaluation languages (incident languages, in LORELEI terminology) were Tigrinya (IL5) and Oromo (IL6), and our system performed well on both.
  • Modern topic identification (topic ID) systems for speech use automatic speech recognition (ASR) to produce speech transcripts and perform supervised classification on these ASR outputs. However, under resource-limited conditions, the manually transcribed speech required to develop standard ASR systems can be severely limited or unavailable. In this paper, we investigate alternative unsupervised approaches to obtaining tokenizations of speech in terms of a vocabulary of automatically discovered word-like or phoneme-like units, without depending on the supervised training of ASR systems. Moreover, using automatic phoneme-like tokenizations, we demonstrate that a convolutional neural network-based framework for learning spoken document representations provides competitive performance compared to a standard bag-of-words representation, as evidenced by comprehensive topic ID evaluations on both single-label and multi-label classification tasks.
  • We report the discovery of seven new, very bright gravitational lens systems from our ongoing gravitational lens search, the Sloan Bright Arcs Survey (SBAS). Two of the systems are confirmed to have high source redshifts, z = 2.19 and z = 2.94. Three other systems lie at intermediate redshift, with z = 1.33, 1.82, and 1.93, and two systems are at low redshift, z = 0.66 and 0.86. The lensed source galaxies in all of these systems are bright, with i-band magnitudes ranging from 19.73 to 22.06. We present the spectrum of each of the source galaxies in these systems along with estimates of the Einstein radius for each system. The foreground lens in most systems is identified by a red-sequence-based cluster finder as a galaxy group; one system is identified as a moderately rich cluster. In total, the SBAS has now discovered 19 strong lens systems in the SDSS imaging data, 8 of which are among the highest surface brightness z ≃ 2-3 galaxies known.