• Attention-based encoder-decoder architectures, such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
  • Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
  • We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an 'encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a 'decoder', which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
  • We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
  • We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed-forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence-trained context-dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence-trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.
  • We estimate the spatial locations of sources of the observed features in the Fermi-LAT photon spectrum at $E_\gamma=110$ and $E_\gamma=130$ GeV. We determine whether they are consistent with emission from a single source, as would be expected in their interpretation as $\gamma\gamma$ and $\gamma Z$ lines from dark matter annihilation, as well as whether they are consistent with a dark matter halo positioned at the center of the galaxy. We take advantage of the per-photon measured incident angle in reconstructing the line features. In addition, we use a data-driven background model rather than making the assumption of a featureless background. We localize the sources of the features at 110 and 130 GeV. Assuming an Einasto (NFW) density model, we find the 130 GeV line to be offset from the galactic center by 285 (280) pc and the 110 GeV line by 60 (30) pc, with a large relative separation of 220 (240) pc. However, we find the displacement of each source from the galactic center, as well as their relative displacement, to be statistically consistent with a single Einasto or NFW dark matter halo at the center of the galaxy.
  • Experimental analysis of data from particle collisions is typically expressed as statistical limits on a few benchmark models of particular, often historical, interest. The implications of the data for other theoretical models (current or future) may be powerful, but they cannot typically be calculated from the published information, except in the simplest case of a single-bin counting experiment. We present a novel solution to this long-standing problem by expressing the new model as a linear combination of models from published experimental analyses, allowing for the trivial calculation of limits on a nearly arbitrary model. We present tests in simple toy experiments, demonstrate self-consistency by using published results to reproduce other published results on the same spectrum, and provide a reinterpretation of a search for chiral down-type heavy quarks ($b'$) in terms of a search for an exotic heavy quark ($T$) with similar but distinct phenomenology. We find $m_T>419$ GeV at 95% CL, currently the strongest limit if the $T$ quark decays via $T\rightarrow Wb$, $T\rightarrow tZ$ and $T\rightarrow tH$.
  • Limits on an exotic heavy quark $T$ are broadly generalized by considering the full range of $T\rightarrow Wb, th$ or $tZ$ branching ratios. We combine results of specific $T\rightarrow tZ$ and $T\rightarrow Wb$ searches with limits on various combinations of decay modes evaluated by re-interpreting other searches. We find strong bounds across the entire space of branching ratios, ranging from $m_T > 415$ GeV to $m_T > 557$ GeV at 95% confidence level.
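
The "full range" of branching ratios in the last abstract is a two-dimensional space, because the three decay modes are constrained to sum to one. As a minimal sketch of how such a space could be enumerated (the function name, the grid spacing, and the use of a regular grid are assumptions made here for illustration, not the authors' procedure, and no mass limits are computed):

```python
from fractions import Fraction


def branching_ratio_grid(steps):
    """Enumerate (BR_Wb, BR_tZ, BR_tH) triples on a regular grid.

    Hypothetical helper: every triple satisfies
    BR_Wb + BR_tZ + BR_tH = 1 with each ratio in [0, 1].
    Exact fractions avoid floating-point drift in the sum.
    """
    points = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # remainder goes to the third mode
            points.append(
                (Fraction(i, steps), Fraction(j, steps), Fraction(k, steps))
            )
    return points


# A spacing of 0.1 in each branching ratio (an assumed value).
grid = branching_ratio_grid(10)
print(len(grid))
```

Each grid point corresponds to one hypothetical decay scenario; a limit on $m_T$ would then have to be evaluated separately at every point, which this sketch does not attempt.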