-
There is growing interest in models that can learn from unlabelled speech
paired with visual context. This setting is relevant for low-resource speech
processing, robotics, and human language acquisition research. Here we study
how a visually grounded speech model, trained on images of scenes paired with
spoken captions, captures aspects of semantics. We use an external image tagger
to generate soft text labels from images, which serve as targets for a neural
model that maps untranscribed speech to (semantic) keyword labels. We introduce
a newly collected data set of human semantic relevance judgements and an
associated task, semantic speech retrieval, where the goal is to search for
spoken utterances that are semantically relevant to a given text query. Without
seeing any text, the model trained on parallel speech and images achieves a
precision of almost 60% on its top ten semantic retrievals. Compared to a
supervised model trained on transcriptions, our model matches human judgements
better by some measures, especially in retrieving non-verbatim semantic
matches. We perform an extensive analysis of the model and its resulting
representations.
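
For readers unfamiliar with the metric, the "precision on its top ten semantic retrievals" can be sketched as follows. This is a minimal illustration of precision at k under the usual definition, not the evaluation code used in this work; the utterance ids and relevance judgements below are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved utterances that are judged relevant.

    Divides by k (the usual convention), so a ranking shorter than k
    is penalised for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


# Hypothetical utterance ids, ranked by model score for one text query.
ranked = ["u7", "u2", "u9", "u4", "u1", "u3", "u8", "u5", "u6", "u0"]
# Hypothetical set of utterances judged semantically relevant to the query.
relevant = {"u7", "u2", "u4", "u1", "u8", "u6"}

p10 = precision_at_k(ranked, relevant)
```

Here six of the ten retrieved utterances are relevant, so `p10` is 0.6.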
-
  In the music domain, feature learning has been conducted mainly in two ways:
unsupervised learning based on sparse representations or supervised learning by
semantic labels such as music genre. However, finding discriminative features
in an unsupervised way is challenging and supervised feature learning using
semantic labels may involve noisy or expensive annotation. In this paper, we
present a supervised feature learning approach using artist labels annotated in
every single track as objective meta data. We propose two deep convolutional
neural networks (DCNN) to learn the deep artist features. One is a plain DCNN
trained with all of the artist labels simultaneously, and the other is a Siamese
DCNN trained with a subset of the artist labels based on the artist identity.
We apply the trained models to music classification and retrieval tasks in
transfer learning settings. The results show that our approach is comparable to
previous state-of-the-art methods, indicating that the proposed approach
captures general music audio features as well as the models learned with
semantic labels. Also, we discuss the advantages and disadvantages of the two
models.
-
Despite noise suppression being a mature area in signal processing, it
remains highly dependent on fine tuning of estimator algorithms and parameters.
In this paper, we demonstrate a hybrid DSP/deep learning approach to noise
suppression. A deep neural network with four hidden layers is used to estimate
ideal critical band gains, while a more traditional pitch filter attenuates
noise between pitch harmonics. The approach achieves significantly higher
quality than a traditional minimum mean squared error spectral estimator, while
keeping the complexity low enough for real-time operation at 48 kHz on a
low-power processor.
-
  End-to-end models have become increasingly popular in Mandarin speech
recognition and have achieved strong performance.
  Mandarin is a tonal language, which distinguishes it from English and requires
special treatment of the acoustic modeling units. Several kinds of modeling
units have been used for Mandarin, such as phonemes, syllables and Chinese
characters.
  In this work, we explore two major end-to-end models for Mandarin speech
recognition: the connectionist temporal classification (CTC) model and the
attention-based encoder-decoder model. We compare the performance of three
modeling units of different scales: context-dependent phonemes (CDP), syllables
with tone, and Chinese characters.
  We find that all types of modeling units achieve similar character error
rates (CER) with the CTC model, and that the Chinese character attention model
performs better than the syllable attention model. Furthermore, we find that
the Chinese character is a reasonable unit for Mandarin speech recognition. On
the DidiCallcenter task, the Chinese character attention model achieves a CER
of 5.68% and the CTC model a CER of 7.29%; on the DidiReading task, the CERs
are 4.89% and 5.79%, respectively. Moreover, the attention model outperforms
the CTC model on both datasets.
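
As a reminder of how the reported metric is usually defined, the sketch below computes a character error rate as the Levenshtein distance between reference and hypothesis, divided by the reference length. It illustrates the standard definition only; the scoring tool actually used is not described here, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one substituted character and one deleted character.
reference = "今天天气很好"
hypothesis = "今天天汽好"
example_cer = cer(reference, hypothesis)
```

With two errors against a six-character reference, `example_cer` is 2/6.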
-
Sound event detection systems typically consist of two stages: extracting
hand-crafted features from the raw audio waveform, and learning a mapping
between these features and the target sound events using a classifier.
Recently, the focus of sound event detection research has been mostly shifted
to the latter stage using standard features such as mel spectrogram as the
input for classifiers such as deep neural networks. In this work, we utilize an
end-to-end approach and propose to combine these two stages in a single deep
neural network classifier. The feature extraction over the raw waveform is
conducted by a feedforward layer block, whose parameters are initialized to
extract the time-frequency representations. The feature extraction parameters
are updated during training, resulting in a representation that is optimized
for the specific task. This feature extraction block is followed by (and
jointly trained with) a convolutional recurrent network, which has recently
given state-of-the-art results in many sound recognition tasks. The proposed
system does not outperform a convolutional recurrent network with fixed
hand-crafted features. The final magnitude spectrum characteristics of the
feature extraction block parameters indicate that the most relevant information
for the given task is contained in the 0-3 kHz frequency range, and this is also
supported by the empirical results on the SED performance.
-
  Traditional intelligent fault diagnosis methods for rolling bearings work well only
under a common assumption that the labeled training data (source domain) and
unlabeled testing data (target domain) are drawn from the same distribution.
However, in many real-world applications, this assumption does not hold,
especially when the working condition varies. In this paper, a new adversarial
adaptive 1-D CNN called A2CNN is proposed to address this problem. A2CNN
consists of four parts, namely, a source feature extractor, a target feature
extractor, a label classifier and a domain discriminator. The layers between
the source and target feature extractor are partially untied during the
training stage to take both training efficiency and domain adaptation into
consideration. Experiments show that A2CNN has strong fault-discriminative and
domain-invariant capacity, and therefore can achieve high accuracy under
different working conditions. We also visualize the learned features and the
networks to explore the reasons behind the high performance of our proposed
model.
-
The computer vision literature shows that randomly weighted neural networks
perform reasonably as feature extractors. Following this idea, we study how
non-trained (randomly weighted) convolutional neural networks perform as
feature extractors for (music) audio classification tasks. We use features
extracted from the embeddings of deep architectures as input to a classifier,
with the goal of comparing classification accuracies when using different
randomly weighted architectures. By following this methodology, we run a
comprehensive evaluation of the current deep architectures for audio
classification, and provide evidence that the architectures alone are an
important piece for resolving (music) audio problems using deep neural
networks.
-
Designing a spoken language understanding system for command-and-control
applications can be challenging because of a wide variety of domains and users
or because of a lack of training data. In this paper we discuss a system that
learns from scratch from user demonstrations. This method has the advantage
that the same system can be used for many domains and users without
modifications and that no training data is required prior to deployment. The
user is required to train the system, so for a user friendly experience it is
crucial to minimize the required amount of data. In this paper we investigate
whether a capsule network can make efficient use of the limited amount of
available training data. We compare the proposed model to an approach based on
Non-negative Matrix Factorisation, which is the state of the art in this setting,
and to another deep learning approach that was recently introduced for end-to-end
spoken language understanding. We show that the proposed model outperforms the
baseline models for three command-and-control applications: controlling a small
robot, a vocally guided card game and a home automation task.
-
  There is debate over whether phoneme or viseme units are the most effective for a
lipreading system. Some studies use phoneme units even though phonemes describe
unique short sounds; other studies tried to improve lipreading accuracy by
focusing on visemes with varying results. We compare the performance of a
lipreading system by modeling visual speech using either 13 viseme or 38
phoneme units. We report the accuracy of our system at both word and unit
levels. The evaluation task is large vocabulary continuous speech using the
TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs
and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We
use DCT and Eigenlips features as representations of the mouth ROI image. The
phoneme lipreading system outperforms the viseme-based system in word
accuracy. However, the phoneme system achieved lower accuracy at the unit level,
which shows the importance of the dictionary for decoding classification
outputs into words.
-
Visemes are the visual equivalent of phonemes. Although not precisely
defined, a working definition of a viseme is "a set of phonemes which have
identical appearance on the lips". Therefore a phoneme falls into one viseme
class but a viseme may represent many phonemes: a many-to-one mapping. This
mapping introduces ambiguity between phonemes when using viseme classifiers.
Not only is this ambiguity damaging to the performance of audio-visual
classifiers operating on real expressive speech, there is also considerable
choice between possible mappings. In this paper we explore the issue of this
choice of viseme-to-phoneme map. We show that there is a definite difference in
performance between viseme-to-phoneme mappings and explore why some maps appear
to work better than others. We also devise a new algorithm for constructing
phoneme-to-viseme mappings from labeled speech data. These new visemes, `Bear'
visemes, are shown to perform better than previously known units.
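
The many-to-one relationship described above can be made concrete with a toy lookup table. The map below is purely hypothetical (it is not one of the mappings compared in this work, and the class names are invented); it only shows how distinct phoneme strings collapse onto one viseme string.

```python
from collections import defaultdict

# Hypothetical phoneme-to-viseme map; class names are invented for illustration.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar",
    "ae": "V_open",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def viseme_to_phonemes(mapping):
    """Invert the map: each viseme lists every phoneme it may stand for."""
    inverse = defaultdict(list)
    for phoneme, viseme in mapping.items():
        inverse[viseme].append(phoneme)
    return dict(inverse)


pat = to_visemes(["p", "ae", "t"])
bad = to_visemes(["b", "ae", "d"])
inverse = viseme_to_phonemes(PHONEME_TO_VISEME)
```

Under this toy map "pat" and "bad" produce identical viseme sequences, which is exactly the ambiguity a viseme classifier hands to the decoder.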
-
  Visual lip gestures observed whilst lipreading have a few working
definitions; the two most common are `the visual equivalent of a phoneme' and
`phonemes which are indistinguishable on the lips'. To date there is no formal
definition, in part because we have not yet established a two-way
relationship or mapping between visemes and phonemes. Some evidence suggests
that visual speech is highly dependent upon the speaker. So here, we use a
phoneme-clustering method to form new phoneme-to-viseme maps for both
individual and multiple speakers. We test these phoneme-to-viseme maps to
examine how similarly speakers talk visually, and we use signed rank tests to
measure the distance between individuals. We conclude that, broadly speaking,
speakers have the same repertoire of mouth gestures; where they differ is in
the use of the gestures.
-
The fundamental frequency (F0) contour of speech is a key aspect to represent
speech prosody that finds use in speech and spoken language analysis such as
voice conversion and speech synthesis as well as speaker and language
identification. This work proposes new methods to estimate the F0 contour of
speech using deep neural networks (DNNs) and recurrent neural networks (RNNs).
They are trained using supervised learning with the ground truth of F0
contours. The latest prior research addresses this problem first as a
frame-by-frame classification problem followed by sequence tracking using a deep
neural network hidden Markov model (DNN-HMM) hybrid architecture. This study,
however, tackles the problem as a regression problem instead, in order to
obtain F0 contours with higher frequency resolution from clean and noisy
speech. Experiments using the PTDB-TUG corpus contaminated with additive noise
(NOISEX-92) show the proposed method improves gross pitch error (GPE) by more
than 25 % at signal-to-noise ratios (SNRs) between -10 dB and +10 dB as
compared with one of the most noise-robust F0 trackers, PEFAC. Furthermore, the
performance on fine pitch error (FPE) is improved by approximately 20 % against
a state-of-the-art DNN-HMM-based approach.
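
The two error measures can be illustrated with a short sketch. It assumes one common convention that is not stated in the abstract: a frame counts as a gross error when the estimate deviates from the reference by more than 20 %, and the fine pitch error is the standard deviation (in Hz) of the remaining errors. Unvoiced frames are marked with an F0 of 0, and all values below are invented.

```python
import statistics


def voiced_pairs(ref_f0, est_f0):
    """Keep frames that both tracks mark as voiced (F0 > 0)."""
    return [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]


def gross_pitch_error(ref_f0, est_f0, threshold=0.2):
    """Share of voiced frames whose relative error exceeds the threshold."""
    pairs = voiced_pairs(ref_f0, est_f0)
    if not pairs:
        return 0.0
    gross = sum(1 for r, e in pairs if abs(e - r) > threshold * r)
    return gross / len(pairs)


def fine_pitch_error(ref_f0, est_f0, threshold=0.2):
    """Standard deviation of the error over frames without a gross error."""
    errors = [e - r for r, e in voiced_pairs(ref_f0, est_f0)
              if abs(e - r) <= threshold * r]
    if not errors:
        return 0.0
    return statistics.pstdev(errors)


# Invented F0 tracks in Hz; 0.0 marks an unvoiced frame.
ref = [100.0, 200.0, 0.0, 150.0, 120.0]
est = [105.0, 260.0, 110.0, 0.0, 118.0]
gpe = gross_pitch_error(ref, est)
fpe = fine_pitch_error(ref, est)
```

Three frames are voiced in both tracks; one of them (200 Hz against 260 Hz) is a gross error, so `gpe` is 1/3, and the remaining errors of +5 Hz and -2 Hz give `fpe` of 3.5 Hz.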
-
  Children's speech recognition is challenging mainly due to the inherent high
variability in children's physical and articulatory characteristics and
expressions. This variability manifests in both acoustic constructs and
linguistic usage due to the rapidly changing developmental stage in children's
lives. Part of the challenge is due to the lack of large amounts of available
children's speech data for efficient modeling. This work attempts to address the
key challenges using transfer learning from adult models to children's models
in a Deep Neural Network (DNN) framework for the children's Automatic Speech
Recognition (ASR) task, evaluating on multiple children's speech corpora with a
large vocabulary. The paper presents a systematic and extensive analysis of
the proposed transfer learning technique considering the key factors affecting
children's speech recognition from prior literature. Evaluations are presented
on (i) comparisons of earlier GMM-HMM and the newer DNN models, (ii)
effectiveness of standard adaptation techniques versus transfer learning, and
(iii) various adaptation configurations in tackling the variabilities present
in children's speech, in terms of (a) acoustic spectral variability, and (b)
pronunciation variability and linguistic constraints. Our analysis spans
(i) the number of DNN model parameters (for adaptation), (ii) the amount of
adaptation data, (iii) the ages of children, and (iv) age-dependent versus
age-independent adaptation. Finally, we provide recommendations on (i) the
favorable strategies over the various analyzed parameters mentioned above, and
(ii) potential future research directions and relevant challenges and problems
persisting in DNN-based ASR for children's speech.
-
Deep neural networks have become an indispensable technique for audio source
separation (ASS). It was recently reported that a variant of CNN architecture
called MMDenseNet was successfully employed to solve the ASS problem of
estimating source amplitudes, and state-of-the-art results were obtained for
the DSD100 dataset. To further enhance MMDenseNet, here we propose a novel
architecture that integrates long short-term memory (LSTM) in multiple scales
with skip connections to efficiently model long-term structures within an audio
context. The experimental results show that the proposed method outperforms
MMDenseNet, LSTM and a blend of the two networks. The number of parameters and
processing time of the proposed model are significantly less than those for
simple blending. Furthermore, the proposed method yields better results than
those obtained using ideal binary masks for a singing voice separation task.
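
For context on the last comparison, an ideal binary mask is an oracle that keeps each time-frequency bin only when the target source dominates it. The sketch below assumes the simplest form of that rule (target magnitude strictly larger than the interference, i.e. a 0 dB local criterion) and uses tiny invented magnitude spectrograms; it is not the evaluation code used here.

```python
def ideal_binary_mask(target_mag, interference_mag):
    """1.0 where the target is stronger than the interference, else 0.0."""
    return [
        [1.0 if t > i else 0.0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_mag, interference_mag)
    ]


def apply_mask(mask, mixture_mag):
    """Element-wise product of a mask and a mixture magnitude spectrogram."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]


# Invented 2x2 magnitude spectrograms (rows: time frames, columns: bins).
vocals = [[3.0, 1.0], [0.5, 2.0]]
accompaniment = [[1.0, 2.0], [1.0, 1.0]]
mixture = [[4.0, 3.0], [1.5, 3.0]]  # crude magnitude sum, illustration only

mask = ideal_binary_mask(vocals, accompaniment)
estimate = apply_mask(mask, mixture)
```

Because the mask needs the true sources, it is only available as a reference in experiments, which is why beating it is notable.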
-
In this paper, we present a machine-learning approach to pitch correction for
voice in a karaoke setting, where the vocals and accompaniment are on separate
tracks and time-aligned. The network takes as input the time-frequency
representation of the two tracks and predicts the amount of pitch-shifting in
cents required to make the voice sound in-tune with the accompaniment. It is
trained on examples of semi-professional singing. The proposed approach differs
from existing real-time pitch correction methods by replacing pitch tracking
and mapping to a discrete set of notes---for example, the twelve classes of the
equal-tempered scale---with learning a correction that is continuous both in
frequency and in time directly from the harmonics of the vocal and
accompaniment tracks. A Recurrent Neural Network (RNN) model provides a
correction that takes context into account, preserving expressive pitch bending
and vibrato. This method can be extended into unsupervised pitch correction of
a vocal performance---popularly referred to as autotuning.
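
The unit of the predicted correction can be illustrated with the standard relation between cents and frequency ratio (1200 cents per octave). The helper names below are hypothetical and the sketch covers only the unit conversion, not the network.

```python
import math


def cents_between(f_sung_hz, f_target_hz):
    """Shift in cents needed to move f_sung_hz onto f_target_hz."""
    return 1200.0 * math.log2(f_target_hz / f_sung_hz)


def shift_by_cents(f_hz, cents):
    """Frequency obtained after shifting f_hz by the given number of cents."""
    return f_hz * 2.0 ** (cents / 1200.0)


octave_up = cents_between(440.0, 880.0)          # one octave
flat_note = shift_by_cents(440.0, -30.0)         # a note sung 30 cents flat
needed = cents_between(flat_note, 440.0)         # correction back to 440 Hz
```

A continuous output in cents lets the correction be a fraction of a semitone (100 cents) rather than a snap to the nearest note.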
-
The automated recognition of music genres from audio information is a
challenging problem, as genre labels are subjective and noisy. Artist labels
are less subjective and less noisy, while certain artists may relate more
strongly to certain genres. At the same time, at prediction time, it is not
guaranteed that artist labels are available for a given audio segment.
Therefore, in this work, we propose to apply the transfer learning framework,
learning artist-related information which will be used at inference time for
genre classification. We consider different types of artist-related
information, expressed through artist group factors, which will allow for more
efficient learning and stronger robustness to potential label noise.
Furthermore, we investigate how to achieve the highest validation accuracy on
the given FMA dataset, by experimenting with various kinds of transfer methods,
including single-task transfer, multi-task transfer and finally multi-task
learning.
-
The recently proposed relaxed binaural beamforming (RBB) optimization problem
provides a flexible trade-off between noise suppression and binaural-cue
preservation of the sound sources in the acoustic scene. It minimizes the
output noise power, under the constraints which guarantee that the target
remains unchanged after processing and the binaural-cue distortions of the
acoustic sources will be less than a user-defined threshold. However, the RBB
problem is a computationally demanding non-convex optimization problem. The
only existing suboptimal method which approximately solves the RBB is a
successive convex optimization (SCO) method, which typically requires solving
multiple convex optimization problems per frequency bin, in order to converge.
Convergence is achieved when all constraints of the RBB optimization problem
are satisfied. In this paper, we propose a semi-definite convex relaxation
(SDCR) of the RBB optimization problem. The proposed suboptimal SDCR method
solves a single convex optimization problem per frequency bin, resulting in a
much lower computational complexity than the SCO method. Unlike the SCO method,
the SDCR method does not guarantee user-controlled upper-bounded binaural-cue
distortions. To tackle this problem we also propose a suboptimal hybrid method
which combines the SDCR and SCO methods. Instrumental measures combined with a
listening test show that the SDCR and hybrid methods achieve significantly
lower computational complexity than the SCO method, and in most cases a better
trade-off between predicted intelligibility and binaural-cue preservation than
the SCO method.
-
  The front-end factor analysis (FEFA), an extension of probabilistic principal
component analysis (PPCA) tailored to be used with Gaussian mixture models (GMMs), is
currently the prevalent approach to extract compact utterance-level features
(i-vectors) for automatic speaker verification (ASV) systems. Little research
has been conducted comparing FEFA to the conventional PPCA applied to maximum a
posteriori (MAP) adapted GMM supervectors. We study several alternative
methods, including PPCA, factor analysis (FA), and two supervised approaches,
supervised PPCA (SPPCA) and the recently proposed probabilistic partial least
squares (PPLS), to compress MAP-adapted GMM supervectors. The resulting
i-vectors are used in ASV tasks with a probabilistic linear discriminant
analysis (PLDA) back-end. We experiment on two different datasets, on the
telephone condition of NIST SRE 2010 and on the recent VoxCeleb corpus
collected from YouTube videos containing celebrity interviews recorded in
various acoustical and technical conditions. The results suggest that, in terms
of ASV accuracy, the supervector compression approaches are on a par with FEFA.
The supervised approaches did not result in improved performance. In comparison
to FEFA, we obtained more than a hundred-fold (100x) speedup in the total
variability model (TVM) training using the PPCA and FA supervector compression
approaches.
-
Reduction of unwanted environmental noises is an important feature of today's
hearing aids (HA), which is why noise reduction is nowadays included in almost
every commercially available device. The majority of these algorithms, however,
are restricted to the reduction of stationary noises. In this work, we propose a
denoising approach based on a three hidden layer fully connected deep learning
network that aims to predict a Wiener filtering gain with an asymmetric input
context, enabling real-time applications with high constraints on signal delay.
The approach employs a hearing instrument-grade filter bank and complies
with typical hearing aid demands, such as low latency and on-line processing.
It can further be well integrated with other algorithms in an existing HA
signal processing chain. We show on a database of real-world noise signals
that our algorithm outperforms a state-of-the-art baseline approach,
both in objective metrics and in subject tests.
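
The quantity the network predicts can be illustrated with the textbook Wiener gain, which for a given band is the ratio of speech power to total (speech plus noise) power. The sketch assumes those powers are known per band, as they would be when building training targets from separately available clean speech and noise; the function names and numbers are hypothetical.

```python
def wiener_gain(speech_power, noise_power):
    """Textbook Wiener gain S / (S + N) for one frequency band."""
    total = speech_power + noise_power
    if total <= 0.0:
        return 0.0  # silent band: nothing to pass through
    return speech_power / total


def gains_per_band(speech_powers, noise_powers):
    """Wiener gain for every band of one frame."""
    return [wiener_gain(s, n) for s, n in zip(speech_powers, noise_powers)]


# Invented band powers for a single frame.
band_gains = gains_per_band([1.0, 3.0, 0.0], [1.0, 1.0, 2.0])
```

The gain lies between 0 (noise only) and 1 (speech only); multiplying each noisy band by its gain attenuates noise-dominated bands.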
-
  We propose a novel unsupervised singing voice detection method which uses a
single-channel Blind Audio Source Separation (BASS) algorithm as a preliminary
step. To reach this goal, we investigate three promising BASS approaches which
operate through a morphological filtering of the analyzed mixture spectrogram.
The contributions of this paper are manifold. First, the investigated BASS
methods are restated in a common formalism and we investigate their
respective hyperparameters by numerical simulations. Second, we propose an
extension of the KAM method, together with a novel training algorithm
that computes a source-specific kernel from a given isolated source signal.
Third, the BASS methods are compared in terms of source separation
accuracy and in terms of singing voice detection accuracy when they are used in
our new singing voice detection framework. Finally, we carry out an exhaustive
singing voice detection evaluation in which we compare both supervised and
unsupervised singing voice detection methods. Our comparison explores different
combinations of the proposed BASS methods with new features, such as the newly
proposed KAM features and the scattering transform, through a machine learning
framework, and also considers convolutional neural network methods.
-
This paper describes audEERING's submissions as well as additional
evaluations for the One-Minute-Gradual (OMG) emotion recognition challenge. We
provide the results for audio and video processing on subject (in)dependent
evaluations. On the provided Development set, we achieved 0.343 Concordance
Correlation Coefficient (CCC) for arousal (from audio) and 0.401 for valence
(from video).
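
For reference, the Concordance Correlation Coefficient combines correlation with a penalty for differences in mean and scale. The sketch below follows the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics; the example values are invented and this is not the challenge's scoring script.

```python
def ccc(x, y):
    """Concordance Correlation Coefficient of two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2.0 * cov / denom


gold = [1.0, 2.0, 3.0]
perfect = ccc(gold, [1.0, 2.0, 3.0])   # identical predictions
shifted = ccc(gold, [2.0, 3.0, 4.0])   # perfectly correlated but biased by +1
```

The shifted predictions have a Pearson correlation of 1 but a CCC of only 4/7, which shows why CCC is stricter than plain correlation.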
-
The performance of speaker-related systems usually degrades heavily in
practical applications largely due to the presence of background noise. To
improve the robustness of such systems in unknown noisy environments, this
paper proposes a simple pre-processing method called Noise Invariant Frame
Selection (NIFS). Based on several noisy constraints, it selects noise
invariant frames from utterances to represent speakers. Experiments conducted
on the TIMIT database showed that the NIFS can significantly improve the
performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal
Background Model (GMM-UBM) and i-vector-based speaker verification systems in
different unknown noisy environments with different SNRs, in comparison to
their baselines. Meanwhile, the proposed NIFS-based speaker verification
systems achieve similar performance when we change the constraints
(hyper-parameters) or features, which indicates that it is robust and easy to
reproduce. Since NIFS is designed as a general algorithm, it could be further
applied to other similar tasks.
-
Linear Discriminant Analysis (LDA) has been used as a standard
post-processing procedure in many state-of-the-art speaker recognition tasks.
Through maximizing the inter-speaker difference and minimizing the
intra-speaker variation, LDA projects i-vectors to a lower-dimensional and more
discriminative sub-space. In this paper, we propose a neural network based
compensation scheme (termed deep discriminant analysis, DDA) for i-vector
based speaker recognition, which shares the spirit of LDA. Optimized against
softmax loss and center loss at the same time, the proposed method learns a
more compact and discriminative embedding space. Whereas LDA assumes
Gaussian-distributed data and learns a linear projection, the
proposed method makes no assumptions about the data and can learn a non-linear
projection function. Experiments are carried out on a short-duration
text-independent dataset based on the SRE Corpus; a noticeable performance
improvement can be observed over the standard LDA and PLDA methods.
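
The center loss term mentioned above can be sketched in its commonly used form: half the summed squared distance between each embedding and the center of its speaker class. How it is weighted against the softmax loss and how the centers are updated are not specified in this abstract, so the sketch takes the centers as given; all values are invented.

```python
def center_loss(embeddings, labels, centers):
    """Half the summed squared distance of each embedding to its class center."""
    total = 0.0
    for vec, label in zip(embeddings, labels):
        center = centers[label]
        total += sum((v - c) ** 2 for v, c in zip(vec, center))
    return 0.5 * total


# Invented 2-D embeddings for two speakers (labels 0 and 1).
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
centers = {0: [2.0, 0.0], 1: [0.0, 0.0]}  # assumed fixed for this example

loss = center_loss(embeddings, labels, centers)
```

Minimising this term pulls embeddings of the same speaker together, which is the role intra-speaker variation plays in LDA; the softmax loss supplies the inter-speaker separation.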
-
This paper describes the UMONS solution for the OMG-Emotion Challenge. We
explore a context-dependent architecture where the arousal and valence of an
utterance are predicted according to its surrounding context (i.e. the
preceding and following utterances of the video). We report an improvement when
taking into account context for both unimodal and multimodal predictions.
-
We propose an end-to-end model based on convolutional and recurrent neural
networks for speech enhancement. Our model is purely data-driven and does not
make any assumptions about the type or the stationarity of the noise. In
contrast to existing methods that use multilayer perceptrons (MLPs), we employ
both convolutional and recurrent neural network architectures. Thus, our
approach allows us to exploit local structures in both the frequency and
temporal domains. By incorporating prior knowledge of speech signals into the
design of model structures, we build a model that is more data-efficient and
achieves better generalization on both seen and unseen noise. Based on
experiments with synthetic data, we demonstrate that our model outperforms
existing methods, improving PESQ by up to 0.6 on seen noise and 0.64 on unseen
noise.