Most existing datasets for speaker identification contain samples obtained
under quite constrained conditions, and are usually hand-annotated, hence
limited in size. The goal of this paper is to generate a large-scale
text-independent speaker identification dataset collected 'in the wild'. We
make two contributions. First, we propose a fully automated pipeline based on
computer vision techniques to create the dataset from open-source media. Our
pipeline involves obtaining videos from YouTube; performing active speaker
verification using a two-stream synchronization Convolutional Neural Network
(CNN); and confirming the identity of the speaker using CNN-based facial
recognition. We use this pipeline to curate VoxCeleb, which contains hundreds of
thousands of 'real world' utterances for over 1,000 celebrities. Our second
contribution is to apply and compare various state-of-the-art speaker
identification techniques on our dataset to establish baseline performance. We
show that a CNN-based architecture obtains the best performance for both
identification and verification.
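The last two stages of the pipeline act as successive filters on candidate face tracks: a track is kept only if the face is judged to be speaking and is judged to belong to the target identity. A minimal sketch of that filtering logic follows. The `Track` fields, the score ranges, and both thresholds are hypothetical placeholders, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Track:
    video_id: str
    sync_confidence: float  # hypothetical audio-visual synchronization score
    face_score: float       # hypothetical face-verification score


def keep_track(track, sync_threshold, face_threshold):
    """Keep a track only if the face is speaking AND matches the identity."""
    is_active_speaker = track.sync_confidence >= sync_threshold
    is_target_identity = track.face_score >= face_threshold
    return is_active_speaker and is_target_identity


def filter_tracks(tracks, sync_threshold=0.5, face_threshold=0.5):
    """Return the tracks that pass both checks (thresholds are placeholders)."""
    return [t for t in tracks if keep_track(t, sync_threshold, face_threshold)]


# Toy candidates with made-up scores.
toy_tracks = [
    Track("a", 0.9, 0.8),  # speaking and correct identity
    Track("b", 0.2, 0.9),  # correct identity but not the active speaker
    Track("c", 0.7, 0.1),  # speaking but wrong identity
]
kept = filter_tracks(toy_tracks)
```

With these toy values only the first track passes both checks.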

Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and to
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
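Predicting both magnitude and phase means the two outputs together determine a complex spectrogram. As a minimal sketch, and not the authors' implementation, the two predictions can be combined per time-frequency bin as magnitude times e^(j * phase). The function name `combine_mag_phase` and the toy values are illustrative assumptions.

```python
import cmath
import math


def combine_mag_phase(magnitude, phase):
    """Combine per-bin magnitude and phase (radians) into complex values.

    Both inputs are lists of rows (time frames) of floats with equal shape.
    """
    if len(magnitude) != len(phase):
        raise ValueError("magnitude and phase must have the same number of frames")
    spectrogram = []
    for mag_row, phase_row in zip(magnitude, phase):
        if len(mag_row) != len(phase_row):
            raise ValueError("each frame must have matching bin counts")
        # cmath.rect(r, phi) returns r * (cos(phi) + 1j * sin(phi)).
        spectrogram.append([cmath.rect(m, p) for m, p in zip(mag_row, phase_row)])
    return spectrogram


# Toy example: one frame, three frequency bins (values are made up).
toy_mag = [[1.0, 2.0, 0.5]]
toy_phase = [[0.0, math.pi / 2, math.pi]]
toy_spec = combine_mag_phase(toy_mag, toy_phase)
```

An inverse short-time Fourier transform would then be needed to turn such a spectrogram back into a waveform; that step is not shown here.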

We present a method for generating a video of a talking face. The method
takes as inputs: (i) still images of the target face, and (ii) an audio speech
segment; and outputs a video of the target face lip-synced with the audio. The
method runs in real time and is applicable to faces and audio not seen at
training time.
To achieve this we propose an encoder-decoder CNN model that uses a joint
embedding of the face and audio to generate synthesised talking face video
frames. The model is trained on tens of hours of unlabelled videos.
We also show results of re-dubbing videos using speech from a different
person.

The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences
and in-the-wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available.

The goal of this work is to recognise and localise short temporal signals in
image time series, where strong supervision is not available for training.
To this end we propose an image encoding that concisely represents human
motion in a video sequence in a form that is suitable for learning with a
ConvNet. The encoding reduces the pose information from an image to a single
column, dramatically diminishing the input requirements for the network, but
retaining the essential information for recognition.
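One way to picture this kind of encoding: each frame's pose is reduced to a fixed-length column, and the columns for consecutive frames are placed side by side, so a whole clip becomes one small 2D array. The sketch below assumes the pose is given as a list of (x, y) joint coordinates per frame. That representation, the function names, and the toy values are illustrative assumptions, not the encoding defined by the authors.

```python
def pose_to_column(joints):
    """Flatten one frame's (x, y) joints into a single column (a flat list)."""
    column = []
    for x, y in joints:
        column.extend([float(x), float(y)])
    return column


def encode_clip(frames):
    """Stack per-frame columns side by side.

    Returns a list of rows, where row r holds coordinate r across all frames,
    so the result has shape (2 * num_joints) x num_frames.
    """
    columns = [pose_to_column(joints) for joints in frames]
    if not columns:
        return []
    height = len(columns[0])
    if any(len(c) != height for c in columns):
        raise ValueError("every frame must have the same number of joints")
    # Transpose so that time runs along the horizontal axis.
    return [[col[r] for col in columns] for r in range(height)]


# Toy clip: 3 frames, 2 joints each (coordinates are made up).
toy_frames = [
    [(0, 1), (2, 3)],
    [(4, 5), (6, 7)],
    [(8, 9), (10, 11)],
]
encoded = encode_clip(toy_frames)
```

Here a clip of 3 frames with 2 joints each becomes a 4 x 3 array, which is far smaller than the stack of raw images it summarises.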
The encoding is applied to the task of recognising and localising signed
gestures in British Sign Language (BSL) videos. We demonstrate that using the
proposed encoding, signs as short as 10 frames in duration can be learnt from
clips lasting hundreds of frames using only weak (clip level) supervision and
with considerable label noise.