A dataset for lipreading using sequences of video frames

Published:
January 27, 2023

Lipreading is the task of decoding text from the movement of a speaker’s mouth. This dataset is based on LipNet, a model that maps a variable-length sequence of video frames to text. LipNet combines spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss, and is trained entirely end-to-end.
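To illustrate how a CTC-trained model can turn a variable-length sequence of per-frame predictions into text, here is a minimal sketch of greedy CTC decoding. It is not LipNet's actual code. The function name `ctc_greedy_decode`, the blank symbol `_`, and the sample frame labels are assumptions made for illustration only.

```python
from itertools import groupby

# Hypothetical blank symbol; real CTC implementations usually reserve an integer index.
BLANK = "_"


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into text.

    Greedy CTC decoding has two steps:
    1. merge runs of identical consecutive labels into one label;
    2. drop every blank label.
    A blank between two identical labels keeps them as separate characters.
    """
    # Step 1: one label per run of identical consecutive predictions.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks and join the remaining labels.
    return "".join(label for label in collapsed if label != blank)


# Illustrative per-frame predictions (one label per video frame).
short_clip = list("bb_iii_nn__")
long_clip = list("h_ee_ll_lloo")

print(ctc_greedy_decode(short_clip))
print(ctc_greedy_decode(long_clip))
```

The two clips have different numbers of frames, yet each decodes to a single word, which is the property that lets a frame-level model be trained on sentence-level transcripts without frame-by-frame alignment.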

Dataset Technical Specification

Number of files:
100
Total dataset size:
Duration:
Format:
wav
Sample rate:
Resolution:

Dataset Demographics

Country:
Worldwide
Gender:
50% male, 50% female
Age:
18-55
Number of participants:
50