A dataset for lipreading using sequences of video frames

Human lipreading performance increases for longer words, indicating the importance of features capturing temporal context in an ambiguous communication channel.
Files
100
Size
Format
wav
Duration
Country
Worldwide
Participants
50
Languages
Updated
January 27, 2023

Description

Lipreading is the task of decoding text from the movement of a speaker's mouth. The dataset is based on LipNet, a model trained entirely end-to-end that maps a variable-length sequence of video frames to text using spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss.
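Training with the CTC loss teaches a network to emit one character distribution per video frame, including a special "blank" symbol. The simplest way to turn those per-frame outputs back into text is greedy (best-path) CTC decoding: take the most likely symbol at each frame, collapse consecutive repeats, then drop blanks. The sketch below is illustrative only. The alphabet, the blank index 0, and the helper names are hypothetical assumptions, not the dataset's or LipNet's actual label set, and LipNet itself may use a different decoder such as beam search.

```python
from itertools import groupby
from typing import List, Sequence

# Hypothetical label set (assumption): index 0 is the CTC blank,
# index 1 is a space, indices 2..27 are the letters a..z.
BLANK = 0
ALPHABET = ["<blank>", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]


def ctc_greedy_decode(
    frame_probs: Sequence[Sequence[float]],
    alphabet: List[str] = ALPHABET,
    blank: int = BLANK,
) -> str:
    """Best-path CTC decoding of per-frame class probabilities."""
    # 1. Most likely class at every video frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Collapse runs of the same class into a single occurrence.
    collapsed = [k for k, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal symbols keeps both (e.g. "ll").
    return "".join(alphabet[k] for k in collapsed if k != blank)


def one_hot(indices: Sequence[int], size: int) -> List[List[float]]:
    """Toy stand-in for model output: one certain prediction per frame."""
    return [[1.0 if i == k else 0.0 for i in range(size)] for k in indices]


# Frames for "hello": h h e <blank> l l <blank> l o
frames = one_hot([9, 9, 6, 0, 13, 13, 0, 13, 16], len(ALPHABET))
print(ctc_greedy_decode(frames))  # prints "hello"
```

Note that without the blank between the two runs of "l", the repeats would collapse into a single letter; this is why CTC needs the blank symbol to represent doubled characters.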

Version Info

Version:
Last updated:
Owner:

Dataset Technical Specification

Number of files:
100
Total dataset size:
Duration:
Format:
wav
Sample rate:
Resolution:

Dataset Demographics

📍 Country:
Worldwide
🧍 Gender:
50% male / 50% female
📅 Age:
18-55 years
👥 Number of participants:
50

🛡️ Consent & Compliance