
VoxCeleb is a multimodal dataset of short talking-face speech clips (each ≥ 3 seconds) extracted from interview videos, designed for text-independent speaker recognition under real-world conditions such as background noise, overlapping speech, laughter, pose variation, and varying lighting. Across its two releases (VoxCeleb1 and VoxCeleb2) it contains 7,000+ speakers, 1M+ utterances, and 2,000+ hours of audio-visual data, making it a go-to benchmark for robustness testing, representation learning, and production-grade speaker embedding evaluation.
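
As a rough illustration of how speaker embeddings are typically evaluated on VoxCeleb-style verification trials, the sketch below scores trial pairs with cosine similarity and reports the equal error rate (EER), the standard metric for this benchmark. The trial data here is random placeholder input, not the official protocol; `cosine_score` and `equal_error_rate` are illustrative helpers, not part of any VoxCeleb toolkit.

```python
import numpy as np
from sklearn.metrics import roc_curve


def cosine_score(e1: np.ndarray, e2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))


def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)


# Placeholder trials: (label, embedding_a, embedding_b), label 1 = same speaker.
# In practice the embeddings come from your model and the pairs from a trial list.
rng = np.random.default_rng(0)
trials = [(int(rng.integers(0, 2)), rng.normal(size=192), rng.normal(size=192))
          for _ in range(1000)]

labels = np.array([t[0] for t in trials])
scores = np.array([cosine_score(t[1], t[2]) for t in trials])
print(f"EER: {equal_error_rate(labels, scores):.2%}")
```

In a real evaluation the trial pairs would come from the official VoxCeleb1 verification list rather than random data, and the embeddings from the speaker model under test.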