Computer Vision

Off-the-Shelf Datasets

VoxCeleb
VoxCeleb is a large-scale audio-visual dataset of speech extracted from YouTube interview videos. It is widely used to train and benchmark deep speaker recognition models, covering both speaker verification and speaker identification under unconstrained (“in-the-wild”) recording conditions.
Casual Conversations Dataset
Casual Conversations is a large-scale multimodal (video + audio) benchmark dataset built to evaluate and audit the accuracy of computer vision and speech models across age, gender, apparent skin tone, and lighting conditions.
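
As a rough illustration of the kind of audit such a benchmark supports, the sketch below computes a model's accuracy separately for each subgroup and reports the gap between the best and worst group. The record fields (`skin_tone`, `lighting`, `label`, `prediction`), the group values, and the helper `accuracy_by_group` are hypothetical and chosen only for this example; they do not reflect the dataset's actual annotation schema or any official tooling.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group_value: accuracy} for a list of per-sample records.

    Each record is a dict holding a ground-truth "label", a model
    "prediction", and the attribute named by group_key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        correct[group] += int(rec["prediction"] == rec["label"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical per-sample results; field names and values are assumptions.
records = [
    {"skin_tone": "type_2", "lighting": "bright", "label": 1, "prediction": 1},
    {"skin_tone": "type_2", "lighting": "dark", "label": 0, "prediction": 0},
    {"skin_tone": "type_5", "lighting": "bright", "label": 1, "prediction": 1},
    {"skin_tone": "type_5", "lighting": "dark", "label": 1, "prediction": 0},
]

per_tone = accuracy_by_group(records, "skin_tone")
per_lighting = accuracy_by_group(records, "lighting")

# Gap between the best- and worst-served subgroup.
tone_gap = max(per_tone.values()) - min(per_tone.values())

print(per_tone)
print(per_lighting)
print(tone_gap)
```

A large gap between subgroups suggests the model performs unevenly and is worth investigating, although with small groups the per-group estimates can be noisy.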