150+ Audio and Video Open Datasets

Building reliable speech and vision models starts with the right data. Twine AI helps businesses create ethical, custom audio and video datasets that reduce bias and represent real users, but we also know you don’t always need to start from scratch.

To make your model-building easier, we’ve curated 150+ open audio and video datasets you can use for benchmarking, prototyping, and comparison against custom data. From speech corpora and emotional voice datasets to video benchmarks for human activity recognition, you’ll find options across languages, domains, and modalities.

Whatever your requirements for dataset language, file type, number of speakers, or participant demographics, there’s likely an open dataset you can start with. Use this list to explore what’s available, then decide where off-the-shelf data stops and a custom dataset collected with Twine needs to begin.
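
As a rough illustration of shortlisting by those criteria, the sketch below filters a handful of entries transcribed from this list. The `DatasetEntry` fields and the `find_datasets` helper are hypothetical names chosen for this example, not part of any dataset's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetEntry:
    """One catalog entry; field names are illustrative, not an official schema."""
    name: str
    language: str
    filetype: str
    recordings: Optional[int] = None
    participants: Optional[int] = None


# A few entries transcribed from the list on this page.
CATALOG = [
    DatasetEntry("Free Spoken digit dataset", "US English", "WAV", 3000, 6),
    DatasetEntry("TIMIT", "US English", "WAV", 6300, 630),
    DatasetEntry("Arabic Speech Corpus", "Arabic", "WAV", 5439),
    DatasetEntry("AudioMNIST", "US English", "MP3", 30000, 60),
    DatasetEntry("NST Danish ASR Database", "Danish", "WAV", 229992, 616),
]


def find_datasets(catalog, language=None, filetype=None, min_participants=0):
    """Return entries matching every criterion that was supplied."""
    matches = []
    for entry in catalog:
        if language is not None and entry.language != language:
            continue
        if filetype is not None and entry.filetype != filetype:
            continue
        # Entries with no participant count only pass when no minimum is set.
        if (entry.participants or 0) < min_participants:
            continue
        matches.append(entry)
    return matches


if __name__ == "__main__":
    for entry in find_datasets(CATALOG, language="US English",
                               filetype="WAV", min_participants=100):
        print(entry.name)
```

With the five sample entries above, only TIMIT satisfies all three filters. The same pattern extends to any other field you choose to record.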

Simply browse and sign up to gain access.

Need a dataset tailored to your exact use case, industry, or risk profile? Contact us here.

150+ Open Audio and Video Datasets for AI & Machine Learning

Open Audio Datasets for Speech and Sound Recognition

The audio datasets below cover everything from read speech and conversational phone calls to emotional speech, environmental sounds, and music. Use them to prototype speech recognition, speaker ID, emotion detection, or general audio event classification models before you commit to collecting large-scale custom audio datasets.
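
Before prototyping, it can help to confirm that a file's sample rate, channel count, and duration match what your model expects. The sketch below uses Python's standard `wave` module; the `make_test_tone` and `describe_wav` helpers are hypothetical, and the synthetic tone only stands in for a real clip so the example runs without a download. It applies to uncompressed WAV files only, not MP3 or FLAC.

```python
import io
import math
import struct
import wave


def make_test_tone(sample_rate=16000, seconds=1.0, freq_hz=440.0):
    """Build a small mono 16-bit WAV in memory so the example needs no download."""
    n_frames = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_out:
        wav_out.setnchannels(1)
        wav_out.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(frames)
    buffer.seek(0)
    return buffer


def describe_wav(source):
    """Read basic properties from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav_in:
        frames = wav_in.getnframes()
        rate = wav_in.getframerate()
        return {
            "channels": wav_in.getnchannels(),
            "sample_rate_hz": rate,
            "bit_depth": wav_in.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    print(describe_wav(make_test_tone()))
```

To check a downloaded clip, pass its path to `describe_wav` instead of the in-memory buffer.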

Urban Sound 8K dataset

No. Recordings: 8732
File Size: 13.84KB
Filetype: .WAV/.CSV
Language(s): US English
Description: Contains urban sounds from 10 classes, such as air conditioner, dog bark, drilling, siren, and street music.
Click here to access

Mozilla Common Voice

No. Recordings: 75,879
File Size: 63Gb
Filetype: MP3
Language(s): US English
Description: An open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
Click here to access

HiEve

No. Recordings: 1,000,000
Filetype: MP4
Language(s): US English
Description: The largest collection of poses that focuses on very challenging and realistic tasks of human-centric analysis in various crowds & complex events, including subway getting on/off, collision, fighting, and earthquake escape
Click here to access

Voices Obscured in Complex Environmental Settings (VOICES) Dataset

No. Recordings: 3,903
File Size: 1.3Gb
Filetype: MP3
Language(s): US English
Description: A creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.
Click here to access

Free Spoken digit dataset

No. Recordings: 3000
No. Participants: 6
File Size: 10Mb
Filetype: WAV
Language(s): US English
Description: A simple speech dataset consisting of recordings of spoken English digits
Click here to access

The Stereo Human Pose Estimation Dataset

No. Recordings: 630
No. Participants: 26
File Size: 197.8Mb
Filetype: JPEG
Language(s): US English
Description: A dataset of stereo image pairs suited for upper-body stereo human pose estimation.
Click here to access

The Spoken Wikipedia Corpora

No. Recordings: 5,397
No. Participants: 879
File Size: 23Gb
Filetype: MP3
Language(s): English, German, Dutch
Description: This is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia
Click here to access

TED-LIUM

No. Recordings: 1,495
Language(s): US English
Description: 1,495 TED talk audio recordings along with full-text transcriptions of those recordings
Click here to access

Speech Commands Dataset

No. Recordings: 65,000
Language(s): US English
Description: 65,000 one-second-long utterances of 30 short words, by thousands of different people
Click here to access

Persian Consonant Vowel Combination (PCVC) Speech Dataset

No. Recordings: 30,000
No. Participants: 217
Filetype: MAT
Language(s): Persian
Description: This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
Click here to access

Arabic Speech Corpus

No. Recordings: 5439
Filetype: WAV
Language(s): Arabic
Description: Phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level
Click here to access

TIMIT

No. Recordings: 6,300
No. Participants: 630
Filetype: WAV
Language(s): US English
Description: Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences
Click here to access

Mivia Audio Events Dataset

No. Recordings: 6,000
Filetype: WAV
Language(s): US English
Description: 6,000 events of surveillance applications, namely glass breaking, gunshots, and screams
Click here to access

Urban Sound Dataset

No. Recordings: 1,302
Filetype: WAV
Language(s): US English
Description: 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music
Click here to access
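
Start and end times like these make it easy to measure how much labeled audio exists per class. The sketch below assumes a hypothetical in-memory layout of `(start_s, end_s, class_label)` tuples with made-up times; the dataset's actual annotation file format may differ, so adapt the parsing step accordingly.

```python
from collections import defaultdict

# Hypothetical annotations for one recording: (start_s, end_s, class_label).
EVENTS = [
    (0.0, 2.5, "dog_bark"),
    (1.0, 4.0, "siren"),
    (6.0, 7.5, "dog_bark"),
]


def seconds_per_class(events):
    """Sum the annotated duration of each class, in seconds."""
    totals = defaultdict(float)
    for start, end, label in events:
        if end < start:
            raise ValueError(f"event ends before it starts: {(start, end, label)}")
        totals[label] += end - start
    return dict(totals)


if __name__ == "__main__":
    for label, seconds in sorted(seconds_per_class(EVENTS).items()):
        print(f"{label}: {seconds:.1f} s")
```

Overlapping events are counted once per class, so totals across classes can exceed the recording length.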

Clotho Dataset

No. Recordings: 4,981
Filetype: MP3
Language(s): US English
Description: A novel audio captioning dataset consisting of 4,981 audio samples, each with five captions
Click here to access

FSD50K

No. Recordings: 51,197
Filetype: WAV
Language(s): US English
Description: An open dataset of human-labeled sound events containing Freesound clips unequally distributed in 200 classes
Click here to access

Vocal Imitation Set v1.1.3

File Size: 7.6Gb
Filetype: WAV
Language(s): US English
Description: A collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound
Click here to access

Google Audio set

No. Recordings: 2,084,320
Filetype: WAV
Language(s): US English
Description: An ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos
Click here to access

CALLHOME American English Speech

No. Recordings: 120
No. Participants: 240
Language(s): US English
Description: 120 unscripted 30-minute telephone conversations between native speakers of English
Click here to access

LibriSpeech ASR Corpus

No. Recordings: 1,000
Filetype: MP3
Language(s): US English
Description: 1,000 hours of 16kHz read English speech
Click here to access
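
A duration and sample rate are enough for a back-of-the-envelope storage estimate. The sketch below assumes uncompressed 16-bit mono PCM, which is an assumption made for illustration and not a statement about how this corpus is packaged; compressed distributions will be smaller. The `raw_pcm_bytes` helper is a hypothetical name.

```python
def raw_pcm_bytes(hours, sample_rate_hz, bit_depth=16, channels=1):
    """Size of uncompressed PCM audio in bytes, excluding file headers."""
    seconds = hours * 3600
    bytes_per_sample = bit_depth // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


if __name__ == "__main__":
    # 1,000 hours at 16 kHz, assuming 16-bit mono samples.
    size = raw_pcm_bytes(hours=1000, sample_rate_hz=16000)
    print(f"{size / 1e9:.1f} GB")  # decimal gigabytes
```

Under those assumptions the estimate comes to 115.2 GB, a useful upper bound when planning disk space for decoded audio.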

Speech Accent Archive

No. Recordings: 2,140
File Size: 907Mb
Filetype: MP3
Language(s): US English
Description: Parallel English speech samples from 177 countries
Click here to access

Phone Conversation Data Sample

No. Recordings: 1,822
Filetype: WAV
Language(s): Dutch, Japanese, Irish English
Description: Conversations in Dutch, Japanese, and Irish English
Click here to access

Alexa Wake Word Voice Samples

No. Recordings: 24
Filetype: WAV
Language(s): US English
Description: Sample of 24 Alexa wake word recordings in four languages
Click here to access

The LJ Speech Dataset

No. Recordings: 13,100
File Size: 2.6Gb
Filetype: CSV
Language(s): US English
Description: Public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books
Click here to access

AISHELL-2

No. Recordings: 1,000,000
No. Participants: 1,991
Language(s): Mandarin
Description: The largest free speech corpus available for Mandarin ASR research
Click here to access

AEDD

No. Recordings: 500
No. Participants: 5
Language(s): US English
Description: 500 utterances by a diverse group of actors (over 5 actors) simulating various emotions
Click here to access

ANAD

No. Recordings: 1,384
No. Participants: 8
File Size: 2Gb
Filetype: WAV
Language(s): US English
Description: 1384 recordings by multiple speakers; 3 emotions: angry, happy, surprised
Click here to access

AudioMNIST

No. Recordings: 30,000
No. Participants: 60
Filetype: MP3
Language(s): US English
Description: Consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers
Click here to access

BAVED

No. Recordings: 1,935
No. Participants: 61
File Size: 97.8Mb
Filetype: WAV
Language(s): US English
Description: 1,935 recordings by 61 speakers (45 male and 16 female).
Click here to access

CMU-MOSEI

No. Participants: 1,000
Language(s): US English
Description: 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 Emotions (happiness, sadness, anger, fear, disgust, surprise) + Likert scale.
Click here to access

CMU-MOSI

No. Recordings: 2,199
Language(s): US English
Description: 2199 opinion utterances with annotated sentiment; Sentiment annotated between very negative to very positive in seven Likert steps
Click here to access

CMU Wilderness

No. Participants: 699
Filetype: MP3
Language(s): US English
Description: Speech dataset with voice actors of many accents reciting passages from the Bible
Click here to access

CREMA-D

No. Recordings: 7,442
No. Participants: 91
File Size: 163Mb
Filetype: GIT-LFS
Language(s): US English
Description: 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities
Click here to access

DAPS Dataset

No. Recordings: 100
No. Participants: 20
Language(s): US English
Description: 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books
Click here to access

Deep Clustering Dataset

File Size: 12Mb
Filetype: WAV / MP3 / OGG
Language(s): US English
Description: Training deep discriminative embeddings to solve the cocktail party problem
Click here to access

DEMoS

No. Recordings: 9697
No. Participants: 68
Language(s): US English
Description: 9365 emotional and 332 neutral samples were produced by 68 native speakers
Click here to access

EEKK

No. Recordings: 1234
No. Participants: 10
Filetype: MP3
Language(s): US English
Description: 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger, and neutral
Click here to access

Emo-DB

No. Recordings: 500
No. Participants: 10
Language(s): US English
Description: 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust
Click here to access

EmoFilm

No. Recordings: 1115
Filetype: WAV
Language(s): US English
Description: 1,115 audio instances (sentences) extracted from various films
Click here to access

Emotional Voice dataset – Nature

No. Recordings: 2519
No. Participants: 100
Language(s): US English
Description: 2,519 speech samples were produced by 100 actors from 5 cultures
Click here to access

Emov-DB

No. Recordings:
No. Participants: 4
File Size: 1.58GB
Language(s): US English
Description: Recordings of 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepiness, anger, disgust, and amused
Click here to access

EMOVO

No. Recordings: 84
No. Participants: 6
Language(s): Italian
Description: 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness
Click here to access

eNTERFACE05

No. Participants: 42
File Size: 801MB
Language(s): US English
Description: Videos by 42 subjects, coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust
Click here to access

GEMEP corpus

No. Recordings: 145
No. Participants: 10
Filetype: MP3
Language(s): US English
Description: 10 actors portraying 10 different emotional states
Click here to access

IEMOCAP

No. Participants: 10
Filetype: WAV
Language(s): US English
Description: 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration, and neutral
Click here to access

Keio-ESD

Filetype: WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joy, disgusting, downgrading, funny, worried, gentle, relief, indignation, shame, etc.
Click here to access

MSP-IMPROV

No. Recordings: 8,438
No. Participants: 12
Language(s): US English
Description: 20 sentences by 12 actors; 4 emotions (angry, sad, happy, neutral), plus "other" and "without agreement" labels
Click here to access

MSP Podcast Corpus

No. Recordings: 62140
No. Participants: 3260
Language(s): US English
Description: 100 hours by over 100 speakers – annotated with emotional labels using attribute-based descriptors
Click here to access

NISQA Speech Quality Corpus

No. Recordings: 14,000
No. Participants: 3,260
Language(s): US English
Description: Includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions
Click here to access

OGVC

No. Recordings: 9114
No. Participants: 4
Language(s): Japanese
Description: 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors
Click here to access

RECOLA

No. Participants: 46
Language(s): US English
Description: 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal)
Click here to access

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

No. Recordings: 7,356
No. Participants: 24
File Size: 24.8Gb
Filetype: WAV
Language(s): US English
Description: 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent
Click here to access

SAVEE Dataset

No. Recordings: 480
No. Participants: 4
Filetype: MP4
Language(s): British English
Description: 4 male actors in 7 different emotions, 480 British English utterances in total
Click here to access

SEMAINE

No. Recordings: 95
No. Participants: 21
Language(s): US English
Description: 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions
Click here to access

ShEMO3000

No. Recordings: 3,000
No. Participants: 87
Filetype: WAV
Language(s): Persian
Description: Semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers
Click here to access

Spoken Commands dataset

No. Recordings: 1,500
File Size: 10MB per word
Language(s): US English
Description: A testbed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations
Click here to access

Tess

No. Recordings: 2,800
No. Participants: 2
Filetype: WAV
Language(s): US English
Description: 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality.
Click here to access

Thorsten dataset

No. Recordings: 22668
Filetype: WAV
Language(s): German
Description: German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
Click here to access

URDU-Dataset

No. Recordings: 400
No. Participants: 38
Filetype: WAV
Language(s): Urdu
Description: 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
Click here to access

VCTK dataset

No. Recordings: 44,000
No. Participants: 110
File Size: 10.94GB
Filetype: TXT
Language(s): US English
Description: 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
Click here to access

VIVAE

No. Recordings: 1,085
No. Participants: 12
File Size: 93.5MB
Filetype: VIVAE
Language(s): US English
Description: 1,085 non-speech audio files by ~12 speakers; 6 emotions (achievement, anger, fear, pain, pleasure, and surprise) at 4 emotional intensities (low, moderate, strong, peak).
Click here to access

VoxPopuli

No. Recordings: 400,000
File Size: 6.4T
Filetype: WAV
Language(s): Multilingual (23 languages)
Description: 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16×15 directions.
Click here to access

Open Video Datasets for Computer Vision and Multimodal Models

The video datasets in this section span human action recognition, driving scenes, face recognition, and multimodal tasks that combine audio and video. They’re ideal for benchmarking computer vision models, training action classifiers, and evaluating models before you invest in a dedicated video dataset collection for your product.

Twenty Billion Neurons Crowd Acting video dataset collection

No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: Large-scale Human-centric Video Analysis in Complex Events
Click here to access

The VIRAT Video Dataset

No. Recordings: 262
File Size: 12MB
Filetype: PDF
Language(s): US English
Description: The VIRAT Video Dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, diversity in scenes, and human activity/event categories
Click here to access

The WebVid-10M Dataset

No. Recordings: 10700000
File Size: 2.5MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset of short videos with textual descriptions sourced from the web
Click here to access

The MECCANO Dataset

No. Recordings: 73206
No. Participants: 93
File Size: 32.3GB
Filetype: MP4
Language(s): US English
Description: The first dataset of egocentric videos to study human-object interactions in industrial-like settings.
Click here to access

Moments In Time

No. Recordings: 1,000,000
File Size: 150MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset for recognizing and understanding action in videos
Click here to access

Something Something Dataset

No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: A large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects
Click here to access

BDD100K

No. Recordings: 100000
File Size: 3.9GB
Filetype: MP4
Language(s): US English
Description: Comprises ten tasks and 100K videos to estimate the progress of image recognition algorithms on autonomous driving
Click here to access

Kinetics-700

No. Recordings: 650,000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes.
Click here to access

Casual Conversations Dataset

No. Recordings: 45,186
No. Participants: 3011
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: About 45,000 videos from 3,011 participants, intended for assessing the performance of already trained models in computer vision and audio applications
Click here to access

VoxCeleb

No. Recordings: 1,000,000
No. Participants: 7,000
File Size: 133MB
Filetype: MP4
Language(s): US English
Description: An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube
Click here to access

TV Human Interaction Dataset

No. Recordings: 300
File Size: 156MB
Filetype: MP4
Language(s): US English
Description: 300+ videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss
Click here to access

THUMOS Dataset

No. Recordings: 25,000,000
File Size: 385KB
Filetype: MP4
Language(s): US English
Description: A large collection of video clips of different kinds; the dataset can be used for action classification
Click here to access

50 Salads Dataset

No. Participants: 25
File Size: 31GB
Filetype: RGB
Language(s): US English
Description: Fully annotated 4.5-hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each.
Click here to access

YoutubeFace

No. Recordings: 3425
No. Participants: 1595
Filetype: MP4
Language(s): US English
Description: A database of face videos designed for studying the problem of unconstrained face recognition in videos.
Click here to access

PaSc

No. Recordings: 9376
No. Participants: 293
Language(s): US English
Description: A facial recognition dataset of 9,376 still images and 2,802 videos of 293 people
Click here to access

iQIYI-VID

No. Recordings: 600000
No. Participants: 5000
Filetype: MP4
Language(s): US English
Description: The largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities.
Click here to access

COIN

No. Recordings: 11827
File Size: 8.47MB
Filetype: JSON
Language(s): US English
Description: 11,827 videos related to 180 different tasks, which were all collected from YouTube
Click here to access

CityScapes

No. Recordings: 25000
File Size: 51.92GB
Filetype: JPG
Language(s): US English
Description: A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities
Click here to access

AVA-Kinetics Dataset

No. Recordings: 3650000
No. Participants: 39000
File Size: 7.7MB
Filetype: CSV
Language(s): US English
Description: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity.
Click here to access

Activity Net

No. Recordings: 20,194
File Size: 600GB
Filetype: JSON
Language(s): US English
Description: A Large-Scale Video Benchmark for Human Activity Understanding
Click here to access

Kinetics

No. Recordings: 650000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.
Click here to access

Yahoo-Flickr Creative Commons 100 Million Dataset

No. Recordings: 100000000
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: The YFCC100M is the largest publicly and freely usable multimedia collection, containing around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses
Click here to access

UMDFaces

No. Recordings: 4067888
No. Participants: 11377
File Size: 173MB
Filetype: MP4
Language(s): US English
Description: UMDFaces is a face dataset divided into two parts: Still Images – 367,888 face annotations for 8,277 subjects and Video Frames – Over 3.7 million annotated video frames from over 22,000 videos of 3100 subjects.
Click here to access

Condensed Movies

No. Recordings: 462,000
File Size: 250GB
Filetype: MP4
Language(s): US English
Description: A large-scale video dataset, featuring clips from movies with detailed captions
Click here to access

AVSpeech

No. Recordings: 290,000
File Size: 128MB
Filetype: MP4
Language(s): US English
Description: AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises
Click here to access

EyeC3D

No. Participants: 21
File Size: 3.9GB
Language(s): US English
Description: 3D video eye tracking dataset
Click here to access

MoVi

No. Recordings: 1890
No. Participants: 90
File Size: 1.3MB
Filetype: MP4
Language(s): US English
Description: A large multi-purpose human motion and video dataset
Click here to access

Thör

No. Recordings: 22668
Filetype: WAV
Language(s): US English
Description: A public dataset of human motion trajectories, recorded in a controlled indoor experiment.
Click here to access

SEWA

No. Participants: 398
Filetype: WAV
Language(s): US English
Description: More than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
Click here to access

Open Audio Datasets in Other Languages

If you’re working beyond English, these open audio datasets in other languages can jump-start your multilingual ASR, TTS, or emotion-recognition projects. For production use, teams often pair them with smaller, carefully collected custom datasets targeting specific dialects, industries, or recording conditions.

The SIWIS French Speech Synthesis Database

No. Recordings: 9,750
File Size: 2.671Gb
Filetype: .WAV
Language(s): French
Description: High-quality French speech recordings and associated text files, aimed at building TTS systems, investigate multiple styles, and emphasis
Click here to access

TCOF : Traitement de Corpus Oraux en Français

No. Recordings: 626
Filetype: .WAV
Language(s): French
Description: The corpus made available includes two main categories: recordings of adult-child interactions (children up to 7 years old) and recordings of interactions between adults. The recordings are of various durations: from 5 to 45 minutes or more.
Click here to access

African Accented French

No. Participants: 84
File Size: 1.8Gb
Filetype: .WAV
Language(s): French
Description: This corpus consists of approximately 22 hours of speech recordings. It has recordings from 84 speakers, 48 male, and 36 female.
Click here to access

Fisher Spanish Speech

No. Participants: 136
No. Recordings: 819
Filetype: .WAV
Language(s): Spanish
Description: This corpus consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
Click here to access

CallFriend – Spanish Corpus

No. Participants: 120
No. Recordings: 60
Filetype: .WAV
Language(s): Spanish
Description: The CallFriend Spanish corpus of telephone speech was collected by the Linguistic Data Consortium primarily in support of the project on Language Identification (LID), sponsored by the U.S. Department of Defense and consists of 60 unscripted telephone conversations between native speakers of Spanish for each dialect group
Click here to access

TV3Parla

File Size: 27.6Gb
Filetype: .WAV
Language(s): Catalan
Description: This corpus includes 240 hours of Catalan speech from broadcast material.
Click here to access

emotiontts_open_db

Filetype: .WAV
Language(s): Korean
Description: Recordings and their associated transcriptions by a diverse group of speakers covering 4 emotions: general, joy, anger, and sadness.
Click here to access

Pansori TEDxKR

No. Participants: 41
No. Recordings: 60
File Size: 174Mb
Filetype: .FLAC
Language(s): Korean
Description: The Pansori TEDxKR Corpus is a Korean speech recognition (ASR) corpus generated from Korean language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech audio-transcript pairs from 41 speakers. This corpus was generated by using a new corpus data ingestion and processing system called Pansori.
Click here to access

EMOVO

No. Participants: 6
No. Recordings: 84
File Size: 237Mb
Filetype: .WAV
Language(s): Italian
Description: This dataset consists of 6 actors who recite 14 sentences in 6 different emotions: disgust, fear, anger, joy, surprise, sadness.
Click here to access

Online gaming voice chat corpus (OGVC)

No. Participants: 4
No. Recordings: 2,656
Filetype: .WAV
Language(s): Japanese
Description: This speech material contains 2,656 acted utterances spoken by four professional actors (two male and two female). 17 short dialogues were selected from the dialogues recorded for the naturalistic emotional speech. The actors were instructed to speak each utterance in the short dialog with a specific emotion in three different levels of emotional intensity.
Click here to access

Keio University Japanese Emotional Speech Database (Keio-ESD)

No. Participants: 1
Filetype: .WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker, plus a set of artificial speech synthesized by a system that was developed using a subset of this database for training.
Click here to access

NST Danish ASR Database

No. Participants: 616
No. Recordings: 229,992
Filetype: .WAV
Language(s): Danish
Description: This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Danish.
Click here to access

NST Danish Dictation

No. Participants: 151
No. Recordings: 34,955
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for dictation.
Click here to access

NST Danish Speech Synthesis

No. Participants: 1
No. Recordings: 4,108
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for speech synthesis.
Click here to access

FT Speech

No. Participants: 434
No. Recordings: 1,017,244
Filetype: .WAV
Language: Danish
Description: FT Speech is a new speech corpus created from the recorded meetings of the Danish Parliament, also known as the Folketing (FT). It contains over 1,800 hours of transcribed speech by a total of 434 speakers, which are partitioned into five subsets with no speaker overlap between train, development, and test data.
Click here to access
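
Keeping speakers disjoint across subsets prevents a model from being scored on voices it already heard in training. The sketch below shows one simple way to do that for your own data by assigning whole speakers to train, dev, and test. It is not FT Speech's own partitioning procedure; the `split_by_speaker` helper, the fractions, and the seed are all assumptions made for illustration.

```python
import random


def split_by_speaker(utterances, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Assign whole speakers to train/dev/test so no speaker spans two subsets.

    `utterances` is a list of (speaker_id, utterance_id) pairs.
    """
    speakers = sorted({speaker for speaker, _ in utterances})
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_fraction))
    n_dev = max(1, round(len(speakers) * dev_fraction))
    test_speakers = set(speakers[:n_test])
    dev_speakers = set(speakers[n_test:n_test + n_dev])

    splits = {"train": [], "dev": [], "test": []}
    for speaker, utterance in utterances:
        if speaker in test_speakers:
            splits["test"].append((speaker, utterance))
        elif speaker in dev_speakers:
            splits["dev"].append((speaker, utterance))
        else:
            splits["train"].append((speaker, utterance))
    return splits


if __name__ == "__main__":
    # Toy data: 20 hypothetical speakers with 5 utterances each.
    data = [(f"spk{s:02d}", f"utt{u}") for s in range(20) for u in range(5)]
    for name, items in split_by_speaker(data).items():
        print(name, len(items))
```

Splitting by speaker instead of by utterance means subset sizes follow how much each speaker recorded, so check the resulting utterance counts when speakers are very unevenly represented.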

FalaBrasil-LaPS Benchmark

No. Participants: 35
No. Recordings: 700
Filetype: .WAV
Language: Portuguese
Description: LaPS is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. Contains 35 speakers (10 females), each one pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audios were recorded at 22.05 kHz without environment control.
Click here to access

M-AILABS Polish Corpus

No. Participants: 35
No. Recordings: 700
File Size: 110Gb
Filetype: .WAV
Language: Polish
Description: The M-AILABS Speech Dataset is the first large dataset that we are providing free-of-charge, freely usable as training data for speech recognition and speech synthesis.
Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly 1,000 hours of audio and text files in a prepared format. The texts were published between 1884 and 1964, and are in the public domain.
Click here to access

Estonian

No. Participants: 10
No. Recordings: 1,040
Filetype: .WAV
Language: Estonian
Description: 26 text passages read by 10 speakers, covering 4 main emotions: joy, sadness, anger, and neutral.
Click here to access

AESDD

No. Participants: 10
No. Recordings: 500
File Size: 391Mb
Filetype: .WAV
Language: Greek
Description: The Acted Emotional Speech Dynamic Database (AESDD) is a publicly available speech emotion recognition dataset. It contains utterances of acted emotional speech in the Greek language. The dataset consists of 500 utterances recorded by a diverse group of actors covering 5 different emotions: anger, disgust, fear, happiness, and sadness.
Click here to access

Microsoft Speech Corpus (Indian languages)

No. Recordings: 124,599
Filetype: .WAV
Languages: Telugu; Tamil; Gujarati
Description: Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. The data package includes audio and corresponding transcripts.
Click here to access

Tunisian

No. Participants: 118
No. Recordings: 11.2 hours
File Size: 391Mb
Filetype: .WAV
Language: Arabic
Description: Modern Standard Arabic (MSA) speech from Tunisia, recorded by 118 speakers.
Click here to access

AISHELL-1

No. Participants: 400
File Size: 15Gb
Filetype: .WAV
Language: Mandarin
Description: Aishell is an open-source Chinese Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd.
400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using high-fidelity microphones and downsampled to 16 kHz. The manual transcription accuracy is above 95%, achieved through professional speech annotation and strict quality inspection. The data is free for academic use.
Click here to access

Malayalam Speech Corpus

No. Participants: 75
File Size: 326Mb
Filetype: .WAV
Language: Malayalam
Description: The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic Speech Recognition (ASR) research. It consists of 250 hours of agricultural speech data from 3 female, 12 male, and 60 unidentified participants.
Click here to access

Google Malayalam

No. Participants: 24
File Size: 1.345Gb
Filetype: .WAV
Language: Malayalam
Description: This data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers.
Click here to access

Multilingual LibriSpeech (Facebook AI)

File Size: 3Tb
Filetype: .WAV
Languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish
Description: Multilingual LibriSpeech (MLS) is a large-scale, open-source dataset designed to help advance research in automatic speech recognition (ASR). MLS is intended to help the speech research community work in languages beyond English, so that people around the world can benefit from improvements in a wide range of AI-powered services.
Click here to access

The BABEL Project

Filetype: .WAV
Languages: Bulgarian, Estonian, Hungarian, Polish, and Romanian
Description: BABEL was a joint European project under the COPERNICUS scheme comprising partners from a number of Eastern and Western European research centers. BABEL has produced a multi-language database comprising five of the most widely differing Eastern European languages.
Click here to access

Living Audio Dataset

Languages: Dutch, English, Irish, Russian
Description: A crowd-built, continuously growing speech dataset with transcripts. It covers multiple languages and is designed so that anyone can add to it.
Click here to access

Microsoft Speech Language Translation Corpus

No. Recordings: 61,270
File Size: 326Mb
Filetype: .WAV
Languages: English, Chinese, Japanese
Description: The Microsoft Speech Language Translation Corpus release contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese collected by Microsoft Research. The package includes audio data, transcripts, and translations and allows end-to-end testing of spoken language translation systems on real-world data.
Click here to access

How to Choose the Right Open Dataset for Your Model

With so many open audio and video datasets available, it helps to be systematic:

Match modality to task:
- For speech recognition, prioritise clean, transcribed speech datasets (LibriSpeech, AISHELL-1, Common Voice, AISHELL-2, etc.).
- For sound event detection, look at environmental sound corpora like UrbanSound8K, AudioSet, or FSD50K.
- For computer vision, use action and scene datasets such as Kinetics, Moments in Time, BDD100K, or Cityscapes.
Check licence and usage rights:
Some open datasets are research-only; others allow commercial use. Always confirm you can legally use the data in your product.
Consider domain and demographics:
If your users aren’t English speakers or don’t match the demographics in common benchmarks, you’ll likely need additional, targeted data to avoid bias.
Decide when to go custom:
Open datasets are excellent for experimentation, but for safety-critical, regulated, or brand-sensitive applications, you’ll usually get better performance and fewer surprises with custom audio and video datasets collected for your exact use case. Twine AI can help you scope, collect, and label that data at scale.
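
The checklist above can be sketched as a simple filter over a dataset catalogue. This is a minimal illustration, not a tool we ship: the `DatasetInfo` fields, the `shortlist` helper, and every catalogue entry are hypothetical placeholders, and the licence flags are invented. Always read each dataset's actual licence before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record for one open dataset."""
    name: str
    modality: str          # "audio" or "video"
    task: str              # e.g. "speech-recognition"
    languages: frozenset   # languages covered (empty for non-speech data)
    commercial_use: bool   # True only if the licence permits commercial use


def shortlist(catalogue, modality, task, language, need_commercial):
    """Return the entries that match modality, task, language and licence needs."""
    matches = []
    for entry in catalogue:
        # Step 1: match modality to task.
        if entry.modality != modality or entry.task != task:
            continue
        # Step 2: check licence and usage rights.
        if need_commercial and not entry.commercial_use:
            continue
        # Step 3: consider domain and demographics (here, language only).
        if language not in entry.languages:
            continue
        matches.append(entry)
    return matches


# Placeholder entries: names and licence flags are invented for illustration.
CATALOGUE = [
    DatasetInfo("example-read-speech", "audio", "speech-recognition",
                frozenset({"English"}), True),
    DatasetInfo("example-research-speech", "audio", "speech-recognition",
                frozenset({"English", "Mandarin"}), False),
    DatasetInfo("example-street-sounds", "audio", "sound-event-detection",
                frozenset(), True),
    DatasetInfo("example-actions", "video", "action-recognition",
                frozenset(), True),
]

candidates = shortlist(CATALOGUE, "audio", "speech-recognition",
                       "English", need_commercial=True)
print([c.name for c in candidates])

# Step 4: an empty shortlist is a signal to consider custom data.
gaps = shortlist(CATALOGUE, "audio", "speech-recognition",
                 "Malayalam", need_commercial=True)
if not gaps:
    print("No open match: consider a custom dataset")
```

In practice you would extend the record with fields such as speaker count, recording conditions, or accent coverage, since language alone is a coarse proxy for demographics.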

Conclusion

We hope this list of 150+ open audio and video datasets has made it easier to find the training data you need to get started. Whether you’re benchmarking speech models, experimenting with emotion recognition, or building a new computer vision pipeline, off-the-shelf datasets are a powerful way to prototype quickly and compare approaches.

Once you know what’s possible with open data, the next step is usually to close the gap between “benchmark performance” and “real-world performance”. That’s where custom audio and video datasets come in, tailored to your languages, environments, hardware, and user base. Twine AI can help you design, collect, and label these datasets end-to-end, across 150+ languages and dialects.

Ready for a dataset that’s built specifically for your use case? Contact the Twine AI team and we’ll help you scope, budget, and deliver a custom audio or video dataset.
