150+ Open Audio and Video Datasets

Twine AI enables businesses to build ethical, custom datasets that reduce model bias and cover areas where humans are subjects, such as voice and vision. To help make model-building easier, we have put together a list of over 150 Open Audio and Video Datasets.

No matter the requirement—from dataset language to file type to participant gender—there is a dataset perfect for your machine-learning model.

Simply browse and sign up to gain access.

Need a custom dataset specific to your project? Contact us here.



Open Datasets – Audio

Urban Sound 8K dataset
No. Recordings: 8732
File Size: 13.84KB
Filetype: .WAV/.CSV
Language(s): US English
Description: Contains urban sounds from 10 classes, such as air conditioner, dog bark, drilling, siren, and street music.
Click here to access
Mozilla Common Voice
No. Recordings: 75,879
File Size: 63Gb
Filetype: MP3
Language(s): US English
Description: An open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
Click here to access
HiEve
No. Recordings: 1,000,000
Filetype: MP4
Language(s): US English
Description: The largest collection of poses that focuses on very challenging and realistic tasks of human-centric analysis in various crowds & complex events, including subway getting on/off, collision, fighting, and earthquake escape
Click here to access
Voices Obscured in Complex Environmental Settings (VOICES) Dataset
No. Recordings: 3,903
File Size: 1.3Gb
Filetype: MP3
Language(s): US English
Description: A creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.
Click here to access
Free Spoken digit dataset
No. Recordings: 3000
No. Participants: 6
File Size: 10Mb
Filetype: WAV
Language(s): US English
Description: A simple speech dataset consisting of recordings of spoken English digits
Click here to access
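Before training on a WAV-based dataset such as this one, it can help to check each file's sample rate, channel count, and duration. The following is a minimal sketch using only Python's standard `wave` module. The file name `placeholder_digit.wav`, the 8 kHz rate, and the synthetic tone are assumptions made so the example runs without downloading anything; they are not taken from the dataset itself.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def write_test_tone(path, rate=8000, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone (a stand-in for a real recording)."""
    n = int(rate * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        clip = Path(tmp) / "placeholder_digit.wav"  # hypothetical file name
        write_test_tone(clip)
        print(describe_wav(clip))
```

Pointing `describe_wav` at the files of a downloaded dataset is a quick way to spot clips whose sample rate or channel layout differs from the rest.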
The Stereo Human Pose Estimation Dataset
No. Recordings: 630
No. Participants: 26
File Size: 197.8Mb
Filetype: JPEG
Language(s): US English
Description: A dataset of stereo image pairs suited for stereo human pose estimation of the upper body.
Click here to access
The Spoken Wikipedia Corpora
No. Recordings: 5,397
No. Participants: 879
File Size: 23Gb
Filetype: MP3
Language(s): English, German, Dutch
Description: This is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia
Click here to access
TED-LIUM
No. Recordings: 1,495
Language(s): US English
Description: Audio recordings of 1,495 TED talks, along with full-text transcriptions of those recordings
Click here to access
Speech Commands Dataset
No. Recordings: 65,000
Language(s): US English
Description: 65,000 one-second-long utterances of 30 short words, by thousands of different people
Click here to access
Persian Consonant Vowel Combination (PCVC) Speech Dataset
No. Recordings: 30,000
No. Participants: 217
Filetype: MAT
Language(s): Persian
Description: This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
Click here to access
Arabic Speech Corpus
No. Recordings: 5439
Filetype: WAV
Language(s): Arabic
Description: Phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level
Click here to access
TIMIT
No. Recordings: 6,300
No. Participants: 630
Filetype: WAV
Language(s): US English
Description: Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences
Click here to access
Mivia Audio Events Dataset
No. Recordings: 6,000
Filetype: WAV
Language(s): US English
Description: 6,000 events for surveillance applications, namely glass breaking, gunshots, and screams
Click here to access
Urban Sound Dataset
No. Recordings: 1,302
Filetype: WAV
Language(s): US English
Description: 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music
Click here to access
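Because each recording is labeled with event start and end times, a common first step is to cut the labeled events out of the longer recording. The sketch below shows one way to convert times in seconds to sample indices and slice a sample sequence. The `SoundEvent` class, the function names, and the toy data are hypothetical illustrations, not part of the dataset's own tooling or annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundEvent:
    """One labeled event: class name plus start/end time in seconds."""
    label: str
    start_s: float
    end_s: float


def event_to_sample_range(event, sample_rate_hz, total_samples):
    """Convert an event's times to a clamped [start, end) sample range."""
    start = max(0, int(round(event.start_s * sample_rate_hz)))
    end = min(total_samples, int(round(event.end_s * sample_rate_hz)))
    if end <= start:
        raise ValueError(f"event {event!r} falls outside the recording")
    return start, end


def slice_events(samples, sample_rate_hz, events):
    """Return (label, clip) pairs, one per labeled event."""
    clips = []
    for event in events:
        start, end = event_to_sample_range(event, sample_rate_hz, len(samples))
        clips.append((event.label, samples[start:end]))
    return clips


if __name__ == "__main__":
    # Toy 10-second "recording" at 10 Hz so the indices are easy to follow.
    recording = list(range(100))
    events = [
        SoundEvent("dog_bark", 1.0, 2.5),
        SoundEvent("siren", 9.0, 12.0),  # end time is clamped to the recording
    ]
    for label, clip in slice_events(recording, 10, events):
        print(label, len(clip))
```

Clamping the end index keeps annotations that run slightly past the end of a file from raising an error; stricter pipelines may prefer to reject such events instead.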
Clotho Dataset
No. Recordings: 4,981
Filetype: MP3
Language(s): US English
Description: A novel audio captioning dataset, consisting of 4,981 audio samples, each with five captions
Click here to access
FSD50K
No. Recordings: 51,197
Filetype: WAV
Language(s): US English
Description: An open dataset of human-labeled sound events containing Freesound clips unequally distributed in 200 classes
Click here to access
Vocal Imitation Set v1.1.3
File Size: 7.6Gb
Filetype: WAV
Language(s): US English
Description: A collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound
Click here to access
Google Audio set
No. Recordings: 2,084,320
Filetype: WAV
Language(s): US English
Description: 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos
Click here to access
CALLHOME American English Speech
No. Recordings: 120
No. Participants: 240
Language(s): US English
Description: 120 unscripted 30-minute telephone conversations between native speakers of English
Click here to access
LibriSpeech ASR Corpus
No. Recordings: 1,000
Filetype: MP3
Language(s): US English
Description: 1,000 hours of 16kHz read English speech
Click here to access
Speech Accent Archive
No. Recordings: 2,140
File Size: 907Mb
Filetype: MP3
Language(s): US English
Description: Parallel English speech samples from 177 countries
Click here to access
Phone Conversation Data Sample
No. Recordings: 1,822
Filetype: WAV
Language(s): US English
Description: Conversations in Dutch, Japanese, and Irish English
Click here to access
Alexa Wake Word Voice Samples
No. Recordings: 24
Filetype: WAV
Language(s): US English
Description: Sample of 24 Alexa wake word recordings in four languages
Click here to access
The LJ Speech Dataset
No. Recordings: 13,100
File Size: 2.6Gb
Filetype: CSV
Language(s): US English
Description: Public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books
Click here to access
AISHELL-2
No. Recordings: 1,000,000
No. Participants: 1,991
Language(s): Mandarin
Description: The largest free speech corpus available for Mandarin ASR research
Click here to access
AEDD
No. Recordings: 500
No. Participants: 5
Language(s): US English
Description: 500 utterances by a diverse group of actors (over 5 actors) simulating various emotions
Click here to access
ANAD
No. Recordings: 1,384
No. Participants: 8
File Size: 2Gb
Filetype: WAV
Language(s): US English
Description: 1384 recordings by multiple speakers; 3 emotions: angry, happy, surprised
Click here to access
AudioMNIST
No. Recordings: 30,000
No. Participants: 60
Filetype: MP3
Language(s): US English
Description: Consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers
Click here to access
BAVED
No. Recordings: 1,935
No. Participants: 61
File Size: 97.8Mb
Filetype: WAV
Language(s): US English
Description: 1,935 recordings by 61 speakers (45 male and 16 female).
Click here to access
CMU-MOSEI
No. Participants: 1,000
Language(s): US English
Description: 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 Emotions (happiness, sadness, anger, fear, disgust, surprise) + Likert scale.
Click here to access
CMU-MOSI
No. Recordings: 2,199
Language(s): US English
Description: 2,199 opinion utterances with annotated sentiment, rated from very negative to very positive in seven Likert steps
Click here to access
CMU Wilderness
No. Participants: 699
Filetype: MP3
Language(s): US English
Description: Speech dataset with voice actors of many accents reciting passages from the Bible
Click here to access
CREMA-D
No. Recordings: 7,442
No. Participants: 91
File Size: 163Mb
Filetype: GIT-LFS
Language(s): US English
Description: 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities
Click here to access
DAPS Dataset
No. Recordings: 100
No. Participants: 20
Language(s): US English
Description: 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books
Click here to access
Deep Clustering Dataset
File Size: 12Mb
Filetype: WAV / MP3 / OGG
Language(s): US English
Description: Training deep discriminative embeddings to solve the cocktail party problem
Click here to access
DEMoS
No. Recordings: 9697
No. Participants: 68
Language(s): US English
Description: 9365 emotional and 332 neutral samples were produced by 68 native speakers 
Click here to access
EEKK
No. Recordings: 1234
No. Participants: 10
Filetype: MP3
Language(s): US English
Description: 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger, and neutral
Click here to access
Emo-DB
No. Recordings: 500
No. Participants: 10
Language(s): US English
Description: 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust
Click here to access
EmoFilm
No. Recordings: 1115
Filetype: WAV
Language(s): US English
Description: 1,115 audio instances (sentences) extracted from various films
Click here to access
Emotional Voice dataset – Nature
No. Recordings: 2519
No. Participants: 100
Language(s): US English
Description: 2,519 speech samples produced by 100 actors from 5 cultures
Click here to access
Emov-DB
No. Recordings: 
No. Participants: 4
File Size: 1.58GB
Language(s): US English
Description: Recordings of 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepiness, anger, disgust, and amused
Click here to access
EMOVO
No. Recordings: 84
No. Participants: 6
Language(s): US English
Description: 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness
Click here to access
eNTERFACE05
No. Participants: 42
File Size: 801MB
Language(s): US English
Description: Videos by 42 subjects, coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust
Click here to access
GEMEP corpus
No. Recordings: 145
No. Participants: 10
Filetype: MP3
Language(s): US English
Description: 10 actors portraying 10 different emotional states
Click here to access
IEMOCAP
No. Participants: 10
Filetype: WAV
Language(s): US English
Description: 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration, and neutral
Click here to access
Keio-ESD
Filetype: WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joy, disgusting, downgrading, funny, worried, gentle, relief, indignation, shame, etc.
Click here to access
MSP-IMPROV
No. Recordings: 8,438
No. Participants: 12
Language(s): US English
Description: 20 sentences by 12 actors; 4 emotions (angry, sad, happy, neutral), plus "other" and "without agreement" labels
Click here to access
MSP Podcast Corpus
No. Recordings: 62140
No. Participants: 3260
Language(s): US English
Description: 100 hours by over 100 speakers – annotated with emotional labels using attribute-based descriptors
Click here to access
NISQA Speech Quality Corpus
No. Recordings: 14,000
No. Participants: 3,260
Language(s): US English
Description: Includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions
Click here to access
OGVC
No. Recordings: 9114 
No. Participants: 4
Language(s): US English
Description: 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors
Click here to access
RECOLA
No. Participants: 46
Language(s): US English
Description: 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal)
Click here to access
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
No. Recordings: 7,356
No. Participants: 24
File Size: 24.8Gb
Filetype: WAV
Language(s): US English
Description: 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent
Click here to access
SAVEE Dataset
No. Recordings: 480
No. Participants: 4
Filetype: MP4
Language(s): British English
Description: 4 male actors in 7 different emotions, 480 British English utterances in total
Click here to access
SEMAINE
No. Recordings: 95
No. Participants: 21
Language(s): US English
Description: 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions
Click here to access
ShEMO
No. Recordings: 3,000
No. Participants: 87
Filetype: WAV
Language(s): Persian
Description: Semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers
Click here to access
Spoken Commands dataset
No. Recordings: 10,000,000
File Size: 10MB per word
Language(s): US English
Description: A testbed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations
Click here to access
TESS
No. Recordings: 2,800
No. Participants: 2
Filetype: WAV
Language(s): US English
Description: 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality.
Click here to access
Thorsten dataset
No. Recordings: 22668
Filetype: WAV
Language(s): German
Description: German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
Click here to access
URDU-Dataset
No. Recordings: 400
No. Participants: 38
Filetype: WAV
Language(s): Urdu
Description: 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
Click here to access
VCTK dataset
No. Recordings: 44,000
No. Participants: 110
File Size: 10.94GB
Filetype: TXT
Language(s): US English
Description: 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
Click here to access
VIVAE
No. Recordings: 1,085
No. Participants: 12
File Size: 93.5MB
Filetype: VIVAE
Language(s): US English
Description: 1,085 non-speech audio files by ~12 speakers; 6 emotions: achievement, anger, fear, pain, pleasure, and surprise, with 4 emotional intensities (low, moderate, strong, peak).
Click here to access
VoxPopuli
No. Recordings: 400,000
File Size: 6.4Tb
Filetype: WAV
Language(s): US English
Description: 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16×15 directions.
Click here to access

Open Datasets – Video

Twenty Billion Neurons Crowd Acting video dataset collection
No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: Large-scale Human-centric Video Analysis in Complex Events
Click here to access
The VIRAT Video Dataset
No. Recordings: 262
File Size: 12MB
Filetype: PDF
Language(s): US English
Description: The VIRAT Video Dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, diversity in scenes, and human activity/event categories
Click here to access
The WebVid-10M Dataset
No. Recordings: 10700000
File Size: 2.5MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset of short videos with textual descriptions sourced from the web
Click here to access
The MECCANO Dataset
No. Recordings: 73206
No. Participants: 93
File Size: 32.3GB
Filetype: MP4
Language(s): US English
Description: The first dataset of egocentric videos to study human-object interactions in industrial-like settings.
Click here to access
Moments In Time
No. Recordings: 1,000,000
File Size: 150MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset for recognizing and understanding action in videos
Click here to access
Something Something Dataset
No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: A large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects
Click here to access
BDD100K
No. Recordings: 100000
File Size: 3.9GB
Filetype: MP4
Language(s): US English
Description: Comprises ten tasks and 100K videos to estimate the progress of image recognition algorithms on autonomous driving
Click here to access
Kinetics-700
No. Recordings: 650,000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes.
Click here to access
Casual Conversations Dataset
No. Recordings: 45,186
No. Participants: 3011
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: 45,000 videos (3,011 participants) intended for assessing the performance of already trained models in computer vision and audio applications
Click here to access
VoxCeleb
No. Recordings: 1,000,000
No. Participants: 7,000
File Size: 133MB
Filetype: MP4
Language(s): US English
Description: An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube
Click here to access
TV Human Interaction Dataset
No. Recordings: 300
File Size: 156MB
Filetype: MP4
Language(s): US English
Description: 300+ videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss
Click here to access
THUMOS Dataset
No. Recordings: 25,000,000
File Size: 385KB
Filetype: MP4
Language(s): US English
Description: A large collection of video clips of different kinds; the dataset can be used for action classification
Click here to access
50 Salads Dataset
No. Participants: 25
File Size: 31GB
Filetype: RGB
Language(s): US English
Description: Fully annotated 4.5-hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each.
Click here to access
YouTube Faces
No. Recordings: 3425
No. Participants: 1595
Filetype: MP4
Language(s): US English
Description: A database of face videos designed for studying the problem of unconstrained face recognition in videos.
Click here to access
PaSC
No. Recordings: 9376
No. Participants: 293
Language(s): US English
Description: A face recognition dataset of 9,376 still images and 2,802 videos of 293 people
Click here to access
iQIYI-VID
No. Recordings: 600000
No. Participants: 5000
Filetype: MP4
Language(s): US English
Description: The largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities.
Click here to access
COIN
No. Recordings: 11827
File Size: 8.47MB
Filetype: JSON
Language(s): US English
Description: 11,827 videos related to 180 different tasks, which were all collected from YouTube
Click here to access
CityScapes
No. Recordings: 25000
File Size: 51.92GB
Filetype: JPG
Language(s): US English
Description: A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities
Click here to access
AVA-Kinetics Dataset
No. Recordings: 3650000
No. Participants: 39000
File Size: 7.7MB
Filetype: CSV
Language(s): US English
Description: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity.
Click here to access
Activity Net
No. Recordings: 20,194
File Size: 600GB
Filetype: JSON
Language(s): US English
Description: A Large-Scale Video Benchmark for Human Activity Understanding
Click here to access
Kinetics
No. Recordings: 650000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.
Click here to access
Yahoo-Flickr Creative Commons 100 Million Dataset
No. Recordings: 100000000
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: The YFCC100M is the largest publicly and freely usable multimedia collection, containing around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses
Click here to access
UMDFaces
No. Recordings: 4067888
No. Participants: 11377
File Size: 173MB
Filetype: MP4
Language(s): US English
Description: UMDFaces is a face dataset divided into two parts: Still Images – 367,888 face annotations for 8,277 subjects and Video Frames – Over 3.7 million annotated video frames from over 22,000 videos of 3100 subjects.
Click here to access
Condensed Movies
No. Recordings: 462,000
File Size: 250GB
Filetype: MP4
Language(s): US English
Description: A large-scale video dataset, featuring clips from movies with detailed captions
Click here to access
AVSpeech
No. Recordings: 290,000
File Size: 128MB
Filetype: MP4
Language(s): US English
Description: AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises
Click here to access
EyeC3D
No. Participants: 21
File Size: 3.9GB
Language(s): US English
Description: 3D video eye tracking dataset
Click here to access
MoVi
No. Recordings: 1890
No. Participants: 90
File Size: 1.3MB
Filetype: MP4
Language(s): US English
Description: A large multi-purpose human motion and video dataset
Click here to access
Thör
No. Recordings: 22668
Filetype: WAV
Language(s): US English
Description: A public dataset of human motion trajectories, recorded in a controlled indoor experiment.
Click here to access
SEWA
No. Participants: 398
Filetype: WAV
Language(s): US English
Description: More than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
Click here to access

Other Languages

The SIWIS French Speech Synthesis Database
No. Recordings: 9,750
File Size: 2.671Gb
Filetype: .WAV
Language(s): French
Description: High-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis
Click here to access
TCOF: Traitement de Corpus Oraux en Français
No. Recordings: 626
Filetype: .WAV
Language(s): French
Description: The corpus made available includes two main categories: recordings of adult-child interactions (children up to 7 years old) and recordings of interactions between adults. The recordings are of various durations: from 5 to 45 minutes or more. 
Click here to access
African Accented French
No. Participants: 84
File Size: 1.8Gb
Filetype: .WAV
Language(s): French
Description: This corpus consists of approximately 22 hours of speech recordings. It has recordings from 84 speakers, 48 male, and 36 female.
Click here to access
Fisher Spanish Speech
No. Participants: 136
No. Recordings: 819
Filetype: .WAV
Language(s): Spanish
Description: This corpus consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
Click here to access
CallFriend – Spanish Corpus
No. Participants: 120
No. Recordings: 60
Filetype: .WAV
Language(s): Spanish
Description: The CallFriend Spanish corpus of telephone speech was collected by the Linguistic Data Consortium, primarily in support of the Language Identification (LID) project sponsored by the U.S. Department of Defense. It consists of 60 unscripted telephone conversations between native speakers of Spanish for each dialect group
Click here to access
TV3Parla
File Size: 27.6Gb 
Filetype: .WAV
Language(s): Catalan
Description: This corpus includes 240 hours of Catalan speech from broadcast material.
Click here to access
emotiontts_open_db
Filetype: .WAV
Language(s): Korean
Description: Recordings and their associated transcriptions by a diverse group of speakers covering 4 emotions: general, joy, anger, and sadness.
Click here to access
Pansori TEDxKR
No. Participants: 41
No. Recordings: 60
File Size: 174Mb 
Filetype: .FLAC
Language(s): Korean
Description: The Pansori TEDxKR Corpus is a Korean speech recognition (ASR) corpus generated from Korean language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech audio-transcript pairs from 41 speakers. This corpus was generated by using a new corpus data ingestion and processing system called Pansori.
Click here to access
EMOVO
No. Participants: 6
No. Recordings: 84
File Size: 237Mb 
Filetype: .WAV
Language(s): Italian
Description: This dataset consists of 6 actors who recite 14 sentences in 6 different emotions: disgust, fear, anger, joy, surprise, sadness.
Click here to access
Online gaming voice chat corpus (OGVC)
No. Participants: 4
No. Recordings: 2,656
Filetype: .WAV
Language(s): Japanese
Description: This speech material contains 2,656 acted utterances spoken by four professional actors (two male and two female). 17 short dialogues were selected from the dialogues recorded for the naturalistic emotional speech. The actors were instructed to speak each utterance in the short dialogue with a specific emotion at three different levels of emotional intensity.
Click here to access
Keio University Japanese Emotional Speech Database (Keio-ESD)
No. Participants: 1
Filetype: .WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker and a set of artificial speech that were synthesized by a system that had been developed using the subset of this database for training.
Click here to access
NST Danish ASR Database
No. Participants: 616
No. Recordings: 229,992
Filetype: .WAV
Language(s): Danish
Description: This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Danish.
Click here to access
NST Danish Dictation
No. Participants: 151
No. Recordings: 34,955
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for dictation.
Click here to access
NST Danish Speech Synthesis
No. Participants: 1
No. Recordings: 4,108
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for speech synthesis.
Click here to access
FT Speech
No. Participants: 434 
No. Recordings: 1,017,244
Filetype: .WAV
Language: Danish
Description: FT Speech is a new speech corpus created from the recorded meetings of the Danish Parliament, also known as the Folketing (FT). It contains over 1,800 hours of transcribed speech by a total of 434 speakers, which are partitioned into five subsets with no speaker overlap between train, development, and test data.
Click here to access
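The "no speaker overlap" property mentioned above matters for any speech dataset you split yourself: if the same voice appears in both training and test data, evaluation scores tend to look better than they should. The following is a minimal sketch of a speaker-disjoint split. It assigns each speaker to a split by hashing the speaker ID, so every utterance from one speaker lands in the same subset. The function names, the three-way split, and the 10%/10% proportions are assumptions for illustration; this is not how the FT Speech authors built their five subsets.

```python
import hashlib
from collections import defaultdict


def assign_split(speaker_id, dev_pct=10, test_pct=10):
    """Map a speaker ID to 'train', 'dev', or 'test' deterministically."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_by_speaker(utterances):
    """Group (utterance_id, speaker_id) pairs into speaker-disjoint splits."""
    splits = defaultdict(list)
    for utterance_id, speaker_id in utterances:
        splits[assign_split(speaker_id)].append(utterance_id)
    return dict(splits)


if __name__ == "__main__":
    # Hypothetical IDs: 50 speakers with 3 utterances each.
    utterances = [
        (f"utt{s}_{i}", f"spk{s}") for s in range(50) for i in range(3)
    ]
    for name, ids in sorted(split_by_speaker(utterances).items()):
        print(name, len(ids))
```

Hashing keeps the assignment stable when new recordings are added later, at the cost of split sizes that only approximate the requested percentages.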
FalaBrasil-LaPS Benchmark
No. Participants: 35 
No. Recordings: 700
Filetype: .WAV
Language: Portuguese 
Description: LaPS is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. Contains 35 speakers (10 females), each one pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audios were recorded at 22.05 kHz without environment control.
Click here to access
M-AILABS Polish Corpus
No. Participants: 35 
No. Recordings: 700
File Size: 110Gb 
Filetype: .WAV
Language: Polish 
Description: The M-AILABS Speech Dataset is the first large dataset that we are providing free-of-charge, freely usable as training data for speech recognition and speech synthesis. Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly 1,000 hours of audio and text files in a prepared format. The texts were published between 1884 and 1964, and are in the public domain.
Click here to access
Estonian
No. Participants: 10 
No. Recordings: 1,040
Filetype: .WAV
Language: Estonian 
Description: 26 text passages read by 10 speakers, covering 4 main emotions: joy, sadness, anger, and neutral.
Click here to access
AESDD
No. Participants: 10 
No. Recordings: 500
File Size: 391Mb
Filetype: .WAV
Language: Greek 
Description: The Acted Emotional Speech Dynamic Database (AESDD) is a publicly available speech emotion recognition dataset. It contains utterances of acted emotional speech in the Greek language. The dataset consists of 500 utterances recorded by a diverse group of actors covering 5 different emotions: anger, disgust, fear, happiness, and sadness.
Click here to access
Microsoft Speech Corpus (Indian languages)
No. Recordings: 124,599
Filetype: .WAV
Languages: Telugu; Tamil; Gujarati
Description: Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. The data package includes audio and corresponding transcripts.
Click here to access
Tunisian
No. Participants: 118
No. Recordings: 11.2 hours
File Size: 391Mb
Filetype: .WAV
Language: Arabic
Description: Modern Standard Arabic (MSA) speech from Tunisia, recorded by 118 speakers.
Click here to access
AISHELL-1
No. Participants: 400
File Size: 15Gb
Filetype: .WAV
Language: Mandarin
Description: Aishell is an open-source Chinese Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. 400 people from different accent areas in China were invited to take part in the recording, which was conducted in a quiet indoor environment using a high-fidelity microphone and downsampled to 16kHz. Through professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The data is free for academic use.
Click here to access
Malayalam Speech Corpus
No. Participants: 75
File Size: 326Mb
Filetype: .WAV
Language: Malayalam
Description: The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic Speech Recognition (ASR) research and consists of 250 hours of agricultural speech data involving 3 female, 12 male, and 60 unidentified participants.
Click here to access
Google Malayalam
No. Participants: 24
File Size: 1.345Gb
Filetype: .WAV
Language: Malayalam
Description: This data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers.
Click here to access
Multilingual LibriSpeech (MLS)
File Size: 3Tb
Filetype: .WAV
Languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish
Description: Multilingual LibriSpeech (MLS) is a large-scale, open-source dataset designed to help advance research in automatic speech recognition (ASR). MLS is designed to help the speech research community work in languages beyond just English so people around the world can benefit from improvements in a wide range of AI-powered services.
Click here to access
The BABEL Project
Filetype: .WAV
Language: Bulgarian, Estonian, Hungarian, Polish, and Romanian
Description: BABEL was a joint European project under the COPERNICUS scheme comprising partners from a number of Eastern and Western European research centers. BABEL has produced a multi-language database comprising five of the most widely differing Eastern European languages.
Click here to access
Living Audio Dataset
Languages: Dutch, English, Irish, Russian
Description: A “Crowd-Built” continuously growing speech dataset with transcripts. The dataset contains multiple languages and is intended for anyone to be able to add to it.
Click here to access
Microsoft Speech Language Translation Corpus
No. Recordings: 61,270
File Size: 326Mb
Filetype: .WAV
Languages: English, Chinese, Japanese
Description: The Microsoft Speech Language Translation Corpus release contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese collected by Microsoft Research. The package includes audio data, transcripts, and translations and allows end-to-end testing of spoken language translation systems on real-world data.
Click here to access

Conclusion

We hope this list of 150+ open datasets helps with your model-building. If there are any open datasets you would like us to add to the list, please let us know here.

Found a dataset you’d like to use? Click here to access

Need a custom dataset specific to your project? Contact us here

