100+ Open Audio and Video Datasets

At Twine, we specialize in helping AI companies create high-quality custom audio and video AI datasets.  

During conversations with clients, we often get asked if there are any off-the-shelf audio and video datasets we would recommend, for testing and for them to use as a point of comparison with custom approaches.

When we started searching for lists of such datasets, we were surprised by how limited they were.

To address this, we have put together a list of 100+ open audio and video datasets. Where available, each entry below lists the number of recordings in the dataset, the number of participants involved, the languages of the speech content, the file size, and the file type.

We have also put together an Airtable of this dataset list so that you can easily search, filter, edit and export it yourself. Please click the link below if you would like to access it:

Access the searchable AI Datasets table
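
If you export the table to CSV, a few lines of Python are enough to search and filter it offline. The snippet below is a minimal sketch, not a description of the actual Airtable export: the column names (Name, No. Recordings, No. Participants, Language(s), File Size, Filetype) are assumptions modelled on the fields used in this list, the semicolon-separated language column is likewise assumed, and the three sample rows are copied from entries further down.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical export layout: the header names below are assumptions
# modelled on the fields used in this list, not a confirmed Airtable schema.
SAMPLE_CSV = """Name,No. Recordings,No. Participants,Language(s),File Size,Filetype
Free Spoken digit dataset,3000,6,US English,10Mb,WAV
Arabic Speech Corpus,5439,,Arabic,,WAV
Microsoft Speech Corpus (Indian languages),"124,599",,Telugu; Tamil; Gujarati,,.WAV
"""


@dataclass
class DatasetEntry:
    name: str
    recordings: Optional[int]
    participants: Optional[int]
    languages: List[str]
    file_size: str
    filetype: str


def parse_count(raw: str) -> Optional[int]:
    """Turn '124,599' into 124599; return None for empty or non-numeric cells."""
    digits = raw.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def load_entries(csv_text: str) -> List[DatasetEntry]:
    """Parse CSV text with the assumed header row into DatasetEntry records."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        languages = [
            part.strip()
            for part in (row.get("Language(s)") or "").split(";")
            if part.strip()
        ]
        entries.append(
            DatasetEntry(
                name=(row.get("Name") or "").strip(),
                recordings=parse_count(row.get("No. Recordings") or ""),
                participants=parse_count(row.get("No. Participants") or ""),
                languages=languages,
                file_size=(row.get("File Size") or "").strip(),
                filetype=(row.get("Filetype") or "").strip(),
            )
        )
    return entries


def filter_by_language(entries: List[DatasetEntry], language: str) -> List[DatasetEntry]:
    """Keep entries whose language list contains `language` (case-insensitive)."""
    wanted = language.lower()
    return [e for e in entries if wanted in (lang.lower() for lang in e.languages)]


if __name__ == "__main__":
    for entry in filter_by_language(load_entries(SAMPLE_CSV), "Arabic"):
        print(entry.name, entry.recordings, entry.filetype)
```

Missing fields are kept as None or empty strings rather than guessed, which mirrors how entries in this list simply omit values that are not known.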


100+ Audio and Video Datasets

Audio

Urban Sound 8K dataset
No. Recordings: 8732
File Size: 13.84KB
Filetype: .WAV/.CSV
Language(s): US English
Description: Contains urban sounds from 10 classes, such as air conditioner, dog bark, drilling, siren, and street music.
https://urbansounddataset.weebly.com/urbansound8k.html
Mozilla Common Voice
No. Recordings: 75,879
File Size: 63Gb
Filetype: MP3
Language(s): US English
Description: An open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
https://commonvoice.mozilla.org/en/datasets
HiEve
No. Recordings: 1,000,000
Filetype: MP4
Language(s): US English
Description: The largest collection of poses which focuses on very challenging and realistic tasks of human-centric analysis in various crowds & complex events, including subway getting on/off, collision, fighting, and earthquake escape
http://humaninevents.org/
Voices Obscured in Complex Environmental Settings (VOICES) Dataset
No. Recordings: 3,903
File Size: 1.3Gb
Filetype: MP3
Language(s): US English
Description: A creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.
https://iqtlabs.github.io/voices/
Free Spoken digit dataset
No. Recordings: 3000
No. Participants: 6
File Size: 10Mb
Filetype: WAV
Language(s): US English
Description: A simple speech dataset consisting of recordings of spoken English digits
https://github.com/Jakobovski/free-spoken-digit-dataset
The Stereo Human Pose Estimation Dataset
No. Recordings: 630
No. Participants: 26
File Size: 197.8Mb
Filetype: JPEG
Language(s): US English
Description: A dataset of stereo image pairs suited for stereo human pose estimation of the upper body.
https://www.uco.es/investiga/grupos/ava/node/47
The Spoken Wikipedia Corpora
No. Recordings: 5,397
No. Participants: 879
File Size: 23Gb
Filetype: MP3
Language(s): English, German, Dutch
Description: This is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia
https://nats.gitlab.io/swc/
TED-LIUM
No. Recordings: 1,495
Language(s): US English
Description: 1,495 audio recordings of TED talks along with full-text transcriptions of those recordings
http://www.openslr.org/51/
Speech Commands Dataset
No. Recordings: 65,000
Language(s): US English
Description: 65,000 one-second long utterances of 30 short words, by thousands of different people
http://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
Persian Consonant Vowel Combination (PCVC) Speech Dataset
No. Recordings: 30,000
No. Participants: 217
Filetype: MAT
Language(s): Persian
Description: This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
https://github.com/S-Malek/PCVC
Arabic Speech Corpus
No. Recordings: 5439
Filetype: WAV
Language(s): Arabic
Description: Phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level
http://en.arabicspeechcorpus.com/
TIMIT
No. Recordings: 6,300
No. Participants: 630
Filetype: WAV
Language(s): US English
Description: Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences
https://github.com/philipperemy/timit/blob/master/README.md
Mivia Audio Events Dataset
No. Recordings: 6,000
Filetype: WAV
Language(s): US English
Description: 6,000 events for surveillance applications, namely glass breaking, gunshots, and screams
https://mivia.unisa.it/datasets/audio-analysis/mivia-audio-events/
Urban Sound Dataset
No. Recordings: 1,302
Filetype: WAV
Language(s): US English
Description: 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music
https://urbansounddataset.weebly.com/urbansound.html
Clotho Dataset
No. Recordings: 4,981
Filetype: MP3
Language(s): US English
Description: A novel audio captioning dataset, consisting of 4,981 audio samples, each with five captions
https://zenodo.org/record/3490684#.YQEqHVNKg-R
FSD50K
No. Recordings: 51,197
Filetype: WAV
Language(s): US English
Description: An open dataset of human-labeled sound events containing Freesound clips unequally distributed in 200 classes
https://zenodo.org/record/4060432#.X3xrgi8RqL4
Vocal Imitation Set v1.1.3
File Size: 7.6Gb
Filetype: WAV
Language(s): US English
Description: A collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound
https://zenodo.org/record/1340763#.Xlj1By2ZN24
Google Audio set
No. Recordings: 2,084,320
Filetype: WAV
Language(s): US English
Description: 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos
https://research.google.com/audioset
CALLHOME American English Speech
No. Recordings: 120
No. Participants: 240
Language(s): US English
Description: 120 unscripted 30-minute telephone conversations between native speakers of English
https://catalog.ldc.upenn.edu/LDC97S42
LibriSpeech ASR Corpus
No. Recordings: 1,000
Filetype: MP3
Language(s): US English
Description: 1,000 hours of 16kHz read English speech
https://www.openslr.org/12
Speech Accent Archive
No. Recordings: 2,140
File Size: 907Mb
Filetype: MP3
Language(s): US English
Description: Parallel English speech samples from 177 countries
https://www.kaggle.com/rtatman/speech-accent-archive
Phone Conversation Data Sample
No. Recordings: 1,822
Filetype: WAV
Language(s): Dutch, Japanese, Irish English
Description: Conversations in Dutch, Japanese, and Irish English
https://summalinguae.com/data-sets/phone-conversation-data/
Alexa Wake Word Voice Samples
No. Recordings: 24
Filetype: WAV
Language(s): US English
Description: Sample of 24 Alexa wake word recordings in four languages
https://summalinguae.com/data-sets/alexa-wake-word-data/
The LJ Speech Dataset
No. Recordings: 13,100
File Size: 2.6Gb
Filetype: CSV
Language(s): US English
Description: Public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books
https://keithito.com/LJ-Speech-Dataset/
AISHELL-2
No. Recordings: 1,000,000
No. Participants: 1,991
Language(s): Mandarin
Description: The largest free speech corpus available for Mandarin ASR research
https://github.com/kaldi-asr/kaldi/tree/master/egs/aishell2
AESDD
No. Recordings: 500
No. Participants: 5
Language(s): Greek
Description: 500 utterances by a diverse group of actors (over 5 actors) simulating various emotions
http://m3c.web.auth.gr/research/aesdd-speech-emotion-recognition/
ANAD
No. Recordings: 1,384
No. Participants: 8
File Size: 2Gb
Filetype: WAV
Language(s): Arabic
Description: 1,384 recordings by multiple speakers; 3 emotions: angry, happy, surprised
https://www.kaggle.com/suso172/arabic-natural-audio-dataset
AudioMNIST
No. Recordings: 30,000
No. Participants: 60
Filetype: MP3
Language(s): US English
Description: Consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers
https://github.com/soerenab/AudioMNIST
BAVED
No. Recordings: 1,935
No. Participants: 61
File Size: 97.8Mb
Filetype: WAV
Language(s): Arabic
Description: 1,935 recordings by 61 speakers (45 male and 16 female).
https://www.kaggle.com/a13x10/basic-arabic-vocal-emotions-dataset
CMU-MOSEI
No. Participants: 1,000
Language(s): US English
Description: 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 Emotions (happiness, sadness, anger, fear, disgust, surprise) + Likert scale.
https://www.amir-zadeh.com/datasets
CMU-MOSI
No. Recordings: 2,199
Language(s): US English
Description: 2199 opinion utterances with annotated sentiment; Sentiment annotated between very negative to very positive in seven Likert steps
https://www.amir-zadeh.com/datasets
CMU Wilderness
No. Participants: 699
Filetype: MP3
Language(s): US English
Description: Speech dataset with voice actors of many accents reciting passages from the Bible
http://festvox.org/cmu_wilderness/
CREMA-D
No. Recordings: 7,442
No. Participants: 91
File Size: 163Mb
Filetype: GIT-LFS
Language(s): US English
Description: 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities
https://github.com/CheyneyComputerScience/CREMA-D
DAPS Dataset
No. Recordings: 100
No. Participants: 20
Language(s): US English
Description: 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books
https://archive.org/details/daps_dataset
Deep Clustering Dataset
File Size: 12Mb
Filetype: WAV / Mp3 / OGG 
Language(s): US English
Description: Training deep discriminative embeddings to solve the cocktail party problem
https://www.merl.com/demos/deep-clustering
DEMoS
No. Recordings: 9697
No. Participants: 68
Language(s): US English
Description: 9,365 emotional and 332 neutral samples produced by 68 native speakers
https://zenodo.org/record/2544829
EEKK
No. Recordings: 1234
No. Participants: 10
Filetype: MP3
Language(s): Estonian
Description: 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger, and neutral
https://metashare.ut.ee/repository/download/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/
Emo-DB
No. Recordings: 500
No. Participants: 10
Language(s): German
Description: 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust
http://emodb.bilderbar.info/index-1280.html
EmoFilm
No. Recordings: 1115
Filetype: WAV
Language(s): US English
Description: 1,115 audio instances (sentences) extracted from various films
https://zenodo.org/record/1326428
Emotional Voice dataset – Nature
No. Recordings: 2519
No. Participants: 100
Language(s): US English
Description: 2,519 speech samples produced by 100 actors from 5 cultures
https://www.nature.com/articles/s41562-019-0533-6
Emov-DB
No. Recordings: 
No. Participants: 4
File Size: 1.58GB
Language(s): US English
Description: Recordings of 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepiness, anger, disgust, and amused
https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg!mYwUnI4K
EMOVO
No. Recordings: 84
No. Participants: 6
Language(s): Italian
Description: 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness
http://voice.fub.it/activities/corpora/emovo/index.html
eNTERFACE05
No. Participants: 42
File Size: 801MB
Language(s): US English
Description: Videos by 42 subjects, coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust
http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip
GEMEP corpus
No. Recordings: 145
No. Participants: 10
Filetype: MP3
Language(s): US English
Description: 10 actors portraying 10 different emotional states
https://www.unige.ch/cisa/gemep
IEMOCAP
No. Participants: 10
Filetype: WAV
Language(s): US English
Description: 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration, and neutral
https://sail.usc.edu/iemocap/iemocap_release.htm
Keio-ESD
Filetype: WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joy, disgusting, downgrading, funny, worried, gentle, relief, indignation, shame, etc.
http://research.nii.ac.jp/src/en/Keio-ESD.html
MSP-IMPROV
No. Recordings: 8,438
No. Participants: 12
Language(s): US English
Description: 20 sentences by 12 actors; 4 emotions (angry, sad, happy, neutral), plus the labels "other" and "without agreement"
https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html
MSP Podcast Corpus
No. Recordings: 62140
No. Participants: 3260
Language(s): US English
Description: 100 hours by over 100 speakers – annotated with emotional labels using attribute-based descriptors
https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html
NISQA Speech Quality Corpus
No. Recordings: 14,000
No. Participants: 3,260
Language(s): US English
Description: Includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions
https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus
OGVC
No. Recordings: 9114 
No. Participants: 4
Language(s): Japanese
Description: 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors
https://sites.google.com/site/ogcorpus/home/en
RECOLA
No. Participants: 46
Language(s): US English
Description: 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal)
https://diuf.unifr.ch/main/diva/recola/download.html
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
No. Recordings: 7,356
No. Participants: 24
File Size: 24.8Gb
Filetype: WAV
Language(s): US English
Description: 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent
https://zenodo.org/record/1188976#.XrC7a5NKjOR
SAVEE Dataset
No. Recordings: 480
No. Participants: 4
Filetype: MP4
Language(s): British English
Description: 4 male actors in 7 different emotions, 480 British English utterances in total
http://kahlan.eps.surrey.ac.uk/savee/
SEMAINE
No. Recordings: 95
No. Participants: 21
Language(s): US English
Description: 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions
https://semaine-db.eu/
ShEMO
No. Recordings: 3,000
No. Participants: 87
Filetype: WAV
Language(s): Persian
Description: Semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers
https://github.com/mansourehk/ShEMO
Spoken Commands dataset
No. Recordings: 10,000,000
File Size: 10MB per word
Language(s): US English
Description: A testbed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations
https://github.com/JohannesBuchner/spoken-command-recognition
TESS
No. Recordings: 2,800
No. Participants: 2
Filetype: WAV
Language(s): US English
Description: 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutrality.
https://tspace.library.utoronto.ca/handle/1807/24487
Thorsten dataset
No. Recordings: 22668
Filetype: WAV
Language(s): German
Description: German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
https://github.com/thorstenMueller/deep-learning-german-tts/
URDU-Dataset
No. Recordings: 400
No. Participants: 38
Filetype: WAV
Language(s): Urdu
Description: 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
https://github.com/siddiquelatif/urdu-dataset
VCTK dataset
No. Recordings: 44,000
No. Participants: 110
File Size: 10.94GB
Filetype: TXT
Language(s): US English
Description: 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
https://datashare.is.ed.ac.uk/handle/10283/3443
VIVAE
No. Recordings: 1,085
No. Participants: 12
File Size: 93.5MB
Filetype: VIVAE
Language(s): US English
Description: 1,085 non-speech audio files by ~12 speakers; 6 emotions: achievement, anger, fear, pain, pleasure, and surprise, at 4 emotional intensities (low, moderate, strong, peak).
https://zenodo.org/record/4066235
VoxPopuli
No. Recordings: 400,000
File Size: 6.4T
Filetype: WAV
Language(s): Multilingual (23 languages)
Description: 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16×15 directions.
https://github.com/facebookresearch/voxpopuli

Video

Twenty Billion Neurons Crowd Acting video dataset collection
No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: Large-scale Human-centric Video Analysis in Complex Events
https://20bn.com/products/datasets
The VIRAT Video Dataset
No. Recordings: 262
File Size: 12MB
Filetype: PDF
Language(s): US English
Description: The VIRAT Video Dataset is designed to be more realistic, natural, and challenging for video surveillance domains than existing action recognition datasets in terms of its resolution, background clutter, diversity in scenes, and human activity/event categories
https://viratdata.org/
The WebVid-10M Dataset
No. Recordings: 10700000
File Size: 2.5MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset of short videos with textual descriptions sourced from the web
https://m-bain.github.io/webvid-dataset/
The MECCANO Dataset
No. Recordings: 73206
No. Participants: 93
File Size: 32.3GB
Filetype: MP4
Language(s): US English
Description: The first dataset of egocentric videos to study human-object interactions in industrial-like settings.
https://iplab.dmi.unict.it/MECCANO/
Moments In Time
No. Recordings: 1,000,000
File Size: 150MB
Filetype: MP4
Language(s): US English
Description: A large-scale dataset for recognizing and understanding action in videos
http://moments.csail.mit.edu/
Something Something Dataset
No. Recordings: 220847
File Size: 19.4GB
Filetype: WEBM
Language(s): US English
Description: A large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects
https://20bn.com/datasets/something-something
BDD100K
No. Recordings: 100000
File Size: 3.9GB
Filetype: MP4
Language(s): US English
Description: Comprises ten tasks and 100K videos to estimate the progress of image recognition algorithms on autonomous driving
https://github.com/bdd100k/bdd100k
Kinetics-700
No. Recordings: 650,000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes.
https://deepmind.com/research/open-source/kinetics
Casual Conversations Dataset
No. Recordings: 45,186
No. Participants: 3011
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: More than 45,000 videos (3,011 participants), intended for assessing the performance of already trained models in computer vision and audio applications
https://ai.facebook.com/datasets/casual-conversations-dataset/
VoxCeleb
No. Recordings: 1,000,000
No. Participants: 7,000
File Size: 133MB
Filetype: MP4
Language(s): US English
Description: An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube
https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
TV Human Interaction Dataset
No. Recordings: 300
File Size: 156MB
Filetype: MP4
Language(s): US English
Description: 300+ videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss
http://www.robots.ox.ac.uk/~alonso/tv_human_interactions.html
THUMOS Dataset
No. Recordings: 25,000,000
File Size: 385KB
Filetype: MP4
Language(s): US English
Description: A large collection of video clips of different kinds; the dataset can be used for action classification
https://www.crcv.ucf.edu/THUMOS14/home.html
50 Salads Dataset
No. Participants: 25
File Size: 31GB
Filetype: RGB
Language(s): US English
Description: Fully annotated 4.5-hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each.
http://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/
YoutubeFace
No. Recordings: 3425
No. Participants: 1595
Filetype: MP4
Language(s): US English
Description: A database of face videos designed for studying the problem of unconstrained face recognition in videos.
http://www.cs.tau.ac.il/~wolf/ytfaces/
PaSc
No. Recordings: 9376
No. Participants: 293
Language(s): US English
Description: A facial recognition dataset of 9,376 still images and 2,802 videos of 293 people
https://www.nist.gov/publications/challenge-face-recognition-digital-point-and-shoot-cameras
iQIYI-VID
No. Recordings: 600000
No. Participants: 5000
Filetype: MP4
Language(s): US English
Description: The largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities.
https://arxiv.org/pdf/1811.07548.pdf
COIN
No. Recordings: 11827
File Size: 8.47MB
Filetype: JSON
Language(s): US English
Description: 11,827 videos related to 180 different tasks, which were all collected from YouTube
https://coin-dataset.github.io/
CityScapes
No. Recordings: 25000
File Size: 51.92GB
Filetype: JPG
Language(s): US English
Description: A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities
https://www.cityscapes-dataset.com/dataset-overview/
AVA-Kinetics Dataset
No. Recordings: 3650000
No. Participants: 39000
File Size: 7.7MB
Filetype: CSV
Language(s): US English
Description: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity.
https://research.google.com/ava/index.html
Activity Net
No. Recordings: 20,194
File Size: 600GB
Filetype: JSON
Language(s): US English
Description: A Large-Scale Video Benchmark for Human Activity Understanding
http://activity-net.org/
Kinetics
No. Recordings: 650000
File Size: 24.3MB
Filetype: MP4
Language(s): US English
Description: A collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.
https://deepmind.com/research/open-source/kinetics
Yahoo-Flickr Creative Commons 100 Million Dataset
No. Recordings: 100000000
File Size: 15GB
Filetype: MP4
Language(s): US English
Description: The YFCC100M is the largest publicly and freely usable multimedia collection, containing around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses
http://multimediacommons.org/
UMDFaces
No. Recordings: 4067888
No. Participants: 11377
File Size: 173MB
Filetype: MP4
Language(s): US English
Description: UMDFaces is a face dataset divided into two parts: Still Images – 367,888 face annotations for 8,277 subjects and Video Frames – Over 3.7 million annotated video frames from over 22,000 videos of 3100 subjects.
https://www.umdfaces.io/
Condensed Movies
No. Recordings: 462,000
File Size: 250GB
Filetype: MP4
Language(s): US English
Description: A large-scale video dataset, featuring clips from movies with detailed captions
https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/
AVSpeech
No. Recordings: 290,000
File Size: 128MB
Filetype: MP4
Language(s): US English
Description: AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises
https://looking-to-listen.github.io/avspeech/
EyeC3D
No. Participants: 21
File Size: 3.9GB
Language(s): US English
Description: 3D video eye tracking dataset
https://www.epfl.ch/labs/mmspg/downloads/eyec3d/
MoVi
No. Recordings: 1890
No. Participants: 90
File Size: 1.3MB
Filetype: MP4
Language(s): US English
Description: A large multi-purpose human motion and video dataset
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253157
Thör
No. Recordings: 22668
Filetype: WAV
Language(s): US English
Description: A public dataset of human motion trajectories, recorded in a controlled indoor experiment.
http://thor.oru.se/
SEWA
No. Participants: 398
Filetype: WAV
Language(s): US English
Description: More than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
https://db.sewaproject.eu/

Other Languages

The SIWIS French Speech Synthesis Database
No. Recordings: 9,750
File Size: 2.671Gb
Filetype: .WAV
Language(s): French
Description: High-quality French speech recordings and associated text files, aimed at building TTS systems and investigating multiple styles and emphasis
https://datashare.ed.ac.uk/handle/10283/2353
TCOF : Traitement de Corpus Oraux en Français
No. Recordings: 626
Filetype: .WAV
Language(s): French
Description: The corpus made available includes two main categories: recordings of adult-child interactions (children up to 7 years old) and recordings of interactions between adults. The recordings are of various durations: from 5 to 45 minutes or more. 
https://www.ortolang.fr/market/corpora/tcof
African Accented French
No. Participants: 84
File Size: 1.8Gb
Filetype: .WAV
Language(s): French
Description: This corpus consists of approximately 22 hours of speech recordings. It has recordings from 84 speakers, 48 male, and 36 female.
http://www.openslr.org/57/
Fisher Spanish Speech
No. Participants: 136
No. Recordings: 819
Filetype: .WAV
Language(s): Spanish
Description: This corpus consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
https://catalog.ldc.upenn.edu/LDC2010S01
CallFriend – Spanish Corpus
No. Participants: 120
No. Recordings: 60
Filetype: .WAV
Language(s): Spanish
Description: The CallFriend Spanish corpus of telephone speech was collected by the Linguistic Data Consortium, primarily in support of the Language Identification (LID) project sponsored by the U.S. Department of Defense. It consists of 60 unscripted telephone conversations between native speakers of Spanish for each dialect group
https://ca.talkbank.org/access/CallFriend/spa.html
TV3Parla
File Size: 27.6Gb 
Filetype: .WAV
Language(s): Catalan
Description: This corpus includes 240 hours of Catalan speech from broadcast material.
http://laklak.eu/share/tv3_0.3.tar.gz
emotiontts_open_db
Filetype: .WAV
Language(s): Korean
Description: Recordings and their associated transcriptions by a diverse group of speakers covering 4 emotions: general, joy, anger, and sadness.
https://github.com/emotiontts/emotiontts_open_db
Pansori TEDxKR
No. Participants: 41
No. Recordings: 60
File Size: 174Mb 
Filetype: .FLAC
Language(s): Korean
Description: The Pansori TEDxKR Corpus is a Korean speech recognition (ASR) corpus generated from Korean language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of speech audio-transcript pairs from 41 speakers. This corpus was generated by using a new corpus data ingestion and processing system called Pansori.
http://www.openslr.org/58/
EMOVO
No. Participants: 6
No. Recordings: 84
File Size: 237Mb 
Filetype: .WAV
Language(s): Italian
Description: This dataset consists of 6 actors who recite 14 sentences in 6 different emotions: disgust, fear, anger, joy, surprise, sadness.
http://voice.fub.it/activities/corpora/emovo/index.html
Online gaming voice chat corpus (OGVC)
No. Participants: 17
No. Recordings: 2,656
Filetype: .WAV
Language(s): Japanese
Description: This speech material contains 2,656 acted utterances spoken by four professional actors (two male and two female). 17 short dialogues were selected from the dialogues recorded for the naturalistic emotional speech. The actors were instructed to speak each utterance in the short dialogue with a specific emotion at three different levels of emotional intensity.
http://research.nii.ac.jp/src/en/OGVC.html
Keio University Japanese Emotional Speech Database (Keio-ESD)
No. Participants: 1
Filetype: .WAV
Language(s): Japanese
Description: A set of human speech with vocal emotion spoken by a Japanese male speaker and a set of artificial speech that were synthesized by a system that had been developed using the subset of this database for training.
http://research.nii.ac.jp/src/en/Keio-ESD.html
NST Danish ASR Database
No. Participants: 616
No. Recordings: 229,992
Filetype: .WAV
Language(s): Danish
Description: This database was created by Nordic Language Technology for the development of automatic speech recognition and dictation in Danish.
https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/
NST Danish Dictation
No. Participants: 151
No. Recordings: 34,955
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for dictation.
https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/
NST Danish Speech Synthesis
No. Participants: 1
No. Recordings: 4,108
Filetype: .WAV
Language(s): Danish
Description: This database contains speech data for Danish, made for speech synthesis.
https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/
FT Speech
No. Participants: 434 
No. Recordings: 1,017,244
Filetype: .WAV
Language: Danish
Description: FT Speech is a new speech corpus created from the recorded meetings of the Danish Parliament, also known as the Folketing (FT). It contains over 1,800 hours of transcribed speech by a total of 434 speakers, which are partitioned into five subsets with no speaker overlap between train, development, and test data.
https://ftspeech.dk
FalaBrasil-LaPS Benchmark
No. Participants: 35 
No. Recordings: 700
Filetype: .WAV
Language: Portuguese 
Description: LaPS is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. Contains 35 speakers (10 females), each one pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audios were recorded at 22.05 kHz without environment control.
https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo
M-AILABS Polish Corpus
No. Participants: 35 
No. Recordings: 700
File Size: 110Gb 
Filetype: .WAV
Language: Polish 
Description: The M-AILABS Speech Dataset is the first large dataset that we are providing free-of-charge, freely usable as training data for speech recognition and speech synthesis.
Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly 1,000 hours of audio and text files in a prepared format. The texts were published between 1884 and 1964, and are in the public domain. 
http://www.caito.de/data/Training/stt_tts/pl_PL.tgz
Estonian
No. Participants: 10 
No. Recordings: 1,040
Filetype: .WAV
Language: Estonian 
Description: 26 text passages read by 10 speakers, covering 4 main emotions: joy, sadness, anger, and neutral.
https://metashare.ut.ee/repository/download/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/
AESDD
No. Participants: 10 
No. Recordings: 500
File Size: 391Mb
Filetype: .WAV
Language: Greek 
Description: The Acted Emotional Speech Dynamic Database (AESDD) is a publicly available speech emotion recognition dataset. It contains utterances of acted emotional speech in the Greek language. The dataset consists of 500 utterances recorded by a diverse group of actors covering 5 different emotions: anger, disgust, fear, happiness, and sadness.
http://m3c.web.auth.gr/research/aesdd-speech-emotion-recognition/
Microsoft Speech Corpus (Indian languages)
No. Recordings: 124,599
Filetype: .WAV
Languages: Telugu; Tamil; Gujarati
Description: The Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for the Telugu, Tamil, and Gujarati languages. The data package includes audio and corresponding transcripts.
https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e
Tunisian
No. Participants: 118
File Size: 391Mb
Filetype: .WAV
Language: Arabic (Modern Standard Arabic)
Description: Tunisian Modern Standard Arabic (MSA) speech, totaling 11.2 hours of recordings from 118 speakers.
http://www.openslr.org/46/
AISHELL-1
No. Participants: 400
File Size: 15Gb
Filetype: .WAV
Language: Mandarin
Description: AISHELL-1 is an open-source Mandarin Chinese speech corpus published by Beijing Shell Shell Technology Co., Ltd. A total of 400 speakers from different accent areas in China took part in the recording, which was conducted in a quiet indoor environment using high-fidelity microphones, with the audio downsampled to 16 kHz. Manual transcription accuracy is above 95%, achieved through professional speech annotation and strict quality inspection. The data is free for academic use.
http://www.openslr.org/33/
Malayalam Speech Corpus
No. Participants: 75
File Size: 326Mb
Filetype: .WAV
Language: Malayalam
Description: The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic Speech Recognition (ASR) research and consists of 250 hours of agricultural speech data involving 3 female, 12 male, and 60 unidentified participants.
https://releases.smc.org.in/msc-reviewed-speech/
Google Malayalam
No. Participants: 24
File Size: 1.345Gb
Filetype: .WAV
Language: Malayalam
Description: This data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers.
http://www.openslr.org/63/
Multilingual LibriSpeech (MLS)
File Size: 3Tb
Filetype: .WAV
Languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish
Description: Multilingual LibriSpeech (MLS) is a large-scale, open-source dataset released by Facebook AI to help advance research in automatic speech recognition (ASR). MLS is designed to help the speech research community work in languages beyond just English so people around the world can benefit from improvements in a wide range of AI-powered services.
https://ai.facebook.com/blog/a-new-open-data-set-for-multilingual-speech-research/
The BABEL Project
Filetype: .WAV
Languages: Bulgarian, Estonian, Hungarian, Polish, and Romanian
Description: BABEL was a joint European project under the COPERNICUS scheme comprising partners from a number of Eastern and Western European research centers. BABEL has produced a multi-language database comprising five of the most widely differing Eastern European languages.
http://www.reading.ac.uk/AcaDepts/ll/speechlab/babel/
Living Audio Dataset
Languages: Dutch, English, Irish, Russian
Description: A “Crowd-Built” continuously growing speech dataset with transcripts. The dataset contains multiple languages and is intended for anyone to be able to add to it.
https://github.com/Idlak/Living-Audio-Dataset
Microsoft Speech Language Translation Corpus
No. Recordings: 61,270
File Size: 326Mb
Filetype: .WAV
Languages: English, Chinese, Japanese
Description: The Microsoft Speech Language Translation Corpus release contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese collected by Microsoft Research. The package includes audio data, transcripts, and translations and allows end-to-end testing of spoken language translation systems on real-world data.
https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187
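
Most of the speech corpora above are distributed as .WAV files, and several descriptions quote a sample rate (22.05 kHz for the LaPS benchmark, 16 kHz for AISHELL-1). After downloading a dataset, it can be worth checking that the files match what the listing says. The snippet below is a minimal sketch of one way to do that with Python's standard-library wave module. The function name describe_wav and the dictionary keys are our own illustrative choices, not part of any dataset's tooling, and the sketch assumes uncompressed PCM .WAV files, which is all the wave module is designed to read.

```python
import wave


def describe_wav(path):
    """Return basic header information for an uncompressed PCM .WAV file.

    Hypothetical helper for illustration only; the keys below are
    arbitrary names chosen for this sketch.
    """
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": sample_rate,
            "sample_width_bytes": wav_file.getsampwidth(),
            # Duration in seconds = number of frames / frames per second.
            "duration_s": frame_count / sample_rate,
        }
```

Calling describe_wav on a file from a downloaded corpus (the path is whatever location you extracted the archive to) returns the channel count, sample rate, sample width, and duration, which you can compare against the figures in the dataset's description. Summing duration_s over every file gives a rough check on the total hours a corpus claims to contain.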

Conclusion

We hope this list helps you find a dataset for your project, or at least gives you a sense of the many options available.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us here.


Twine

Twine's platform curates the best quality creative freelancers to grow your business, saving time and money whilst ensuring quality results on your projects.