Natural language processing (NLP) underpins many AI/ML models, and training those models well usually requires a large amount of data. For those interested in NLP speech datasets, Twine has brought together our top selection, so you don’t have to go looking.
Are you ready?
Let’s dive into our list of the best NLP speech datasets in 2022.
Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.
Here are our top picks for NLP speech datasets:
1. Biggest Audiobook NLP Dataset
The LJ Speech Dataset contains audiobook passages of varying length. These audio segments come from a single speaker, with transcriptions provided.
- Over 13,000 short audiobook passages
- Verified by a human reader
- Includes transcription of audio
- The LibriSpeech ASR Corpus Dataset contains 1,000 hours of read English speech from audiobooks.
- The Libri-Light Dataset contains unlabelled speech from audiobooks in English. The dataset is in JSON file format.
2. Best Speech Recognition NLP Dataset
The TIMIT Acoustic-Phonetic Continuous Speech Dataset contains spoken American-English, for the development and evaluation of automatic speech recognition systems.
- 630 speakers of eight major American English dialects
- 16-bit, 16 kHz speech waveform file for each utterance
- Includes transcriptions, which have been hand-verified
- The VoxForge Dataset is open-source, with all audio files submitted under the GPL license.
- The MISC Dataset was created by Microsoft in 2018, and contains recordings of information-seeking conversations between human “seekers” and “intermediaries”.
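The 16-bit, 16 kHz figure quoted for TIMIT above determines how much storage each second of audio needs. Here is a minimal sketch of that arithmetic using Python's standard `wave` module. It assumes mono audio (the list above doesn't say) and works on a synthetic silent clip, not on real TIMIT files, which may need converting to WAV first.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000   # 16 kHz, as quoted above
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # assumption: mono audio


def make_silent_wav(seconds: float) -> bytes:
    """Build an in-memory WAV file of silence with the parameters above."""
    n_frames = int(seconds * SAMPLE_RATE_HZ)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00" * (n_frames * SAMPLE_WIDTH_BYTES * CHANNELS))
    return buffer.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read back the sample rate, bit depth, channels and duration."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


info = describe_wav(make_silent_wav(3.0))
bytes_per_second = SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS
print(info)
print(f"{bytes_per_second} bytes of raw audio per second")
```

Running the same `describe_wav` check over your own files is a quick way to confirm that a dataset matches the format your model expects before training.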
3. Best Multilingual NLP Speech Dataset
The Multilingual Corpus of Sentence-Aligned Spoken Utterances (MaSS) Dataset was created by Boito et al. in 2020, and provides detailed readings of the New Testament. The vast size of this dataset potentially allows models to be built for 700 languages.
- 8,130 parallel spoken utterances across 8 languages (56 language pairs)
- Languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian, Spanish
- The CoVoST Dataset is a multilingual speech-to-text translation corpus covering translations from 21 languages into English and from English into 15 languages. The overall speech duration is 2,880 hours.
- The OSCAR Dataset (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual dataset obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. It incorporates 138 GB of text.
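The "56 language pairs" quoted for MaSS above follows from counting ordered (source, target) pairs among its 8 languages: 8 × 7 = 56. A short sketch of that count is below; the variable and function names are ours, for illustration only.

```python
from itertools import permutations

MASS_LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


pairs = directed_pairs(MASS_LANGUAGES)
print(len(pairs))  # 56
```

The pairs are directed, so English to French and French to English are counted separately. This matters if you are planning one translation model per direction.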
4. Best Language Modelling NLP Speech Dataset
The Clotho Dataset is built with a focus on audio content and caption diversity, and its data splits are designed so that they do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English-speaking countries.
- 4,981 audio samples of 15 to 30 seconds in duration
- 24,905 captions of 8 to 20 words in length
- Gender-balanced participants
- The MetaQA Dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written in natural language: 1-hop, 2-hop, and 3-hop queries.
- TED-LIUM 3 is an audio dataset that contains 452 hours of audio, with over 15,900 pronunciations, from 2,351 talks in NIST SPHERE format (SPH).
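The Clotho figures above also tell you roughly how much data you are working with. The sketch below derives the number of captions per clip and a lower and upper bound on total audio duration. It assumes captions are spread evenly across clips and that every clip falls within the quoted 15 to 30 second range.

```python
N_CLIPS = 4981
N_CAPTIONS = 24905
MIN_CLIP_S, MAX_CLIP_S = 15, 30

# Captions per clip, assuming they are spread evenly across clips.
captions_per_clip, leftover = divmod(N_CAPTIONS, N_CLIPS)


def to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600


# Bounds on total audio, if every clip were the shortest or longest length.
min_hours = to_hours(N_CLIPS * MIN_CLIP_S)
max_hours = to_hours(N_CLIPS * MAX_CLIP_S)

print(f"{captions_per_clip} captions per clip, {leftover} left over")
print(f"between {min_hours:.1f} and {max_hours:.1f} hours of audio")
```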
5. Best English NLP Speech Dataset
The 2000 HUB5 English Dataset contains transcripts originally derived from 40 English-language telephone conversations. It also contains a slew of speech files for NLP.
- Conversational speech over the telephone
- 20 telephone conversations between native English speakers
- The Free Spoken Digit Dataset is open – it contains English pronunciations from 6 speakers, leading to 3,000 recordings (50 of each digit per speaker).
- The TIMIT Dataset is designed for the development of automatic speech recognition systems, featuring over 600 unique American-English speakers reading from 10 ‘phonetically rich’ passages.
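The Free Spoken Digit Dataset total above is simple arithmetic: 6 speakers × 10 digits × 50 recordings = 3,000. The sketch below checks that count and groups recordings by digit. It assumes files are named `{digit}_{speaker}_{index}.wav` (check the dataset's README before relying on this), and the speaker names are hypothetical placeholders.

```python
from collections import Counter
from itertools import product

SPEAKERS = [f"speaker{i}" for i in range(6)]  # hypothetical placeholder names
DIGITS = range(10)
REPETITIONS = 50  # recordings of each digit per speaker


def expected_filenames():
    """List the filenames implied by the assumed naming scheme."""
    return [
        f"{digit}_{speaker}_{index}.wav"
        for digit, speaker, index in product(DIGITS, SPEAKERS, range(REPETITIONS))
    ]


def parse_filename(name: str):
    """Split an assumed '{digit}_{speaker}_{index}.wav' name into its parts."""
    stem = name.rsplit(".", 1)[0]
    digit, speaker, index = stem.split("_")
    return int(digit), speaker, int(index)


names = expected_filenames()
per_digit = Counter(parse_filename(name)[0] for name in names)

print(len(names))
print(per_digit[7])
```

A per-digit count like this is a handy sanity check after downloading: if any class comes up short, some files are missing or misnamed.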
6. Best International NLP Speech Dataset
The MuST-C Dataset is a speech translation corpus containing 385 hours of TED talks. Its aim is to facilitate the training of end-to-end systems for spoken language translation (SLT) from English into 8 languages.
- Languages featured: Dutch, French, German, Italian, Portuguese, Romanian, Russian, & Spanish
- 385 Hours
- Male & female speakers
- The Europarl-ST Dataset contains paired audio-text samples for speech translation, constructed from debates held in the European Parliament between 2008 and 2012. It covers 6 European languages: German, English, Spanish, French, Italian and Portuguese.
- The Microsoft Speech Language Translation Corpus (MSLT) Dataset contains conversational, bilingual speech test and tuning data for English, Chinese, and Japanese. Includes audio, transcripts, and translations.
To conclude, here are our top picks for the best NLP speech datasets for your projects:
- Biggest Audiobook NLP Speech Dataset: LJ Speech Dataset
- Best Speech Recognition NLP Dataset: TIMIT Acoustic-Phonetic Continuous Speech Dataset
- Best Multilingual NLP Speech Dataset: MaSS Dataset
- Best Language Modelling NLP Speech Dataset: Clotho Dataset
- Best English NLP Speech Dataset: 2000 HUB5 English Dataset
- Best International NLP Speech Dataset: MuST-C Dataset
We hope that this list has either helped you find a dataset for your project or shown you the myriad of options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.