100+ Speech Datasets

Speech datasets are among the most sought-after datasets by AI/ML professionals. 

Despite their popularity, speech datasets aren’t always easy to find in the wild. Because you need data to train your models, it’s important that you get the requirements right.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best speech datasets and provided an up-to-date archive. We’ve also categorized each dataset, so you can navigate through this list with ease.

Are you ready to find a dataset to suit your project?

Let’s dive in.


Here are our top picks for Speech Datasets:

Languages:

Indonesian Datasets

Holds multiple dataset topics including automobile platforms, Wiki revision history, news websites, comments and reviews from online sources, Twitter texts, and monolingual data processing.

Japanese Datasets

Holds multiple dataset topics including linguistic phenomena, monolingual data, Japanese documentation, parallel sentence analysis, handwriting training, and machine reading comprehension.

Portuguese Datasets

Holds multiple dataset topics including instructional videos, Portuguese translation, public research, Twitter texts, and Facebook posts.

Arabic Datasets

Holds multiple dataset topics including YouTube content, handwriting analysis, dialect speech data, telephone conversations, and transcribed scripted speech.

German Datasets

Holds multiple dataset topics including image and text pairing, media transcriptions, news articles, monolingual data, novel citations, and emotion classification.

Indian Datasets

Holds multiple dataset topics including conversational speech training, dialect training, isolated word samples, and text data analysis. 

French Datasets

Holds multiple dataset topics including speech style analysis, monolingual data, pronunciation transcription, telephone conversations, Wiki articles, and command and query speech.

Spanish Datasets

Holds multiple dataset topics including conversational speech, audio transcription, weather recordings, telephone conversations, and speech development analysis.

English Datasets

Holds multiple dataset topics including conversational speech, old movie speech data, telephone conversations, pronunciation transcription, speech recognition, linguistic speech analysis, and dialect speech data.

Dialects:

Mandarin Datasets

Holds multiple dataset topics including speech recognition, emotional speech analysis, YouTube and podcast speech data, sentence transcription, automatic speech scoring, and news broadcasting speech.

Miscellaneous:

Natural Language Processing Datasets

Holds multiple dataset topics including audiobook passages, speech recognition, dialect speech analysis, detailed reading of the New Testament, multilingual speech-to-text translation, language classification, caption annotation, telephone conversations, and TED Talk transcripts.


Wrapping up

We hope this list has helped you find a dataset for your project and shown you the myriad of options available to you.

If there are any datasets you would like us to add to the list, let us know here.

If you would like to find out more about building a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.