7 Best Portuguese Language Speech Datasets of 2022

Portuguese is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Portuguese language speech datasets with a specific dialect or type of speech to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Portuguese Language speech datasets.

Are you ready?

Let’s dive into our list of the best Portuguese Language speech datasets in 2022.


Here are our top picks for Portuguese Language speech datasets:

1. How2 Dataset

Created by Sanabria et al. in 2018, the How2 Dataset is a collection of instructional videos covering a wide variety of topics. It contains about 2,000 hours of video clips, with word-level time alignments to the ground-truth English subtitles. Of these, 300 hours were translated into Portuguese subtitles. Languages: Portuguese and English.

Access the dataset

2. Portuguese SQuAD V1.1 Dataset

Created by Carvalho et al. in 2019, the Portuguese SQuAD v1.1 Dataset is a Portuguese translation of the SQuAD dataset. The translation was performed using the Google Cloud API. It contains about 100,000 question-answer pairs in the Portuguese language, in JSON file format.

Access the dataset
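If you want to inspect a SQuAD-style JSON file before training, a minimal sketch like the one below may help. It assumes the translated file keeps the original SQuAD v1.1 layout (data, paragraphs, context, qas, answers). The inline sample and the helper names iter_examples and misaligned are made up for illustration and are not part of the dataset. Because the text was machine translated, it may be worth checking that each answer still sits at its recorded character offset.

```python
import json

# Tiny inline sample in the SQuAD v1.1 layout. The content is invented for
# illustration; in practice you would json.load() the downloaded file instead.
SAMPLE = """
{
  "version": "1.1",
  "data": [
    {
      "title": "Lisboa",
      "paragraphs": [
        {
          "context": "Lisboa é a capital de Portugal.",
          "qas": [
            {
              "id": "q1",
              "question": "Qual é a capital de Portugal?",
              "answers": [{"text": "Lisboa", "answer_start": 0}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_examples(squad):
    """Yield (question, context, answer_text, answer_start) tuples."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (qa["question"], context,
                           answer["text"], answer["answer_start"])


def misaligned(squad):
    """Return the questions whose answer text is not found at answer_start."""
    bad = []
    for question, context, text, start in iter_examples(squad):
        if context[start:start + len(text)] != text:
            bad.append(question)
    return bad


squad = json.loads(SAMPLE)
for question, context, text, start in iter_examples(squad):
    print(question, "->", text, "(offset", start, ")")
print("misaligned answers:", len(misaligned(squad)))
```

Filtering out or re-aligning any questions reported by misaligned is one possible cleanup step before using the data for extractive question answering.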

3. OSCAR Corpus Dataset

Created by Pedro et al. in 2020, the BrWaC (Brazilian Portuguese Web as Corpus) or OSCAR Corpus is a large corpus constructed in the authors' lab following the WaCky framework and made public for research purposes. It is in the Portuguese language, in text file format.

Access the dataset

4. DNLT-BP Dataset

The Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP) contain data collected from participants in clinical or academic studies and research. Participants read and signed the Informed Consent Form, and the research was evaluated and approved by the Research Ethics Committees of the institutions to which they are linked. The data is in the Portuguese language, in text file format.

Access the dataset

5. PortugueseGLUE Dataset

The PortugueseGLUE Dataset contains a Portuguese translation of the GLUE benchmark and the SciTail dataset, produced using the OPUS-MT model and Google Cloud Translation. It is in the Portuguese language, in text file format.

Access the dataset

6. TweetSentBR Dataset

Created in 2020, the TweetSentBR Dataset is a sentiment polarity classification dataset with 800k tweets in Portuguese, divided into positive, negative, and neutral classes. It is in the Portuguese language, in text file format.

Access the dataset

7. B5 Corpus Dataset

Created by Ramos et al. in 2018, the B5 Corpus Dataset is a collection of Facebook posts by Brazilian authors, with author information such as gender, age, personality score (based on the B5 test), education level, political position, religion, and others. It is in the Portuguese language and contains 1012 entries in CSV file format.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Portuguese language speech datasets for your projects:

  1. How2 Dataset
  2. Portuguese SQuAD V1.1 Dataset
  3. OSCAR Corpus Dataset
  4. DNLT-BP Dataset
  5. PortugueseGLUE Dataset
  6. TweetSentBR Dataset
  7. B5 Corpus Dataset

We hope that this list has either helped you find a dataset for your project or shown you the myriad of options available to you.

If there are any datasets you would like us to add to the list, please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.