The Best Bengali Language Datasets of 2022

Bengali is one of the most widely spoken languages in the world. Even so, it’s not always easy to find Bengali language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Bengali language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Bengali language datasets:

Bengali Text to Speech Dataset

This dataset contains high-quality, multi-speaker transcribed audio data for Bengali. There are two zip files, one for each locale, and each contains a line_index.tsv file and the wave files. The line index lists a file ID and the transcription for each recording. It has been manually quality-checked, but there might still be errors.
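As a rough illustration of how the line index might be read, here is a minimal Python sketch. It assumes each row of line_index.tsv holds a file ID and a transcription separated by a tab, and that each wave file is named after its file ID with a .wav extension. Neither detail is confirmed by the description above, so check the files you download. The function names and the sample rows are hypothetical.

```python
from pathlib import Path


def read_line_index(lines):
    """Map each file ID to its transcription.

    Assumes one "<fileID><TAB><transcription>" pair per line.
    """
    index = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t", 1)
        if len(parts) != 2:
            continue  # skip rows that do not match the assumed layout
        file_id, transcription = parts[0].strip(), parts[1].strip()
        index[file_id] = transcription
    return index


def wav_path_for(file_id, audio_dir):
    """Assumed naming convention: <fileID>.wav inside audio_dir."""
    return Path(audio_dir) / f"{file_id}.wav"


# Made-up rows for illustration only; real file IDs will look different.
sample_lines = [
    "ben_0001\tআমি বাংলায় গান গাই\n",
    "\n",
    "ben_0002\tতোমার নাম কী\n",
]
index = read_line_index(sample_lines)
```

With a real download, you could pass an open file handle instead of the sample list, for example open("line_index.tsv", encoding="utf-8"), assuming the file is UTF-8 encoded.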

Access the dataset

Numta Handwritten Bengali Digits

The dataset is a compilation of six datasets that were gathered from different sources. Each of them was checked rigorously under the same criteria, so that all digits were legible to at least one human being without any prior knowledge. The initial release of the NumtaDB dataset was used for the Bengali.AI Computer Vision Challenge. The testing set was found to contain some illegible and ambiguous digits, which have been replaced by legible digits with the same label.

Access the dataset

Bengali Hate Speech

This resource introduces three datasets covering expressions of hate, commonly used topics, and opinions. They are intended for hate speech detection, document classification, and sentiment analysis, respectively.

Access the dataset

KU-BdSL Sign Language Dataset

The KU-BdSL Sign Language Dataset includes three variants of data. The dataset consists of images representing single-hand gestures for BdSL alphabets. The images were captured with several smartphones from 33 participants (25 males and 8 females). Each version includes 30 classes that represent the 39 consonants (‘shoroborno’) of Bengali alphabets. There are 1,500 images in jpg format in each variant. The images were captured on flat surfaces at different times of the day to vary the brightness and contrast.
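Before training on a variant, it can help to check that the class and image counts match the figures above. The Python sketch below counts jpg files per class. It assumes, purely for illustration, that each class has its own folder and that the folder name is the label. The actual archive layout may differ, and the paths and function name shown here are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def images_per_class(image_paths):
    """Count .jpg files per class, using the parent folder name as the label."""
    counts = Counter()
    for raw in image_paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() == ".jpg":
            counts[path.parent.name] += 1
    return counts


# Synthetic paths for illustration; these folder names are not from the dataset.
paths = [
    "variant_1/class_01/a.jpg",
    "variant_1/class_01/b.jpg",
    "variant_1/class_02/c.jpg",
    "variant_1/readme.txt",  # non-image files are ignored
]
counts = images_per_class(paths)
```

Run against one real variant, len(counts) should come to 30 and sum(counts.values()) to 1,500 if the download matches the description.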

Access the dataset

BanglaLM Dataset

This dataset contains content curated from social media, blogs, newspapers, wiki pages, and other similar resources. It holds 19,132,010 samples, ranging in length from 3 to 512 words. The dataset can be used to build unsupervised machine learning models for NLP tasks involving the Bengali language.
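If you mix in your own text and want it to follow the same 3 to 512 word range, a simple length filter is enough. The Python sketch below counts words by splitting on whitespace, which is an assumption. The description does not say how the dataset's authors counted words. The function names and sample strings are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def within_length_bounds(text, min_words=3, max_words=512):
    """Return True if the sample falls inside the stated word range."""
    return min_words <= word_count(text) <= max_words


# Made-up samples for illustration only.
samples = ["আমি ভাত খাই", "হ্যাঁ", "   "]
kept = [s for s in samples if within_length_bounds(s)]
```

Only the first sample is kept here, since the other two contain fewer than three words.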

Access the dataset

Wrapping up

To conclude, here are our top picks for the best Bengali language datasets for your projects:

  1. Bengali Text to Speech Dataset
  2. Numta Handwritten Bengali Digits
  3. Bengali Hate Speech
  4. KU-BdSL Sign Language Dataset
  5. BanglaLM Dataset

We hope this list has either helped you find a dataset for your project or shown you the range of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.