The Best Burmese Language Datasets of 2022

Burmese is the official language of Myanmar and is spoken by tens of millions of people. Even so, it’s not always easy to find Burmese language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Burmese Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Burmese Language datasets:

CC100-Burmese Dataset

Created in 2020, the CC100-Burmese dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The corpus is 46M in size and is exclusively in the Burmese language. It contains text files.
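If you want to work with the text files directly, a minimal Python sketch like the one below may help. It assumes that each line of the file is one paragraph and that a blank line separates documents; check the file you download before relying on that layout. The function name iter_documents and the file name my.txt are hypothetical.

```python
def iter_documents(path):
    """Yield each document as a list of paragraph strings.

    Assumption: one paragraph per line, blank line between documents.
    """
    paragraphs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                paragraphs.append(line)
            elif paragraphs:
                # A blank line closes the current document.
                yield paragraphs
                paragraphs = []
    # Emit the last document if the file does not end with a blank line.
    if paragraphs:
        yield paragraphs


if __name__ == "__main__":
    # "my.txt" is a placeholder for wherever you saved the corpus.
    total = sum(1 for _ in iter_documents("my.txt"))
    print(f"{total} documents")
```

Because the function reads one line at a time, it should cope with a large corpus without loading the whole file into memory.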

Access the dataset

Crowdsourced Burmese Speech Dataset

This dataset contains transcribed high-quality audio of Burmese sentences recorded by volunteers. The dataset consists of wave files and a TSV file (line_index.tsv), which lists an anonymized FileID for each recording alongside the transcription of its audio.
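To pair each recording with its transcription, you could read line_index.tsv along the lines of the Python sketch below. It rests on a few assumptions that are not confirmed by the dataset description: the FileID is the first column, the transcription is the last column, and each wave file is named after its FileID with a .wav extension. The function name load_line_index is hypothetical.

```python
import csv
from pathlib import Path


def load_line_index(tsv_path, audio_dir):
    """Return a list of (wav_path, transcription) pairs.

    Assumptions: FileID in the first column, transcription in the last
    column, and audio stored as <FileID>.wav inside audio_dir.
    """
    audio_dir = Path(audio_dir)
    pairs = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the text untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            file_id = row[0].strip()
            transcription = row[-1].strip()
            pairs.append((audio_dir / f"{file_id}.wav", transcription))
    return pairs
```

Each pair can then be handed to whichever audio loader your training pipeline uses.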

Access the dataset

A Corpus of Modern Burmese Dataset

This is a corpus of modern Burmese compiled by John Okell in the 1990s and converted into Unicode more recently. It contains transcripts of over 1,500 conversations, exclusively in the Burmese language, and is distributed as text files.

Access the dataset

Myanmar-English Parallel Dataset

The UCSY corpus was constructed by the NLP Lab at the University of Computer Studies, Yangon (UCSY), Myanmar, with the aim of promoting machine translation research on the Burmese language. The corpus consists of 200 thousand Myanmar-English parallel sentences collected from different domains, including news articles and textbooks.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Burmese language datasets for your projects:

  1. CC100-Burmese Dataset
  2. Crowdsourced Burmese Speech Dataset
  3. A Corpus of Modern Burmese Dataset
  4. Myanmar-English Parallel Dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

