The Best Malaysian Language Datasets of 2022

Malaysian (Malay, or Bahasa Melayu) is one of the most commonly spoken languages in the world. Even so, it’s not always easy to find Malaysian language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Malaysian Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Malaysian Language datasets:

CC100-Malay Dataset

Created in 2020, the CC100-Malay dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 2.1G in size, consists of text files, and is exclusively in the Malaysian language.

Access the dataset
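A corpus of this size is usually easier to handle as a stream than as a single string in memory. The sketch below is one minimal way to do that in Python. It assumes, and this is an assumption rather than something stated above, that documents in the text file are separated by blank lines and that each non-blank line is one paragraph. The helper name `iter_documents` and the sample sentences are made up for illustration.

```python
import io
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document at a time from a line-oriented text stream.

    Assumption: documents are separated by blank lines, and each
    non-blank line is one paragraph of the current document.
    """
    paragraphs = []
    for line in stream:
        line = line.rstrip("\n")
        if line.strip():
            # Non-blank line: another paragraph of the current document.
            paragraphs.append(line)
        elif paragraphs:
            # Blank line: the current document is complete.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # Last document when the stream does not end with a blank line.
        yield "\n".join(paragraphs)


# Tiny in-memory stand-in for the real corpus file (hypothetical content).
sample = io.StringIO("Selamat pagi.\nApa khabar?\n\nTerima kasih.\n")
docs = list(iter_documents(sample))
print(len(docs))
```

With a real download, the same function should work on a file opened with `open(path, encoding="utf-8")`, since it only ever holds one document in memory at a time.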

Malay Conversational Speech Corpus

This open-source dataset consists of 5 hours of transcribed Malay conversational speech on certain topics, covering ten conversations between five pairs of speakers. All audio was recorded on mobile devices in an indoor environment at 16 kHz, and the data is exclusively in the Malaysian language.

Access the dataset
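Before training a speech model, it can be worth confirming that each recording really has the sample rate you expect. The sketch below uses Python's standard `wave` module to read the header of a recording. It assumes the audio is delivered as PCM WAV files, which is not stated above, and the helper name `wav_info` and the generated `example.wav` file are placeholders for illustration.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# Build a one-second silent 16 kHz mono file as a stand-in for a real recording.
tmp_path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(tmp_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit audio
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

rate, channels, duration = wav_info(tmp_path)
print(rate, channels, duration)
```

Running `wav_info` over every file in a corpus and summing the durations is also a quick way to check the advertised number of hours.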

MULTIGLOSS MALAY DICTIONARY Dataset

This dataset is part of a series of innovative multilingual glossaries that is currently available for 22 languages. Each glossary is based on a human-edited bilingual index from the language to English, from which translations in 45 more languages are semi-automatically generated. The dataset contains text files.

Access the dataset

IMPROVING ACCURACY IN SENTIMENT ANALYSIS FOR MALAY LANGUAGE Dataset

This paper presents a hybrid approach that combines a knowledge base with machine learning, implemented in the Mi-Intelligence Sentiment Analysis system. The proposed approach supports multilingualism and is applied to text articles written in the Malay (Bahasa Melayu) language. A dataset in the Malay language is manually annotated with sentiment values and used for performance evaluation. The dataset is exclusively in the Malaysian language and contains text files.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Malaysian language datasets for your projects:

  1. CC100-Malay Dataset
  2. Malay Conversational Speech Corpus
  3. MULTIGLOSS MALAY DICTIONARY Dataset
  4. IMPROVING ACCURACY IN SENTIMENT ANALYSIS FOR MALAY LANGUAGE Dataset

We hope that this list has either helped you find a dataset for your project or shown you the myriad options available.

Please let us know if there are any datasets you would like us to add to the list.

If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.