The Best Thai Language Datasets of 2022

Thai is spoken by tens of millions of people, yet it’s not always easy to find Thai language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Thai language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Thai language datasets:

CC100-Thai Dataset

Created in 2020, the CC100-Thai dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The corpus is 8.7 GB of text files, exclusively in the Thai language.

Access the dataset
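
A corpus of this size is usually read as a stream rather than loaded into memory in one go. The sketch below shows one way to do that in Python. It assumes the text file separates documents with a blank line and paragraphs within a document with a single newline; check the files you download before relying on that. The function name `iter_documents` and the sample text are hypothetical.

```python
import io
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document at a time from blank-line-separated text.

    Assumption: a blank line ends a document, and each non-blank line
    is one paragraph of the current document.
    """
    paragraphs = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the document collected so far.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield "\n".join(paragraphs)


# Hypothetical in-memory sample; an open file handle works the same way.
sample = io.StringIO("สวัสดี\nโลก\n\nภาษาไทย\n")
docs = list(iter_documents(sample))
print(len(docs))
```

Because the function is a generator, only one document is held in memory at a time, which matters for a multi-gigabyte file.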

Wisesight Sentiment Corpus Dataset

Created in 2019, the Wisesight Sentiment Corpus Dataset contains around 26,700 Thai-language messages from various social media platforms, each with a human-annotated sentiment label (positive, neutral, negative, or question). The dataset consists of text data.

Access the dataset
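
Before training a sentiment classifier, it can help to check how the four labels are balanced. Here is a minimal Python sketch of that check. The label strings and the sample list are hypothetical stand-ins; the names and file layout used in the actual corpus may differ.

```python
from collections import Counter

# Hypothetical label names based on the four categories described above.
LABELS = ("positive", "neutral", "negative", "question")


def label_distribution(labels):
    """Return the share of each label as a fraction between 0 and 1."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Made-up labels for illustration only, not taken from the corpus.
sample_labels = ["positive", "neutral", "neutral", "question"]
distribution = label_distribution(sample_labels)
print(distribution)
```

A heavily skewed distribution is a hint that you may want class weighting or resampling when you train.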

HSE Thai Corpus Dataset

The Thai language is the primary language of Thailand and a recognized minority language in Cambodia. It has approximately 20 million native speakers, in addition to 44 million second-language speakers. It is written in Thai script (also called the Thai alphabet), which is notable for being the first writing system to incorporate tonal markers. Thai is written without spaces between words, so text from a corpus like this typically needs word segmentation before analysis.

Access the dataset
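
Since Thai is written without spaces between words, most text pipelines start by splitting a sentence into words. The Python sketch below illustrates one simple technique, greedy longest-match segmentation against a word list. The tiny vocabulary and the `segment` function are hypothetical; real projects generally use a dedicated Thai tokenizer with a much larger dictionary, and nothing here is tied to how the HSE Thai Corpus itself was processed.

```python
def segment(text, vocab):
    """Split text into words by greedy longest-match against vocab.

    At each position, take the longest vocabulary entry that matches.
    If nothing matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical four-word vocabulary: "I", "love", "language", "Thai".
VOCAB = {"ฉัน", "รัก", "ภาษา", "ไทย"}
tokens = segment("ฉันรักภาษาไทย", VOCAB)
print(tokens)
```

Greedy matching is easy to follow but can pick the wrong split when words overlap, which is one reason production tokenizers use more careful methods.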

THFOOD-50 Dataset

The THFOOD-50 Dataset contains 15,770 images of 50 famous Thai dishes. It consists of image files rather than text.

Access the dataset

490 People Dataset

This Thai speech data (guiding) was collected from 490 native Thai speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as in-car scenes, smart homes, and speech assistants, with 50 sentences per speaker. The valid volume is 15 hours. All recordings are manually transcribed with high accuracy.

Access the dataset
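
The figures above give a rough sense of the dataset's shape. The short Python calculation below works it out, assuming every speaker recorded exactly 50 sentences and the valid volume is exactly 15 hours; the real per-speaker counts may vary.

```python
# Figures quoted in the dataset description.
SPEAKERS = 490
SENTENCES_PER_SPEAKER = 50
VALID_HOURS = 15

# Total utterances if every speaker recorded the full set.
total_utterances = SPEAKERS * SENTENCES_PER_SPEAKER

# Average utterance length in seconds under the same assumption.
avg_seconds = VALID_HOURS * 3600 / total_utterances

print(total_utterances)
print(round(avg_seconds, 1))
```

Under those assumptions this comes to 24,500 utterances averaging a little over 2 seconds each, which suggests short command-style sentences rather than long passages.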


Wrapping up

To conclude, here are our top picks for the best Thai language datasets for your projects:

  1. CC100-Thai Dataset
  2. Wisesight Sentiment Corpus Dataset
  3. HSE Thai Corpus Dataset
  4. THFOOD-50 Dataset
  5. 490 People Dataset

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available.

Please let us know if there are any datasets you would like us to add to the list.

If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

