The Best Tamil Language Datasets of 2022

Tamil is one of the most widely spoken languages in the world. Even so, it’s not always easy to find Tamil language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Tamil Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Tamil Language datasets:

CC100-Tamil Dataset

Created in 2020, the CC100-Tamil dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 1.3 GB in size, is exclusively in the Tamil language, and consists of text files.

Access the dataset
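Because the CC100-Tamil corpus is distributed as plain text, a small reader is usually all you need to get started. The sketch below assumes that documents are separated by blank lines with one paragraph per line; check the layout of the file you download before relying on it. The function name `iter_documents` and the sample lines are illustrative only.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into documents, assuming a blank line ends each document."""
    paragraphs: List[str] = []
    for line in lines:
        text = line.strip()
        if text:
            paragraphs.append(text)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Illustrative sample, not taken from the corpus.
sample = ["வணக்கம் உலகம்\n", "இது ஒரு சோதனை\n", "\n", "இரண்டாவது ஆவணம்\n"]
documents = list(iter_documents(sample))
print(len(documents), "documents")
```

To stream a real file without loading it all into memory, pass an open file handle instead of a list, for example `iter_documents(open(path, encoding="utf-8"))`, where `path` is wherever you saved the corpus.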

Tamilmixsentiment Dataset

This Tamil-English code-switched, sentiment-annotated dataset contains 15,744 comment posts from YouTube. This makes it the largest general-domain sentiment dataset for this relatively low-resource, code-mixed language pair. Each comment/post is annotated with sentiment polarity at the comment/post level. The dataset is also class-imbalanced, which reflects real-world scenarios.

Access the dataset
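Class imbalance like that in the Tamilmixsentiment dataset is often handled by weighting rarer classes more heavily during training. The sketch below computes inverse-frequency weights, using the formula that scikit-learn documents for `class_weight="balanced"`. The label names and counts are placeholders, not the dataset's real distribution.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Return total / (num_classes * class_count) for each label."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {
        label: total / (num_classes * count)
        for label, count in counts.items()
    }


# Placeholder labels for illustration only.
labels = ["Positive"] * 6 + ["Negative"] * 3 + ["Mixed_feelings"] * 1
weights = inverse_frequency_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The rarest class gets the largest weight, so a loss function that accepts per-class weights penalizes mistakes on it more heavily.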

ChAII Dataset

This question-answering dataset covers Hindi and Tamil and was collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators.

Access the dataset
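For a dataset like this that pairs questions with passages, a useful first check is that every answer can be found where the annotation says it is. The sketch below assumes an extractive layout with a context, a question, the answer text, and a character offset. The field names and the example record are assumptions for illustration, so adapt them to the columns in the files you download.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    # Hypothetical field names; adjust to the real column names.
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer within context


def answer_span_is_consistent(example: QAExample) -> bool:
    """Check that the annotated offset really points at the answer text."""
    end = example.answer_start + len(example.answer_text)
    return example.context[example.answer_start:end] == example.answer_text


# Illustrative record, not taken from the dataset.
context = "சென்னை தமிழ்நாட்டின் தலைநகரம் ஆகும்."
answer = "சென்னை"
good = QAExample(context, "தமிழ்நாட்டின் தலைநகரம் எது?", answer, context.index(answer))
bad = QAExample(context, "தமிழ்நாட்டின் தலைநகரம் எது?", answer, context.index(answer) + 1)

print(answer_span_is_consistent(good), answer_span_is_consistent(bad))
```

Python slices strings by Unicode code point, so offsets produced by a tool that counts bytes or grapheme clusters would need converting first. This matters for Tamil, where one visible character is often several code points.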

CVIT PIB Dataset

This sentence-aligned parallel corpus spans 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) as well as English. The corpora are compiled from online sources that share content across languages. The accompanying work also reports on methods for constructing corpora with tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.

Access the dataset
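Sentence-aligned corpora such as this one are commonly shipped as two files in which line N of one file translates line N of the other. The sketch below pairs lines under that assumption and fails loudly if the two sides differ in length. The function name and the sample sentences are illustrative, and the real distribution format may differ.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

_MISSING = object()  # sentinel marking the shorter side running out


def iter_sentence_pairs(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned sequences."""
    for src, tgt in zip_longest(source_lines, target_lines, fillvalue=_MISSING):
        if src is _MISSING or tgt is _MISSING:
            raise ValueError("source and target have different line counts")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            # Skip pairs where either side is empty.
            yield src, tgt


# Illustrative sentences, not taken from the corpus.
english = ["The minister opened the new bridge.\n", "\n", "Schools reopen on Monday.\n"]
tamil = ["அமைச்சர் புதிய பாலத்தைத் திறந்து வைத்தார்.\n", "\n", "பள்ளிகள் திங்கட்கிழமை மீண்டும் திறக்கப்படும்.\n"]

pairs = list(iter_sentence_pairs(english, tamil))
print(len(pairs), "aligned pairs")
```

Raising on a length mismatch is deliberate: silently truncating to the shorter file would hide an alignment error that corrupts every pair after it.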

Wrapping up

To conclude, here are our top picks for the best Tamil language datasets for your projects:

  1. CC100-Tamil Dataset
  2. Tamilmixsentiment Dataset
  3. ChAII Dataset
  4. CVIT PIB Dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.


Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.