The Best Persian Language Datasets of 2022

Persian is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Persian language datasets to train your models. 

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Persian Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Persian Language datasets:

CC100-Persian Dataset

Created in 2020, the CC100-Persian dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The corpus is 20 GB in size, exclusively in the Persian language, and consists of text files.

Access the dataset
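A corpus of this size is usually easier to stream than to load into memory. Here is a minimal Python sketch of that idea. It assumes the text files separate documents with a blank line and paragraphs with a single newline; check the files you download before relying on this. The function name `iter_documents` and the file name `fa.txt` are made up for illustration.

```python
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document at a time from an iterable of text lines.

    Assumption: documents are separated by blank lines, and the
    paragraphs of one document sit on consecutive lines.
    """
    paragraphs = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield "\n".join(paragraphs)


# Usage with a hypothetical local file name:
# with open("fa.txt", encoding="utf-8") as f:
#     for doc in iter_documents(f):
#         ...  # tokenize, filter, or sample here
```

Because the function only needs an iterable of lines, it works on an open file handle without reading the whole corpus first.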

PerKey Dataset

Created by Doostmohammadi in 2020, the PerKey Dataset contains 553K news articles from six Persian news websites and agencies, each with author-extracted keyphrases. The keyphrases were then filtered and cleaned to improve their quality. The dataset is exclusively in the Persian language and contains 553,111 JSON files.

Access the dataset
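With one JSON file per article, a quick first step is to walk the files and count how often each keyphrase appears. The sketch below shows one way to do that in Python. The field name `keyphrases`, the directory name, and both function names are assumptions for illustration; the actual schema may differ, so inspect a file first.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator


def iter_articles(directory: str) -> Iterator[dict]:
    """Yield each article as a dict, reading one JSON file at a time."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            yield json.load(f)


def count_keyphrases(articles: Iterable[dict], field: str = "keyphrases") -> Counter:
    """Count keyphrase occurrences across articles.

    `field` is a hypothetical key name; articles without it are skipped.
    """
    counts = Counter()
    for article in articles:
        counts.update(article.get(field, []))
    return counts


# Usage with a hypothetical directory name:
# counts = count_keyphrases(iter_articles("perkey/train"))
# print(counts.most_common(20))
```

Keeping the file reading separate from the counting means the counting logic can be checked on a handful of in-memory records before running it over all the files.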

Perlex Dataset

Created by Asgari-Bidhendi in 2020, the Perlex Dataset is an expert-translated version of the SemEval-2010 Task 8 dataset. It is exclusively in the Persian language and contains 10,717 files.

Access the dataset
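The original SemEval-2010 Task 8 data marks the two entities of each sentence with `<e1>` and `<e2>` tags. Assuming the Persian translation keeps that markup (worth verifying on the files themselves), a small Python sketch like this one pulls out the entities and the plain sentence. The function name `parse_tagged_sentence` is made up for illustration, and the example sentence is in English only to keep it readable.

```python
import re

# Entity markup as used in the original SemEval-2010 Task 8 files.
E1 = re.compile(r"<e1>(.*?)</e1>")
E2 = re.compile(r"<e2>(.*?)</e2>")
TAGS = re.compile(r"</?e[12]>")


def parse_tagged_sentence(sentence: str) -> dict:
    """Return the untagged text and the two marked entities."""
    m1 = E1.search(sentence)
    m2 = E2.search(sentence)
    if m1 is None or m2 is None:
        raise ValueError("sentence does not contain both <e1> and <e2> tags")
    return {
        "text": TAGS.sub("", sentence),  # sentence with the tags stripped
        "e1": m1.group(1),
        "e2": m2.group(1),
    }


example = parse_tagged_sentence("The <e1>fire</e1> caused <e2>damage</e2>.")
```

The regular expressions work on Unicode strings, so Persian text inside the tags is handled the same way as the English example.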

Persian Language Sentiment Instagram Analysis Dataset

This dataset comes from work on Insta-text, a sentiment analysis dataset of Persian-language Instagram comments. In the study, about 111,000 Instagram comments were scraped, and about 9,000 of them were labeled through crowdsourcing. A Word2vec model was also used to validate the dataset.

Access the dataset


Wrapping up

To conclude, here are our top picks for the best Persian language datasets for your projects:

  1. CC100-Persian Dataset
  2. PerKey Dataset
  3. Perlex Dataset
  4. Persian Language Sentiment Instagram Analysis Dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.

