The Best Romanian Language Datasets of 2022

Romanian is spoken by more than 20 million people, yet it’s not always easy to find Romanian language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Romanian language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Romanian language datasets:

CC100-Romanian Dataset

Created in 2020, the CC100-Romanian dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 16 GB of plain text files, exclusively in Romanian.
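A corpus of this size is usually easier to read as a stream than to load in one go. The Python sketch below shows one way to do that. It assumes the download is UTF-8 text, possibly xz-compressed, with one paragraph per line and a blank line between documents; the helper names `iter_documents` and `open_corpus` are made up for illustration, so check the layout of the file you actually download.

```python
import lzma
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraphs.

    Assumption: one paragraph per line, blank line between documents.
    """
    paragraphs: List[str] = []
    for line in lines:
        text = line.strip()
        if text:
            paragraphs.append(text)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


def open_corpus(path: str):
    """Open a plain or xz-compressed text file in text mode."""
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Tiny in-memory sample standing in for the real file.
sample = ["Primul paragraf.\n", "Al doilea paragraf.\n", "\n", "Alt document.\n"]
for document in iter_documents(sample):
    print(len(document), "paragraph(s):", document[0])
```

With a real download, you would pass the handle returned by `open_corpus` to `iter_documents` in place of `sample`, so only one document is held in memory at a time.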

Access the dataset

MuST-C Dataset

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, including Romanian. The corpus consists of audio, transcriptions, and translations of English TED talks, and it comes with a predefined training, validation, and test split.

Access the dataset

Europarl-ST Dataset

Europarl-ST is a multilingual spoken language translation (SLT) corpus containing paired audio-text samples for translation from and into 6 European languages, for a total of 30 different translation directions. The corpus was compiled from debates held in the European Parliament between 2008 and 2012.
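If the jump from 6 languages to 30 directions is not obvious, it is simply every ordered source-target pair of distinct languages (6 × 5). A quick Python sketch, using placeholder labels rather than the corpus's actual language codes:

```python
from itertools import permutations

# Placeholder labels; swap in the real language codes of the corpus.
languages = ["lang1", "lang2", "lang3", "lang4", "lang5", "lang6"]

# Each direction is an ordered (source, target) pair of different languages.
directions = list(permutations(languages, 2))

print(len(directions))  # 6 * 5 ordered pairs
```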

Access the dataset

MOROCO (Moldavian and Romanian Dialectal Corpus)

The Moldavian and Romanian Dialectal Corpus (MOROCO) is a corpus that contains 33,564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports, and tech. The dataset is divided into 21,719 samples for training, 5,921 samples for validation, and another 5,924 samples for testing.
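As a quick sanity check on the numbers quoted above, the short Python sketch below adds up the three splits and prints each one's share of the corpus. The figures are copied from the description; nothing here touches the dataset itself.

```python
# Split sizes as quoted in the description above.
splits = {"training": 21_719, "validation": 5_921, "test": 5_924}

total = sum(splits.values())

for name, size in splits.items():
    # Share of the whole corpus held by this split.
    print(f"{name}: {size} samples ({size / total:.1%})")

print("total:", total)
```

The three splits add up to the 33,564 samples quoted, with roughly two thirds of the data in training and the rest shared almost evenly between validation and test.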

Access the dataset

WMT 2016 News Dataset

This dataset is a collection of parallel corpora consisting of about 1,500 English sentences translated into six languages (Czech, German, Finnish, Romanian, Russian, and Turkish), plus an additional 1,500 sentences from each of those six languages translated into English. For Romanian, a third of the dataset was released as a development set. For Turkish, an additional 500-sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators.

Access the dataset

Wrapping up

To conclude, here are our top picks for the best Romanian language datasets for your projects:

  1. CC100-Romanian Dataset
  2. MuST-C Dataset
  3. Europarl-ST Dataset
  4. MOROCO Dataset
  5. WMT 2016 News Dataset

We hope that this list has either helped you find a dataset for your project or shown you the range of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
