The Best Hungarian Language Datasets of 2022

Hungarian is spoken by roughly 13 million people, yet it’s not always easy to find Hungarian language datasets to train your models.

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Hungarian Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Hungarian Language datasets:

Hungarian Language Corpora

The Hunglish Corpus is a sentence-aligned Hungarian-English parallel corpus published under the Creative Commons Attribution license. The Hungarian Web corpus is a corpus of the Hungarian language gathered from the web.

Access the dataset
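As a rough illustration of what "sentence-aligned" means in practice, here is a minimal Python sketch that reads Hungarian-English pairs. It assumes one pair per line with the two sentences separated by a tab; that layout, the function name `read_aligned_pairs`, and the sample sentences are illustrative assumptions, so check the corpus documentation for the actual file format.

```python
import io


def read_aligned_pairs(lines):
    """Yield (hungarian, english) tuples from tab-separated lines.

    Assumption: each line holds one Hungarian sentence, a tab, and its
    English counterpart. Lines that do not fit that shape are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # blank or malformed line
        hungarian, english = (part.strip() for part in parts)
        if hungarian and english:
            yield hungarian, english


# Hypothetical sample standing in for a real corpus file.
sample = io.StringIO("Jó reggelt!\tGood morning!\n\nKöszönöm.\tThank you.\n")
pairs = list(read_aligned_pairs(sample))
for hungarian, english in pairs:
    print(f"{hungarian} -> {english}")
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) in place of the `StringIO` sample.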

CC100-Hungarian Dataset

Created in 2020, the CC100-Hungarian dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 15 GB in size, is exclusively in the Hungarian language, and consists of text files.

Access the dataset
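Because the corpus is around 15 GB, it usually makes sense to stream it rather than load it into memory. The sketch below assumes the layout described for CC-100 files, where documents are separated by a blank line and paragraphs within a document sit on consecutive lines; the function name `iter_documents` and the sample text are illustrative, not part of the dataset.

```python
import io


def iter_documents(lines):
    """Yield each document as a list of paragraph strings.

    Assumption: a blank line marks the boundary between documents, and
    every non-blank line is one paragraph of the current document.
    """
    paragraphs = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            yield paragraphs
            paragraphs = []
    if paragraphs:  # last document may not end with a blank line
        yield paragraphs


# Hypothetical sample standing in for the real text file.
sample = io.StringIO("Első bekezdés.\nMásodik bekezdés.\n\nÚj dokumentum.\n")
docs = list(iter_documents(sample))
print(len(docs), "documents")
```

With the real corpus, iterate over the generator directly instead of calling `list`, so only one document is held in memory at a time.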

CSS10 Hungarian Dataset

CSS10 is a collection of single-speaker speech datasets for 10 languages. Each dataset consists of audio files recorded by a single volunteer, sourced from LibriVox, together with their aligned text. This entry is exclusively in the Hungarian language and includes text files alongside the audio.

Access the dataset
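To pair each recording with its text, you would typically parse the transcript file that ships with the audio. The sketch below assumes a pipe-delimited transcript with four fields per line (audio path, original text, normalized text, duration in seconds); that field order, the `Utterance` type, and the sample rows are assumptions for illustration, so verify them against the dataset's README.

```python
import io
from typing import NamedTuple


class Utterance(NamedTuple):
    audio_path: str
    text: str
    normalized_text: str
    duration_seconds: float


def read_transcript(lines):
    """Yield Utterance records from pipe-delimited transcript lines.

    Assumption: path|text|normalized text|duration. Lines with a
    different number of fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 4:
            continue
        path, text, normalized, duration = fields
        yield Utterance(path, text, normalized, float(duration))


# Hypothetical rows standing in for the real transcript file.
sample = io.StringIO(
    "book/chapter_0001.wav|Jó napot kívánok.|jó napot kívánok|1.85\n"
    "book/chapter_0002.wav|Hogy vagy?|hogy vagy|2.40\n"
)
utterances = list(read_transcript(sample))
total_seconds = sum(u.duration_seconds for u in utterances)
print(f"{len(utterances)} utterances, {total_seconds:.2f} s of audio")
```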

Wrapping up

To conclude, our top picks for the best Hungarian language datasets for your projects are the Hungarian Language Corpora, the CC100-Hungarian Dataset, and the CSS10 Hungarian Dataset.

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available.

Please let us know if there are any datasets you would like us to add to the list.

If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:
