Hungarian is spoken by roughly 13 million people, most of them in Hungary and its neighboring countries. Even so, it’s not always easy to find Hungarian language datasets to train your models.
That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Hungarian Language datasets.
Are you ready?
Let’s dive in.
Here are our top picks for Hungarian Language datasets:
Hungarian Language Corpora
The Hunglish Corpus is a sentence-aligned Hungarian-English parallel corpus published under the Creative Commons Attribution license. The Hungarian Web corpus is a corpus of the Hungarian language gathered from the web.
Created in 2020, the CC100-Hungarian dataset is one of 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The corpus is 15G in size, is exclusively in the Hungarian language, and consists of text files.
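If you want a quick feel for a corpus of this size, it helps to stream it rather than load it all at once. The sketch below is only an illustration: it assumes the download is an .xz-compressed UTF-8 text file in which documents are separated by blank lines and paragraphs sit on their own lines. The function names and the file path are hypothetical, so adjust them to whatever you actually download.

```python
import lzma
from typing import Iterator, List, TextIO


def iter_documents(lines: TextIO) -> Iterator[List[str]]:
    """Yield each document as a list of paragraphs.

    Assumption: a blank line marks the end of a document and every
    non-blank line is one paragraph.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            yield paragraphs
            paragraphs = []
    # Emit the last document if the file does not end with a blank line.
    if paragraphs:
        yield paragraphs


def count_documents(path: str) -> int:
    """Stream an .xz-compressed text file and count its documents.

    `path` is a placeholder for wherever you saved the corpus.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        return sum(1 for _ in iter_documents(handle))
```

Because the file is read line by line, memory use stays small even for a multi-gigabyte corpus.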
CSS10 Hungarian Dataset
CSS10 is a collection of single-speaker speech datasets for 10 languages. Each of them consists of audio files recorded by a single volunteer and their aligned text sourced from LibriVox. This dataset is exclusively in the Hungarian language and pairs the audio files with text transcripts.
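To pair each recording with its text, you will usually parse a transcript file first. The sketch below assumes a pipe-delimited layout in which the first field is the audio path, the second is the transcript text, and the last is the duration in seconds. That layout is an assumption, not something confirmed here, and the `Utterance` and `parse_transcript` names are hypothetical, so check the files you download before relying on it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Utterance:
    audio_path: str
    text: str
    duration_s: float


def parse_transcript(lines: Iterable[str]) -> Iterator[Utterance]:
    """Parse pipe-delimited transcript lines into Utterance records.

    Assumed layout: audio path | text | ... | duration in seconds.
    Lines that do not fit this layout are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) < 3:
            continue  # too few fields to hold path, text, and duration
        try:
            duration = float(fields[-1])
        except ValueError:
            continue  # last field is not a number, so skip the line
        yield Utterance(fields[0], fields[1], duration)
```

For example, `sum(u.duration_s for u in parse_transcript(open(path, encoding="utf-8")))` would give the total length of the recordings in seconds, where `path` points at your transcript file.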
To conclude, those are our top picks for the best Hungarian language datasets for your projects.
We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available.
Please let us know if there are any datasets you would like us to add to the list.
If you want to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.