German is one of the most widely spoken languages in the world. That said, it’s not always easy to find German language datasets that cover a specific dialect or type of speech to train your models.
That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best German Language datasets.
Are you ready?
Let’s dive into our list of the best German Language datasets in 2022.
Here are our top picks for German Language datasets:
1. Multi30k Dataset
Created by Elliott et al. in 2016, the Multi30k Dataset consists of images paired with sentences in English and German. It extends the Flickr30K dataset and contains 31,014 items. The file format is not specified.
2. WebNLG (Enriched) Dataset
Created by Gardent et al. in 2017, the WebNLG (Enriched) Dataset consists of 25,298 (data, text) pairs and 9,674 distinct data units. The data units are sets of RDF triples extracted from DBPedia, and the texts are sequences of one or more sentences verbalizing these data units. The dataset covers German and English and is provided in XML file format.
3. Ten Thousand German News Articles Dataset (10kGNAD) Dataset
Created by Timo Block in 2019, the Ten Thousand German News Articles Dataset (10kGNAD) consists of 10,273 German language news articles from an Austrian online newspaper, categorized into nine topics. It is provided in CSV file format.
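If you want a quick feel for how articles are spread across topics, a short script can do the counting. The sketch below is only an illustration: it assumes a hypothetical two-column layout (topic label, then article text) with `;` as the delimiter and `'` as the quote character, and it runs on a made-up in-memory sample rather than the real files. Check the actual layout of your download before relying on it.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The two-column layout, the ";" delimiter and the
# "'" quote character are assumptions, not confirmed dataset details.
SAMPLE = (
    "Sport;'Der Verein gewann das Spiel am Sonntag.'\n"
    "Wirtschaft;'Die Inflation stieg im Mai leicht an.'\n"
    "Sport;'Die Saison beginnt im August.'\n"
)


def read_articles(handle, delimiter=";", quotechar="'"):
    """Yield (topic, text) pairs from a CSV-like file handle."""
    reader = csv.reader(handle, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield row[0], row[1]


def topic_counts(handle):
    """Count how many articles appear under each topic label."""
    return Counter(topic for topic, _ in read_articles(handle))


counts = topic_counts(io.StringIO(SAMPLE))
print(counts)
```

To try it on a real file, you would pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) in place of the `io.StringIO` sample.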
4. CC100-German Dataset
Created by Conneau & Wenzek et al. in 2020, the CC100-German Dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The corpus is 18G in size and is provided in text file format. The number of entries is not specified.
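At 18G, a corpus like this is usually too large to load into memory in one go, so it helps to stream it. Here is a minimal sketch of that idea. It assumes documents are separated by blank lines with one paragraph per line, which you should verify against the files you download. The function name `iter_documents` and the sample text are ours, purely for illustration.

```python
import io


def iter_documents(handle):
    """Yield one document at a time as a list of paragraph strings.

    Assumes blank lines separate documents (an assumption about the layout).
    Only the current document is held in memory.
    """
    paragraphs = []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            yield paragraphs
            paragraphs = []
    if paragraphs:  # last document when the file has no trailing blank line
        yield paragraphs


# Tiny in-memory stand-in for the real multi-gigabyte text file.
sample = io.StringIO("Erster Absatz.\nZweiter Absatz.\n\nNeues Dokument.\n")
docs = list(iter_documents(sample))
print(len(docs))
```

Because the function is a generator, you can loop over a real file handle and stop early, for example after the first few thousand documents, without reading the rest.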
5. GermEval 2014 NER Shared Task Dataset
Created by Benikova et al. in 2014, the GermEval 2014 NER Shared Task Dataset was sampled from German Wikipedia and news corpora as a collection of citations. It covers over 31,000 German sentences, corresponding to over 590,000 tokens, and is provided in TSV file format.
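Token-level NER data in TSV form is often stored one token per line, with blank lines between sentences. The sketch below groups such lines back into sentences. The column order (index, token, tag), the `#` comment lines, and the sample rows are all assumptions made for illustration, so compare them with the real files first.

```python
import io

# Made-up sample. Column layout and comment marker are assumptions.
SAMPLE = (
    "#\tbeispiel\n"
    "1\tAngela\tB-PER\n"
    "2\tMerkel\tI-PER\n"
    "3\tbesucht\tO\n"
    "4\tWien\tB-LOC\n"
    "\n"
    "1\tEs\tO\n"
    "2\tregnet\tO\n"
)


def read_sentences(handle):
    """Yield sentences as lists of (token, tag) tuples."""
    sentence = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # skip comment or metadata lines
        if not line:
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((cols[1], cols[2]))
    if sentence:
        yield sentence


sentences = list(read_sentences(io.StringIO(SAMPLE)))
print(len(sentences), sum(len(s) for s in sentences))
```

From here, counting sentences and tokens on the real data is a simple way to check your download against the figures quoted above.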
6. Event-focused Emotion Corpora for German and English Dataset
Created by Troiano et al. in 2019, the Event-focused Emotion Corpora for German and English Dataset comprises German and English emotion corpora for emotion classification, annotated through crowdsourcing in the style of the ISEAR resources. It contains 2,002 entries in TSV file format.
To conclude, here are our top picks for the best German Language datasets for your projects:
- Multi30k Dataset
- WebNLG (Enriched) Dataset
- Ten Thousand German News Articles Dataset (10kGNAD) Dataset
- CC100-German Dataset
- GermEval 2014 NER Shared Task Dataset
- Event-focused Emotion Corpora for German and English Dataset
We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.
If there are any datasets you would like us to add to the list, please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project, please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.