The Best Korean Language Datasets of 2022

Korean is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Korean language datasets to train your models. 

That’s why we’ve done the hard bit for you. We’ve searched high and low here at Twine to find the best Korean Language datasets.

Are you ready?

Let’s dive in.


Here are our top picks for Korean Language datasets:

CC100-Korean Dataset

Created in 2020, the CC100-Korean dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 14 GB in size and is exclusively in the Korean language. It contains text files.

Access the dataset
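If you want a quick feel for a corpus of this size before training on it, a small script can help. The sketch below assumes the plain-text layout used by CC-100, where documents are separated by a blank line and each line inside a document is one paragraph. The function names `iter_documents` and `corpus_stats` are our own placeholders, not part of the dataset.

```python
from typing import Dict, Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of paragraph strings.

    Assumes documents are separated by blank lines and that each
    non-blank line is a single paragraph.
    """
    paragraphs: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count documents and paragraphs without loading everything into memory."""
    documents = 0
    paragraphs = 0
    for doc in iter_documents(lines):
        documents += 1
        paragraphs += len(doc)
    return {"documents": documents, "paragraphs": paragraphs}
```

To try it on the downloaded corpus, pass an open file handle, for example `with open("ko.txt", encoding="utf-8") as f: print(corpus_stats(f))`, where `ko.txt` is a hypothetical file name. Because the file is read line by line, the whole corpus never has to fit in memory.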

Intonation-Aided Intention Identification for Korean (3i4K) Dataset

Created in 2018, the Intonation-Aided Intention Identification for Korean (3i4K) Dataset contains a corpus of single text utterances from conversation, each annotated with one of seven intent classes. It is exclusively in the Korean language and contains 61 text files.

Access the dataset

Korean Hate Speech Dataset

Created in 2020, the Korean Hate Speech Dataset contains 9.4K manually labeled entertainment news comments for identifying Korean toxic speech. It is exclusively in the Korean language and contains 9,381 text files.

Access the dataset
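With a labeled dataset like this one, it is worth checking how the labels are distributed before you train a classifier. The sketch below is a minimal example that assumes the labels come as a tab-separated file with a header row, a `comments` column and a `hate` label column. Those column names, the label values in the sample, and the helper `label_counts` are assumptions for illustration, so adjust them to match the files you download.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def label_counts(tsv_file: TextIO, label_column: str = "hate") -> Counter:
    """Count how often each label appears in a tab-separated file.

    QUOTE_NONE keeps quotation marks inside comments from being
    treated as CSV quoting.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[label_column] for row in reader)


# Small in-memory sample standing in for a real label file (made-up rows).
sample = io.StringIO(
    "comments\thate\n"
    "첫 번째 댓글\tnone\n"
    "두 번째 댓글\thate\n"
    "세 번째 댓글\tnone\n"
)
print(label_counts(sample))
```

A heavily skewed count is a hint that you may want class weights or resampling during training.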

Korean Single Speaker Speech (KSS) Dataset

Created in 2019, the Korean Single Speaker Speech (KSS) Dataset consists of audio files recorded by a professional voice actress, together with their aligned text extracted from books. It is exclusively in the Korean language and contains 12,853 WAV files.

Access the dataset
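For a speech dataset, the total amount of audio is often more useful to know than the file count. The sketch below adds up the duration of every WAV file under a folder using only Python's standard `wave` module. It assumes the files are uncompressed PCM WAV, which is what `wave` can read; the helper names are our own and not part of the dataset.

```python
import wave
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def wav_duration_seconds(path: PathLike) -> float:
    """Return the length of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(root: PathLike) -> float:
    """Sum the duration of every .wav file found under root, recursively."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(root).rglob("*.wav")))
```

Calling `total_duration_seconds("kss")` on the extracted archive, where `kss` is a hypothetical folder name, returns the total audio length in seconds; divide by 3600 for hours.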


Wrapping up

To conclude, here are our top picks for the best Korean language datasets for your projects:

  1. CC100-Korean Dataset
  2. Intonation-Aided Intention Identification for Korean (3i4K) Dataset
  3. Korean Hate Speech Dataset
  4. Korean Single Speaker Speech (KSS) Dataset

We hope that this list has helped you either find a dataset for your project or realize the myriad of options available.

Please let us know if there are any datasets you would like us to add to the list.

If you would like to learn more about how we could help build a custom dataset for your project, don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.