6 Best Japanese Language Datasets of 2022

Japanese is one of the most commonly spoken languages in the world. That being said, it’s not always easy to find Japanese language datasets with a specific dialect or type of speech to train your models. 

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Japanese Language datasets.

Are you ready?

Let’s dive into our list of the best Japanese Language datasets in 2022.


Here are our top picks for Japanese Language datasets:

1. PheMT Dataset

Created by Fujii in 2020, the PheMT Dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. It covers Japanese and English and contains 1,566 entries in TSV file format.

Access the dataset
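Since PheMT ships as TSV, a first look at the data can be done with Python's standard library alone. The snippet below is a minimal sketch: the sample rows and the column names (`ja`, `en`, `phenomenon`) are made up for illustration, so check the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded TSV file. The column names
# ("ja", "en", "phenomenon") are assumptions, not the dataset's real schema.
SAMPLE_TSV = (
    "ja\ten\tphenomenon\n"
    "東京に行きます\tI am going to Tokyo\tproper_noun\n"
    "スマホを買った\tI bought a smartphone\tabbreviated_noun\n"
    "めっちゃ眠い\tI am super sleepy\tcolloquial\n"
)


def read_tsv(handle):
    """Return a list of dict rows from a tab-separated file handle."""
    # QUOTE_NONE treats quote characters in the text as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv(io.StringIO(SAMPLE_TSV))
counts = Counter(row["phenomenon"] for row in rows)
print(len(rows), dict(counts))
```

For a real file, swap the `io.StringIO` object for `open(path, encoding="utf-8", newline="")`.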

2. CC100-Japanese Dataset

Created by Conneau & Wenzek in 2020, the CC100-Japanese Dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The corpus is 15G in size, is in the Japanese language, and comes in text file format.

Access the dataset
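A 15G text file is too large to load into memory comfortably, so it usually makes sense to stream it. Here is a rough sketch of that idea. It assumes the corpus puts one paragraph on each line and separates documents with a blank line; please verify that layout against the file you download. The function name `iter_documents` is our own.

```python
import io


def iter_documents(lines):
    """Yield documents as lists of paragraphs from an iterable of lines.

    Assumes a blank line ends a document and every other line is a paragraph.
    """
    paragraphs = []
    for line in lines:
        text = line.rstrip("\n")
        if text:
            paragraphs.append(text)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Tiny in-memory sample standing in for the real corpus file.
sample = io.StringIO("今日は晴れです。\n散歩に行きます。\n\n猫が好きです。\n")
docs = list(iter_documents(sample))
print(len(docs))
```

Because the function only needs an iterable of lines, you can pass it an open file handle and process the corpus one document at a time.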

3. HJDataset

Created by Shen in 2020, the HJDataset contains over 250,000 layout element annotations of seven types in Japanese documents. The annotations are in JSON file format.

Access the dataset
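With layout annotations stored as JSON, a quick sanity check is to count how many annotations fall under each element type. The sketch below assumes a COCO-like layout with `categories` and `annotations` keys. Both the keys and the category names in the sample are illustrative assumptions, so adjust them to match the real files.

```python
import json
from collections import Counter

# Hypothetical miniature annotation file in a COCO-like layout. The keys and
# category names are illustrative assumptions, not the dataset's real schema.
SAMPLE_JSON = """
{
  "categories": [{"id": 1, "name": "title"}, {"id": 2, "name": "text_region"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 100, 20]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [5, 30, 100, 200]},
    {"id": 12, "image_id": 2, "category_id": 2, "bbox": [8, 12, 90, 180]}
  ]
}
"""


def count_by_category(data):
    """Count annotations per category name."""
    names = {cat["id"]: cat["name"] for cat in data["categories"]}
    return Counter(names[ann["category_id"]] for ann in data["annotations"])


layout_counts = count_by_category(json.loads(SAMPLE_JSON))
print(layout_counts)
```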

4. Business Scene Dialogue (BSD) Dataset

Created by Rikters in 2020, the Business Scene Dialogue (BSD) Dataset contains 955 scenarios and 30,000 parallel sentences in Japanese and English, in JSON file format.

Access the dataset
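For machine translation work you will usually want the dialogues flattened into sentence pairs. The following is a small sketch of that step. It assumes each scenario holds a `conversation` list whose turns carry `ja_sentence` and `en_sentence` fields; treat those field names as assumptions and confirm them against the downloaded JSON.

```python
import json

# Hypothetical sample scenario. The field names ("conversation",
# "ja_sentence", "en_sentence") are assumptions to check against the real data.
SAMPLE_JSON = """
[
  {
    "id": "scenario-001",
    "conversation": [
      {"no": 1, "ja_sentence": "おはようございます。", "en_sentence": "Good morning."},
      {"no": 2, "ja_sentence": "会議は十時からです。", "en_sentence": "The meeting starts at ten."}
    ]
  }
]
"""


def parallel_pairs(scenarios):
    """Yield (Japanese, English) sentence pairs from a list of scenarios."""
    for scenario in scenarios:
        for turn in scenario["conversation"]:
            yield turn["ja_sentence"], turn["en_sentence"]


pairs = list(parallel_pairs(json.loads(SAMPLE_JSON)))
for ja, en in pairs:
    print(ja, "|", en)
```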

5. MNIST Dataset

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. In Text file format.

Access the dataset

6. JaQuAD: Japanese Question Answering Dataset

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese Machine Reading Comprehension. JaQuAD is developed to provide a SQuAD-like QA dataset in Japanese. JaQuAD contains 39,696 question-answer pairs. Questions and answers are manually curated by human annotators. Contexts are collected from Japanese Wikipedia articles. The data is in text file format.

Access the dataset
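Because JaQuAD is described as SQuAD-like, one useful check before training is that each answer really appears in its context at the stated position. The sketch below assumes SQuAD-style fields (`context`, `answers.text`, `answers.answer_start`) and that offsets count characters rather than bytes. Both points are assumptions to confirm against the actual data, and the sample record is invented for illustration.

```python
import json

# Hypothetical record in a SQuAD-style layout; field names are assumptions.
SAMPLE_JSON = """
{
  "context": "富士山は日本で最も高い山である。",
  "question": "日本で最も高い山は何ですか。",
  "answers": {"text": ["富士山"], "answer_start": [0]}
}
"""


def answer_spans_match(example):
    """Return True if every answer text sits in the context at its offset."""
    context = example["context"]
    answers = example["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        # Python strings index by character, which matters for Japanese text.
        if context[start:start + len(text)] != text:
            return False
    return True


example = json.loads(SAMPLE_JSON)
is_consistent = answer_spans_match(example)
print(is_consistent)
```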


Wrapping up

To conclude, here are our top picks for the best Japanese Language datasets for your projects:

  1. PheMT Dataset
  2. CC100-Japanese Dataset
  3. HJDataset
  4. Business Scene Dialogue (BSD) Dataset
  5. MNIST Dataset
  6. JaQuAD: Japanese Question Answering Dataset

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

