6 Best Japanese Language Datasets for NLP and Machine Learning

Japanese is one of the most commonly spoken languages in the world, and demand for high-quality Japanese language datasets has exploded with the growth of NLP and generative AI.

The problem? It’s still surprisingly hard to find the right Japanese datasets, especially if you need a specific domain, dialect, or type of speech to train your models.

That’s why we’ve done the heavy lifting for you. At Twine AI, we’ve searched high and low to find some of the best Japanese NLP datasets you can use for machine translation, document layout understanding, question answering, and more.

Are you ready?

Let’s dive into our list of the best Japanese language datasets.


Best Japanese language datasets for NLP and machine learning

Here are our top picks for Japanese language datasets you can start using right away:

1. PheMT Dataset

Created by Fujii in 2020, the PheMT Dataset extends the MTNT dataset with additional annotations for four linguistic phenomena: proper nouns, abbreviated nouns, colloquial expressions, and spelling variants. It covers both Japanese and English, and is delivered in TSV format with 1,566 annotated entries.

This makes PheMT especially useful if you’re building Japanese NLP models that need to handle informal or noisy user-generated text (for example, social media comments or forum posts).
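
To give a feel for working with a TSV file like this, here is a minimal Python sketch that counts entries per linguistic phenomenon. The column names (phenomenon, source, target) and the sample rows are made up for illustration and are not PheMT's actual schema, so check the real file's header before adapting it.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The column names are assumptions for illustration,
# not PheMT's real schema.
SAMPLE_TSV = (
    "phenomenon\tsource\ttarget\n"
    "proper_noun\t東京タワーに行った。\tI went to Tokyo Tower.\n"
    "colloquial\tめっちゃ楽しかった!\tIt was super fun!\n"
    "proper_noun\t京都は寒い。\tKyoto is cold.\n"
)


def count_by_phenomenon(tsv_file):
    """Count rows per value of the (hypothetical) 'phenomenon' column."""
    # QUOTE_NONE keeps stray quotation marks in user-generated text
    # from being interpreted as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["phenomenon"] for row in reader)


counts = count_by_phenomenon(io.StringIO(SAMPLE_TSV))
print(counts)
```

For a file on disk, you would pass open(path, encoding="utf-8", newline="") instead of the in-memory sample.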

Access the dataset

2. CC100-Japanese Dataset

Created by Conneau & Wenzek in 2020, the CC100-Japanese corpus is part of the CC100 collection: 100 monolingual datasets extracted from Common Crawl snapshots (January–December 2018). The Japanese portion is around 15 GB of cleaned web text, provided in plain text format.

If you’re training large-scale language models or pre-training encoders, this is one of the biggest openly available Japanese text datasets you can start from.
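
A plain text corpus of this size is best read as a stream rather than loaded into memory at once. The sketch below assumes documents are separated by blank lines, with one paragraph per line; verify that against the file you download. The function name and the sample sentences are our own.

```python
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily group lines into documents, splitting on blank lines.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # Emit the last document if the file does not end with a blank line.
        yield paragraphs


# Made-up sample standing in for the real corpus file.
sample = io.StringIO("今日は晴れです。\n散歩に行きます。\n\n明日は雨でしょう。\n")
docs = list(iter_documents(sample))
print(len(docs), "documents")
```

With the real corpus you would iterate over open(path, encoding="utf-8") and consume the generator directly, without wrapping it in list().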

Access the dataset

3. HJDataset

Created by Shen in 2020, HJDataset contains over 250,000 layout element annotations across seven different element types in Japanese documents. The dataset is available in JSON format and is designed for document layout analysis and understanding.

This is a strong fit if you’re working on Japanese document OCR, page segmentation, or building models that need to understand the structure of reports, forms, or scanned documents.
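
As a rough illustration of inspecting JSON layout annotations, the sketch below counts layout elements per type. It assumes a COCO-style structure with "categories" and "annotations" lists. The category names and bounding boxes are invented for the example, so confirm the real schema in the dataset's documentation.

```python
import json
from collections import Counter

# Invented sample in an assumed COCO-style layout; not real HJDataset content.
SAMPLE_JSON = """
{
  "categories": [
    {"id": 1, "name": "title"},
    {"id": 2, "name": "text_region"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [40, 30, 200, 50]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [40, 100, 200, 400]},
    {"id": 12, "image_id": 2, "category_id": 2, "bbox": [35, 90, 210, 380]}
  ]
}
"""


def count_elements(coco: dict) -> Counter:
    """Count annotations per category name."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


layout_counts = count_elements(json.loads(SAMPLE_JSON))
print(layout_counts)
```

A per-type count like this is a quick way to spot class imbalance before you train a layout detection model.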

Access the dataset

4. Business Scene Dialogue (BSD) Dataset

Created by Rikters in 2020, the Business Scene Dialogue (BSD) Dataset contains 955 business scenarios and around 30,000 parallel sentences in English–Japanese. It focuses on realistic business conversations, making it ideal for training machine translation systems or dialogue models in professional contexts. The data is available in JSON format.

If you’re building chatbots, translation tools, or conversation agents for Japanese business use cases, BSD is a highly relevant Japanese language dataset to start with.
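
For machine translation, you will usually want to flatten scenario-based JSON into plain sentence pairs. This sketch assumes each scenario holds a "conversation" list whose turns carry "ja_sentence" and "en_sentence" fields. Treat those field names as assumptions to check against the actual files; the dialogue itself is a made-up example.

```python
import json
from typing import Iterable, Iterator, Tuple

# Made-up scenario. The field names are assumptions to verify against BSD.
SAMPLE_JSON = """
[
  {
    "id": "demo-001",
    "title": "Scheduling a meeting",
    "conversation": [
      {"no": 1,
       "ja_sentence": "会議は何時からですか。",
       "en_sentence": "What time does the meeting start?"},
      {"no": 2,
       "ja_sentence": "午後三時からです。",
       "en_sentence": "It starts at 3 p.m."}
    ]
  }
]
"""


def iter_sentence_pairs(scenarios: Iterable[dict]) -> Iterator[Tuple[str, str]]:
    """Yield (Japanese, English) pairs in conversation order."""
    for scenario in scenarios:
        for turn in scenario["conversation"]:
            yield turn["ja_sentence"], turn["en_sentence"]


pairs = list(iter_sentence_pairs(json.loads(SAMPLE_JSON)))
for ja, en in pairs:
    print(ja, "->", en)
```

Keeping the pairs in conversation order lets you add the previous turns as context later if you move from sentence-level to dialogue-level translation.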

Access the dataset

5. MNIST Dataset

The MNIST database of handwritten digits contains a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger dataset from NIST, where the digits have been size-normalized and centered in fixed-size images. MNIST is available in simple binary files that are easy to parse.

While this isn’t a Japanese language dataset, it’s still a useful baseline for testing computer vision, OCR, and pattern recognition methods before you move on to more complex Japanese-specific handwriting or character datasets.
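
To show how simple the binary format is, here is a small parser for the IDX image layout MNIST uses: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The example runs on a tiny synthetic file built in memory instead of the real download, and the function name is our own.

```python
import struct
from typing import List


def parse_idx_images(data: bytes) -> List[List[List[int]]]:
    """Parse an IDX image file into a list of images (rows of pixel values)."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number: {magic}")
    pixels = data[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data does not match the header dimensions")
    return [
        [list(pixels[i * size + r * cols : i * size + (r + 1) * cols])
         for r in range(rows)]
        for i in range(count)
    ]


# Synthetic file: two 2x2 "images" (real MNIST images are 28x28).
fake_file = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
images = parse_idx_images(fake_file)
print(images)
```

The label files follow the same pattern with magic number 2049 and an 8-byte header (magic number and item count), followed by one byte per label.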

Access the dataset

6. JaQuAD: Japanese Question Answering Dataset

Released in 2022, JaQuAD (Japanese Question Answering Dataset) is a human-annotated dataset for Japanese machine reading comprehension. It’s designed as a SQuAD-style QA dataset in Japanese, with 39,696 question–answer pairs. Questions and answers are manually curated by human annotators, and the contexts are sourced from Japanese Wikipedia articles. The dataset is provided in text/JSON format.

JaQuAD is ideal if you’re training Japanese QA models or evaluating how well your Japanese language model can read, understand, and answer questions about real-world text.
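
Before training on a SQuAD-style dataset, it helps to check that every answer span lines up with its context. The sketch below assumes the nested SQuAD v1.1 layout (data, paragraphs, qas, answers with a character-level answer_start); the copy you download may be flattened differently. The sample passage and questions are our own, not taken from JaQuAD.

```python
import json
from typing import Iterator

# Made-up sample in an assumed SQuAD v1.1 layout; not real JaQuAD content.
SAMPLE_JSON = """
{"data": [{"title": "富士山", "paragraphs": [{
  "context": "富士山は日本で最も高い山である。",
  "qas": [
    {"id": "demo-1", "question": "日本で最も高い山は何ですか。",
     "answers": [{"text": "富士山", "answer_start": 0}]},
    {"id": "demo-2", "question": "富士山はどこの国にありますか。",
     "answers": [{"text": "日本", "answer_start": 4}]}
  ]}]}]}
"""


def iter_examples(squad: dict) -> Iterator[dict]:
    """Flatten the nested structure into one record per answer."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": paragraph["context"],
                        "answer": answer["text"],
                        "answer_start": answer["answer_start"],
                    }


def span_matches(example: dict) -> bool:
    """Check that the answer text sits at answer_start in the context."""
    start = example["answer_start"]
    end = start + len(example["answer"])
    return example["context"][start:end] == example["answer"]


examples = list(iter_examples(json.loads(SAMPLE_JSON)))
print(all(span_matches(ex) for ex in examples))
```

Python strings are indexed by character, so the offsets work directly on Japanese text as long as the dataset also counts characters and not bytes.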

Access the dataset


Wrapping up

To recap, here are our top picks for datasets you can use across different Japanese NLP and machine learning tasks:

  1. PheMT Dataset
  2. CC100-Japanese Dataset
  3. HJDataset
  4. Business Scene Dialogue (BSD) Dataset
  5. MNIST Dataset
  6. JaQuAD: Japanese Question Answering Dataset

We hope this list has either helped you find a dataset for your project or shown you the range of options available to you.

If you would like to find out more about how we can help build a custom dataset for your project, please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives.

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.