The Top Mandarin Chinese Language Datasets of 2022

Mandarin is one of the most widely spoken languages in the world. Even so, it’s not always easy to find Mandarin Chinese language datasets to train your models.

That’s why we’ve done the hard bit for you. Here at Twine, we’ve searched high and low to find the best Mandarin Chinese Language datasets.

Are you ready?

Let’s dive in.

Here are our top picks for Mandarin Chinese Language datasets:

1. AISHELL-1 Dataset

AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.

Access the dataset

2. AISHELL-3 Dataset

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, for a total of 88,035 utterances. The speakers’ auxiliary attributes, such as gender, age group, and native accent, are explicitly marked and provided in the corpus.

Transcripts at both the Chinese character level and the pinyin level are provided along with the recordings. The word and tone transcription accuracy rate is above 98%, achieved through professional speech annotation and strict quality inspection for tone and prosody.
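As a rough illustration of what paired character-level and pinyin-level transcripts make possible, the sketch below splits one transcript line into (character, pinyin, tone) triples. The line layout used here (an utterance ID followed by alternating characters and tone-numbered pinyin syllables) is an assumption made for illustration, not a confirmed description of the corpus files, and both the function name and the sample line are made up.

```python
from typing import List, Tuple


def parse_transcript(line: str) -> Tuple[str, List[Tuple[str, str, int]]]:
    """Split one transcript line into an utterance ID and (char, pinyin, tone) triples.

    Assumed (hypothetical) layout:
        "<utt_id> <char> <pinyin+tone> <char> <pinyin+tone> ..."
    """
    utt_id, *tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"unpaired token in utterance {utt_id!r}")

    triples = []
    for char, syllable in zip(tokens[0::2], tokens[1::2]):
        # The trailing digit is taken to be the tone number.
        triples.append((char, syllable[:-1], int(syllable[-1])))
    return utt_id, triples


# Made-up sample line, not taken from the corpus.
utt_id, triples = parse_transcript("utt0001 你 ni3 好 hao3")
print(utt_id, triples)
```

Keeping the tone as a separate integer makes it easy to feed tone information to a TTS front end or to count tone distributions, if your pipeline needs that.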

Access the dataset

3. WenetSpeech Dataset

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, with 22,400+ hours in total.

The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is used to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions.
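The subset sizes quoted for WenetSpeech can be sanity-checked with a few lines of arithmetic. The sketch below uses the lower-bound figures from the description (the "+" and "about" qualifiers are dropped, so the results are approximate) and also shows what share of the corpus each subset represents. The dictionary keys are our own labels.

```python
# Lower-bound subset sizes in hours, as quoted in the description.
subset_hours = {
    "high-quality labeled": 10_000,
    "weakly labeled": 2_400,
    "unlabeled": 10_000,
}

total_hours = sum(subset_hours.values())
shares = {name: hours / total_hours for name, hours in subset_hours.items()}

print(f"total: {total_hours} hours")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Only the labeled subsets carry transcripts, so fewer than half of the total hours are directly usable for fully supervised training.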

Access the dataset

4. MAGICDATA Mandarin Chinese Read Speech Corpus

The corpus, by Magic Data Technology Co., Ltd., contains 755 hours of scripted read speech from 1,080 native speakers of Mandarin Chinese from mainland China. The sentence transcription accuracy is higher than 98%. Recordings were made in a quiet indoor environment. The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, and home command and control.
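When comparing read-speech corpora, the average amount of audio per speaker is a quick way to judge how much data each voice contributes. The sketch below computes it from the hour and speaker counts quoted in this list. The helper name is ours, and the result is only an average: the descriptions do not say how evenly the audio is spread across speakers.

```python
def minutes_per_speaker(total_hours: float, speakers: int) -> float:
    """Average recorded minutes per speaker, assuming an even split."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return total_hours * 60 / speakers


# Figures as quoted in the dataset descriptions in this list.
corpora = {
    "AISHELL-3": (85, 218),
    "MAGICDATA": (755, 1080),
}

averages = {name: minutes_per_speaker(h, n) for name, (h, n) in corpora.items()}
for name, minutes in averages.items():
    print(f"{name}: about {minutes:.1f} minutes per speaker")
```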

Access the dataset

5. LATIC Dataset

LATIC is a fully open-source, annotated non-native speech database for Chinese. Its application areas include automatic speech scoring and evaluation, as well as derived uses such as L2 teaching and the teaching of Chinese as a Foreign Language. The dataset aims to provide a relatively small-scale and highly efficient training deviation dataset. To that end, four selected non-native Chinese speakers participated in the project, with Russian, Korean, French, and Arabic as their respective mother tongues (L1s). Each speaker contributed a 1-hour audio file of valid recordings, giving 4 hours of material in total.

Access the dataset

6. CMLR Dataset

The CMLR dataset was collected by the Visual Intelligence and Pattern Analysis (VIPA) group of Zhejiang University. It was designed to facilitate research on visual speech recognition, sometimes also referred to as automatic lip-reading. The dataset consists of 102,072 spoken sentences from 11 speakers, recorded between June 2009 and June 2018 from the national news program “News Broadcast”. Each sentence is up to 29 Chinese characters long and contains no English letters, Arabic numerals, or rare punctuation.

Access the dataset

Wrapping up

To conclude, those are our top picks for the best Mandarin Chinese Language Speech datasets for your projects.

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:
