Top French Language Datasets of 2022

At Twine, we specialize in helping companies create high-quality custom audio and video datasets.  

We are often asked whether there are any off-the-shelf audio and video datasets we would recommend, both for testing and as a starting point for custom approaches.

We’ve scoured the web to find the top French Language datasets, so you don’t have to.

Are you ready? Let’s dive into our list of the best French Language datasets.


Here are our top picks for French Language Datasets:

1. Biggest Non-Commercial French Language Dataset

The SIWIS French Speech Synthesis Database includes high-quality French speech recordings and associated text files, intended for building TTS systems and for investigating multiple speaking styles and emphasis. Texts from various sources, such as parliamentary debates and novels, were read by a professional French voice talent. A subset of the database contains emphasized words in many different contexts.

Features:

  • 9750 utterances from various sources
  • more than ten hours of speech data
  • freely available

Access the dataset here

Not quite your style? Check out these alternatives:

  • CC100-French is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The French corpus is 14G in size.
  • The Translation-Augmented-LibriSpeech-Corpus (Libri-Trans) Dataset is an augmentation of LibriSpeech ASR and contains English utterances (from audiobooks) automatically aligned with French text. It offers 236 hours of speech aligned to translated text, in English and French, as text and WAV files.
  • BREF-120 consists of about 50-60 sentences per speaker, with recordings made using only a Shure microphone. In BREF-80, the sentences were chosen to cover as many prompts as possible.

2. Best Child Adult Interaction French Language Dataset

The project “Treatment of Oral Corpus in French” (TCOF) was born from the desire to preserve oral corpora collected in the 80s and 90s for personal research purposes. 

Features:

  • 626 transcriptions (Transcriber and WAV), with a total duration of 146 hours for 1,542,562 words

Access the dataset here

3. Best Canadian French Language Dataset

The Canadian French Emotional (CaFE) speech dataset contains six different sentences, pronounced by six male and six female actors, in six basic emotions plus one neutral emotion. The six basic emotions are acted in two different intensities. 

Features:

  • Audio digitally recorded at high resolution (192 kHz sampling rate, 24 bits per sample).
  • Freely available under a Creative Commons license (CC BY-NC-SA 4.0).
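To get a feel for the size of the CaFE design, here is a minimal Python sketch that enumerates the recording grid described above. It assumes every actor recorded every sentence in every condition, which the description suggests but does not state outright. All labels (`sentence_1`, `male_1`, `emotion_1`, `intensity_1`) are placeholders we made up for illustration, not the dataset's real file or label names.

```python
from itertools import product

# Counts come from the dataset description; every label below is a placeholder.
SENTENCES = [f"sentence_{i}" for i in range(1, 7)]
ACTORS = [f"{sex}_{i}" for sex in ("male", "female") for i in range(1, 7)]
BASIC_EMOTIONS = [f"emotion_{i}" for i in range(1, 7)]
INTENSITIES = ["intensity_1", "intensity_2"]

# Six basic emotions at two intensities each, plus one neutral condition.
CONDITIONS = list(product(BASIC_EMOTIONS, INTENSITIES))
CONDITIONS.append(("neutral", None))


def expected_recordings():
    """List every (actor, sentence, emotion, intensity) combination."""
    return [
        (actor, sentence, emotion, intensity)
        for actor, sentence, (emotion, intensity) in product(
            ACTORS, SENTENCES, CONDITIONS
        )
    ]


recordings = expected_recordings()
# 12 actors x 6 sentences x 13 conditions under the full-grid assumption
print(len(CONDITIONS), len(recordings))  # 13 936
```

A grid like this is handy as a checklist: once you have downloaded the files, you can compare what you actually have against the expected combinations to spot anything missing.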

Access the dataset here

Alternatives:

  • The CALLFRIEND Canadian French dataset consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).

4. Best French Native Reading Comprehension Dataset

FQuAD is a French Native Reading Comprehension dataset that consists of more than 25,000 questions written by higher-education students about a set of Wikipedia articles.

Features:

  • Over 120 articles
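If you want to sanity-check figures like these against your own download, the sketch below counts articles, paragraphs, and questions. It assumes the file follows the SQuAD-style JSON layout (`data`, then `paragraphs`, then `qas`); the description above does not state the file format, so treat that layout as an assumption and check the dataset's own documentation. The `load_dataset` helper and the tiny `sample` are hypothetical and only there to make the example runnable.

```python
import json


def load_dataset(path):
    """Read a JSON file from disk (the path is whatever you downloaded)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_squad_style(dataset):
    """Count articles, paragraphs and questions in a SQuAD-style dict."""
    articles = dataset.get("data", [])
    paragraphs = [p for a in articles for p in a.get("paragraphs", [])]
    questions = [q for p in paragraphs for q in p.get("qas", [])]
    return {
        "articles": len(articles),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
    }


# Made-up sample in the assumed layout, not real FQuAD content.
sample = {
    "data": [
        {
            "title": "Paris",
            "paragraphs": [
                {
                    "context": "Paris est la capitale de la France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Quelle est la capitale de la France ?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "De quel pays Paris est-elle la capitale ?",
                            "answers": [{"text": "la France", "answer_start": 25}],
                        },
                    ],
                }
            ],
        }
    ]
}

print(count_squad_style(sample))
# {'articles': 1, 'paragraphs': 1, 'questions': 2}
```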

Access the dataset here

5. Best Scripted French Language Dataset

The French Scripted Speech Corpus dataset consists of 325 hours of transcribed, scripted French speech covering daily-use sentences, news, commands and queries, and keyword spotting.

Features:

  • Contributions by 489 speakers
  • Recorded on mobile devices in quiet, indoor environments
  • WAV (PCM) 16 kHz, 16 bits, mono
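Before feeding recordings into a pipeline, it can help to confirm that each file really matches the advertised format. Here is a minimal sketch using Python's standard `wave` module that checks for 16 kHz, 16-bit, mono audio. The helper names (`read_wav_format`, `matches_expected`, `write_silence`) are our own, and the demo writes a silent test file rather than touching any real corpus data.

```python
import os
import tempfile
import wave

# Format listed in the features above: 16 kHz, 16 bits (2 bytes), mono.
EXPECTED = {"sample_rate_hz": 16_000, "sample_width_bytes": 2, "channels": 1}


def read_wav_format(path):
    """Return the sample rate, sample width and channel count of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_expected(path):
    """True if the file has exactly the expected format."""
    return read_wav_format(path) == EXPECTED


def write_silence(path, sample_rate=16_000, seconds=1.0):
    """Write a silent 16-bit mono WAV file, used only for the demo below."""
    frame_count = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frame_count)


with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.wav")
    write_silence(demo_path)
    print(read_wav_format(demo_path), matches_expected(demo_path))
```

Note that the `wave` module only reads uncompressed PCM WAV files, which is fine for the format listed here but will raise an error on other encodings.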

Access the dataset here


Wrapping up

To conclude, here are our top picks for the best French Language datasets:

  1. Biggest Non-Commercial French Language Dataset – SIWIS French Speech Synthesis Database
  2. Best Child Adult Interaction French Language Dataset – Treatment of Oral Corpus in French
  3. Best Canadian French Language Dataset – The Canadian French Emotional Dataset
  4. Best French Native Reading Comprehension Dataset – FQuAD
  5. Best Scripted French Language Dataset – The French Scripted Speech Corpus

We hope that this list has either helped you find a dataset for your project or shown you the myriad options available to you.

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.
