Audio Datasets

VoxCeleb

VoxCeleb is a large-scale audio-visual speech dataset built from YouTube interview clips, widely used to train and benchmark deep speaker recognition models for speaker verification, speaker identification, and robust “in-the-wild” voice AI.

Mandarin (Shanghai) (China) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin spoken in Shanghai, China

Romanian (Romania) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Romanian spoken in Romania

Polish (Poland) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Polish spoken in Poland

Panjabi (Pakistan) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Panjabi spoken in Pakistan

Mongolian (Mongolia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mongolian spoken in Mongolia

Mandarin (Traditional) (Taiwan) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin (Traditional) spoken in Taiwan

Mandarin (Simplified) (China) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Mandarin (Simplified) spoken in China

Lao (Laos) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Lao spoken in Laos

Kannada (India) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Kannada spoken in India

Greek (Greece) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Greek spoken in Greece

German (Switzerland) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, German spoken in Switzerland

French (Algeria) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, French spoken in Algeria

Farsi/Persian (Iran) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Farsi/Persian spoken in Iran

English (UAE) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in the UAE

English (Philippines) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in the Philippines

English (Hong Kong) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in Hong Kong

English (Australia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, English spoken in Australia

Dutch (Netherlands) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Dutch spoken in the Netherlands

Dutch (Belgium) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Dutch spoken in Belgium

Spanish (Mexico) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Spanish spoken in Mexico

Spanish (ESP) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Spanish spoken in Spain

Catalan (Spain) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Catalan spoken in Catalonia, Spain.

Sinhalese (Sri Lanka) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Sinhalese spoken in Sri Lanka

Vietnamese General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Vietnamese spoken in Vietnam

Tamil General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Tamil spoken in India

Singaporean-English General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Singaporean-English spoken in Singapore

Punjabi General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Punjabi spoken in India

Malay General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Malay spoken in Malaysia

Bahasa (Indonesia) General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Bahasa spoken in Indonesia

Thai General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Thai spoken in Thailand

Gujarati General Conversation data

Unscripted conversation between two people. Approx. Audio Duration (Range) - 15-60 minutes, Gujarati spoken in India.

UK Accents Dataset

Various UK Accents

Arabic (Saudi Arabia) language conversational telephony

Dataset is fully transcribed and timestamped.

Arabic (Dubai) language conversational telephony

Dataset is fully transcribed and timestamped.

UK Voice Commands Dataset

Voice Commands in the English Language

Portuguese (PT) language conversational telephony

Dataset is fully transcribed and timestamped.

German language conversational telephony

Dataset is fully transcribed and timestamped.

Bengali (Bangladesh) conversational telephony

Dataset is fully transcribed and timestamped.

Phone Conversations in Hindi

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Japanese

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Spanish

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in French

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Indian English

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two speakers.

Phone Conversations in Irish English

The data set includes 2 hours of time-stamped and transcribed unscripted speech data (i.e. natural conversation) between two Irish speakers.

Alexa Wake Words in Canadian French (Adults)

This data set contains recordings of the wake word "Alexa" in Canadian French (fr_CA) (e.g., "Alexa, raconte-moi une blague.").

Alexa Voice Commands in EU Spanish (Adults)

This data set contains recordings of voice commands that include the wake word "Alexa" in EU Spanish (es_ES) (e.g., "Eh, Alexa, cuéntame un chiste."). Each participant recorded 70 utterances on average (minimum 50, maximum 75). This data set contains the voice command only.

Alexa Wake Words in Mexican Spanish (Adults)

This data set contains recordings of the wake word "Alexa" in Mexican Spanish (es_MX) used in voice commands (e.g., "Oye Alexa, cuéntame un chiste.").

Wake Words

Siri Wake Words and Voice Commands in US English

US English voice commands including the wake word "Hey Siri" from 103 participants aged 19-68.

Wake Words and Voice Commands in Korean with Seoul Dialect

Korean voice commands including the wake word "Hi Bixby" from 52 participants in Seoul.