The Best Turkish Language Datasets for NLP and Speech

Turkish is one of the most widely spoken languages in the world, but high-quality Turkish language datasets can still be surprisingly hard to find. If you’re training Turkish NLP models, building chatbots, or working on speech recognition, you need reliable text and audio data – not random scrapes from the web.

That’s why we’ve done the hard work for you. At Twine, we’ve searched high and low for the best open Turkish datasets across text, speech, and parallel corpora, so you can spend less time hunting for data and more time shipping models.

Ready to explore what’s out there and spot any gaps that might need a custom dataset? Let’s dive in.


Here are our top picks for Turkish Language datasets:

1. TS Corpus Project

TS Corpus is a free and independent project that aims to build Turkish corpora, develop natural language processing tools, and compile linguistic datasets. Users can run queries, save them, and download the hit sets to their own machines. The 14 published corpora together serve over 1.3 billion tokens drawn from a variety of sources: online newspapers, forums, social media, academic papers, and more.

Why is it useful?

  • Great starting point for large-scale Turkish NLP datasets, especially language modelling and topic modelling.
  • Multiple domains (news, forums, social media, academia) help you test how your model behaves across different registers.
  • Ideal if you want to analyse real-world Turkish usage without building your own web crawler.

Access the dataset

2. Turkish National Corpus (TNC)

Turkish National Corpus is designed to be a balanced, large-scale (50 million words) and general-purpose corpus for contemporary Turkish. It consists of samples of textual data across a wide variety of genres covering a period of 24 years (1990-2013).

The written component consists of texts produced in different domains on various topics. The spoken component, transcriptions of spontaneous everyday conversations and speeches collected in particular communicative settings, makes up 2% of the TNC database.

Best for:
If you need a balanced Turkish corpus for linguistics, grammar studies, or evaluation, TNC is a strong choice. Its carefully curated mix of genres and time periods makes it more controlled than web-scraped data, which is useful for benchmarking and academic work.

Access the dataset

3. Bilkent Turkish Writings Dataset

This dataset contains texts from Turkish creative writing courses taught between 2014 and 2018. All in all, there are nearly 7,000 texts available for download in CSV format.

Best for:

  • Building models that understand learner language, spelling errors, and informal written Turkish.
  • Training tools for automated essay scoring, grammar feedback, and educational applications.
  • Analysing style and creativity, thanks to the free-form writing prompts in the dataset.
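Since the dataset ships as CSV, getting a first look is straightforward with pandas. Here's a minimal sketch; the sample rows and column names below are illustrative assumptions, so check them against the actual download:

```python
import io
import pandas as pd

# Hypothetical sample mimicking a CSV layout for this dataset; the real
# file's columns may differ, so adjust names to the actual download.
sample_csv = io.StringIO(
    "author,year,text\n"
    'ogrenci_1,2015,"Bir yaz günü sahilde yürüyordum."\n'
    'ogrenci_2,2017,"Teknoloji hayatımızı hızla değiştiriyor."\n'
)

df = pd.read_csv(sample_csv)

# Basic corpus statistics: document count and per-text length in words.
df["n_words"] = df["text"].str.split().str.len()
print(len(df), df["n_words"].tolist())
```

A quick pass like this (document counts, length distributions, year coverage) is worth doing before training, since learner-writing corpora often contain very short or duplicated texts.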

Access the dataset

4. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset

TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

Two noise-reduction methodologies, (a) domain-dependent and (b) domain-independent, were applied when post-processing the raw collections. The Turkish collections contain approximately 700K sentences per version (the exact count varies between versions), while the English collections contain more than 7M sentences.

Why it stands out:

  • Sentences are automatically annotated for named entities and categories, so you can jump straight into model training instead of labelling from scratch.
  • Both Turkish and English versions are available, which is handy for multilingual and cross-lingual experiments.
  • The noise-reduction strategies make it a pragmatic choice when you need scale plus reasonable label quality.

Access the dataset

5. Middle East Technical University Turkish Microphone Speech v 1.0

Middle East Technical University Turkish Microphone Speech v 1.0 was developed at Middle East Technical University (METU) and contains text, speech, and alignment files for approximately 5.6 hours of recorded Turkish.

The corpus is approximately 600 MB in size. Its 120 speakers (60 male and 60 female) each read 40 sentences (approximately 300 words per speaker). The 40 sentences are selected randomly for each speaker from a triphone-balanced set of 2,462 Turkish sentences.

The speakers are selected from students, faculty, and staff at METU and all are native speakers of Turkish. The age range is from 19 to 50 years with an average of 23.9 years. The data has been digitally recorded with a Sound Blaster sound card on a PC at a 16 kHz sampling rate.

Good fit for:

  • Proof-of-concept ASR models in Turkish, where you want clean, studio-style recordings.
  • Speaker-balanced experiments, since you get 120 speakers with controlled prompts.

If you need something bigger or closer to “in-the-wild” Turkish speech, consider combining this dataset with larger open-source speech corpora like Turkish Speech Corpus or Common Voice.

Access the dataset

6. Turkish Broadcast News Speech and Transcripts

Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts.

The VOA material was collected between December 2006 and June 2009 using a PC with a TV/radio card. The data collected during 2006-2008 was recorded from analog FM radio, while the 2009 broadcasts were recorded from digital satellite transmissions.

Best for:

  • News-domain ASR, diarisation, and broadcast monitoring use cases.
  • Building models that handle formal Turkish, typical of newsreaders and political/current-affairs content.

Bear in mind that news speech is more formal than everyday conversation. For customer support, assistants, or call-centre-style use cases, you’ll want more conversational data as well.

Access the dataset

7. Turkish Speech Corpus (TSC)

This open-source corpus contains over 218 hours of transcribed Turkish speech and more than 186,000 utterances, making it one of the largest publicly available Turkish speech datasets to date.

  • Mix of speakers and recording conditions for more robust acoustic models.
  • Ideal for end-to-end ASR training, fine-tuning foundation speech models, or evaluating commercial ASR APIs in Turkish.

Access the dataset

8. Mozilla Common Voice – Turkish

Mozilla Common Voice is a massive community-driven speech project, with thousands of speakers donating their voices across many languages, including Turkish.

  • Great for building real-world Turkish speech datasets with background noise, different accents, and microphone types.
  • Free and open under a permissive licence, which makes it attractive for commercial teams as well as academia.

Access the dataset

9. TTC-3600 Turkish Text Categorization Dataset

TTC-3600 is a benchmark dataset of 3,600 Turkish news articles, labelled into categories like economy, health, politics, sports, technology and more.

  • Designed specifically for text classification experiments in Turkish.
  • Available via the UCI Machine Learning Repository and widely used in research, so it’s a good baseline for comparing model performance.
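Because TTC-3600 is a standard classification benchmark, a common baseline is word-level TF-IDF features with a simple probabilistic classifier. The sketch below uses toy stand-in sentences rather than the real articles, and the category names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for TTC-3600-style (text, category) pairs; the real dataset
# has 3,600 labelled Turkish news articles.
texts = [
    "Dolar kuru bugün yükseldi",                       # economy
    "Merkez bankası faiz kararını açıkladı",           # economy
    "Takım ligde üst üste üçüncü galibiyetini aldı",   # sports
    "Milli takım hazırlık maçında sahaya çıkıyor",     # sports
]
labels = ["ekonomi", "ekonomi", "spor", "spor"]

# TF-IDF features into a naive Bayes classifier: a classic, fast baseline
# for news-category classification.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["Faiz oranları ve dolar piyasayı etkiledi"])[0]
print(pred)  # → ekonomi
```

A baseline like this gives you a reference score before investing in transformer fine-tuning, and published results on TTC-3600 make the comparison easy.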

Access the dataset

10. OPUS & OpenSubtitles Turkish Corpora

The OPUS project and OpenSubtitles collections provide huge amounts of parallel and monolingual Turkish text, often aligned with English and other languages.

  • Very helpful for machine translation, terminology extraction, and cross-lingual embeddings.
  • Subtitles bring in conversational, informal sentences that differ from news or academic text.
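OPUS commonly distributes parallel corpora in the "Moses" plain-text format: two files, one per language, aligned by line number. Here's a minimal sketch of pairing Turkish and English sentences; the in-memory samples are illustrative, not real OPUS content:

```python
import io

# Stand-ins for a Moses-format pair of files (one sentence per line,
# aligned by line number), e.g. OpenSubtitles.tr-en.tr / .en.
tr_file = io.StringIO("Merhaba dünya.\nNasılsın?\n")
en_file = io.StringIO("Hello world.\nHow are you?\n")

# Pair the two sides line by line to get (Turkish, English) tuples.
pairs = [(tr.strip(), en.strip()) for tr, en in zip(tr_file, en_file)]

print(pairs[0])  # → ('Merhaba dünya.', 'Hello world.')
```

For real files, swap the `StringIO` objects for `open(path, encoding="utf-8")` handles; keeping the two sides zipped this way preserves the line-level alignment that machine-translation toolkits expect.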

Access the dataset


How to Choose the Right Turkish Language Dataset

Not sure which of these Turkish language datasets is right for your project? Use these quick rules of thumb:

  • Building a chatbot or LLM in Turkish?
    Start with large text corpora like TS Corpus and TNC, then add domain-specific data (e.g. news, support tickets, product docs) if you need a particular tone or topic.
  • Working on Turkish speech recognition (ASR)?
    Combine clean, controlled datasets (METU speech corpus) with larger, more varied corpora (Turkish Speech Corpus, Common Voice Turkish, broadcast news) so models perform well in both lab and real-world environments.
  • Doing NER or classification?
    Use labelled datasets like Wikipedia NER collections and TTC-3600 as a starting point, then fine-tune on your own labelled examples so entities and categories match your product or domain.
  • Need something very specific (industry, accent, scenario)?
    Off-the-shelf datasets will only take you so far. For things like Turkish call-centre conversations in a particular sector, or highly sensitive domains, you’ll almost always need a custom Turkish dataset collected to your spec.

If you already have a sense of what you need (number of hours, accents, domains, or data types), Twine AI can help you scope and build a custom Turkish dataset that actually matches your production use case.

Wrapping up

To conclude, here are our top picks for the best Turkish language datasets for your projects:

  1. TS Corpus Project
  2. Turkish National Corpus (TNC)
  3. Bilkent Turkish Writings Dataset
  4. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset
  5. Middle East Technical University Turkish Microphone Speech v 1.0
  6. Turkish Broadcast News Speech and Transcripts
  7. Turkish Speech Corpus (TSC)
  8. Mozilla Common Voice – Turkish
  9. TTC-3600 Turkish Text Categorization Dataset
  10. OPUS / OpenSubtitles Turkish Corpora

We hope this list has helped you cut through the noise and quickly identify which Turkish language datasets are worth your time. Whether you’re experimenting with open corpora, benchmarking models, or pressure-testing a commercial ASR system, starting with the right data can save you weeks of trial and error.

If you can’t quite find what you need, or your use case is highly specific (think particular Turkish dialects, industries, or safety-critical domains), our team at Twine can help you collect and label a custom Turkish dataset that matches your exact requirements.

Want a rough budget before you talk to anyone? Let us help you do the math – check out our AI dataset project calculator to estimate how much your dataset will cost to build.

Ready to learn more? Check out our Dataset Archives:

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.