CAFE: Spontaneous code-switching speech dataset in Algerian dialect, French and English

  • Houssam Eddine Othman Lachemat*
  • Abbas Akli
  • Nourredine Oukas
  • Yassine El Kheir
  • Samia Haboussi
  • Shammur Absar Chowdhury

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Publicly available datasets capturing spontaneous multilingual speech, especially those involving code-switching between Algerian Arabic, French, and English, are critically scarce. This lack of resources hinders the development of automatic speech recognition (ASR) and multilingual NLP systems for low-resource languages and under-represented Arabic dialects. We introduce CAFE, a novel dataset comprising approximately 37 h of spontaneous, in vivo human–human conversations among 100+ speakers across Algeria. The dialogues cover diverse everyday topics such as sports, science, and technology, and exhibit a rich range of natural conversational phenomena, including explicit code-switching, overlapping speech, non-lexical vocalizations (e.g., laughter, fillers, ambient noise), and dialectal variation reflecting Algeria’s sociolinguistic landscape.

CAFE is released in two tiers:

  • CAFE-small (2 h 36 m): a fully human-annotated subset with high-quality transcriptions, vocal event labels, and linguistic annotations, supporting ASR evaluation, NLP tasks, and code-switching analysis.
  • CAFE-large (∼34 h 35 m): the remainder of the corpus, automatically labeled, suitable for pretraining and semi-supervised learning.

To support controlled experiments, CAFE-small includes two curated subsets:

  • CAFE-small-clean (2 h 18 m): utterances with no overlapping speech.
  • CAFE-small-overlap (17 m): 23 files with overlap segments and timestamps.

The dataset also provides rich metadata, including audio chunk IDs, dialect labels, and both raw and linguistically processed transcripts. CAFE offers a valuable resource for advancing ASR, dialect identification, and sociolinguistic analysis in multilingual and low-resource settings.
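As a quick sanity check on the durations quoted in the abstract, the short sketch below converts each tier and subset to minutes and sums them. The dictionary layout, key names, and helper functions are illustrative assumptions, not part of the CAFE release; only the durations themselves come from the abstract.

```python
# Durations quoted in the abstract, converted to minutes.
# The structure and names below are hypothetical, chosen only for illustration.

def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M m' duration to whole minutes."""
    return hours * 60 + minutes

def fmt(total_minutes: int) -> str:
    """Format whole minutes back into 'H h MM m'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} m"

TIERS = {
    "CAFE-small": to_minutes(2, 36),   # fully human-annotated
    "CAFE-large": to_minutes(34, 35),  # automatically labeled; stated as approximate
}

SMALL_SUBSETS = {
    "CAFE-small-clean": to_minutes(2, 18),    # no overlapping speech
    "CAFE-small-overlap": to_minutes(0, 17),  # 23 files with overlap segments
}

corpus_total = sum(TIERS.values())
subset_total = sum(SMALL_SUBSETS.values())

print("Corpus total:", fmt(corpus_total))
print("CAFE-small subsets total:", fmt(subset_total))
print("Gap vs. CAFE-small:", TIERS["CAFE-small"] - subset_total, "min")
```

The two tiers sum to 37 h 11 m, consistent with the "approximately 37 h" figure. The two curated subsets sum to 2 h 35 m, one minute short of the stated 2 h 36 m for CAFE-small; the abstract does not explain this gap, and rounding of the individual figures is one possible cause.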

Original language: English
Article number: 112150
Journal: Data in Brief
Volume: 63
Publication status: Published - Dec 2025

Keywords

  • Automatic speech recognition
  • Dialect identification
  • Linguistic annotation
  • Low-resource languages
  • Overlapping speech
  • Pseudo-labeling
  • Speech segmentation
