Abstract
Publicly available datasets capturing spontaneous multilingual speech, especially those involving code-switching between Algerian Arabic, French, and English, are critically scarce. This lack of resources hinders the development of automatic speech recognition (ASR) and multilingual NLP systems for low-resource languages and under-represented Arabic dialects. We introduce CAFE, a novel dataset comprising approximately 37 h of spontaneous, in vivo human–human conversations among 100+ speakers across Algeria. The dialogues cover diverse everyday topics such as sports, science, and technology, and exhibit a rich range of natural conversational phenomena, including explicit code-switching, overlapping speech, non-lexical vocalizations (e.g., laughter, fillers, ambient noise), and dialectal variation reflecting Algeria’s sociolinguistic landscape.

CAFE is released in two tiers:
- CAFE-small (2 h 36 m): A fully human-annotated subset with high-quality transcriptions, vocal event labels, and linguistic annotations, supporting ASR evaluation, NLP tasks, and code-switching analysis.
- CAFE-large (∼34 h 35 m): The remainder of the corpus, automatically labeled, suitable for pretraining and semi-supervised learning.

To support controlled experiments, CAFE-small includes two curated subsets:
- CAFE-small-clean (2 h 18 m): Contains utterances with no overlapping speech.
- CAFE-small-overlap (17 m): Contains 23 files with overlap segments and timestamps.

The dataset also provides rich metadata, including audio chunk IDs, dialect labels, and both raw and linguistically processed transcripts. CAFE offers a valuable resource for advancing ASR, dialect identification, and sociolinguistic analysis in multilingual and low-resource settings.
| Original language | English |
|---|---|
| Article number | 112150 |
| Journal | Data in Brief |
| Volume | 63 |
| Publication status | Published - Dec 2025 |
Keywords
- Automatic speech recognition
- Dialect identification
- Linguistic annotation
- Low-resource languages
- Overlapping speech
- Pseudo-labeling
- Speech segmentation