TY - JOUR
T1 - CAFE: Spontaneous code-switching speech dataset in Algerian dialect, French and English
AU - Lachemat, Houssam Eddine Othman
AU - Akli, Abbas
AU - Oukas, Nourredine
AU - El Kheir, Yassine
AU - Haboussi, Samia
AU - Chowdhury, Shammur Absar
N1 - Publisher Copyright: © 2025 The Author(s).
PY - 2025/12
Y1 - 2025/12
AB - Publicly available datasets capturing spontaneous multilingual speech, especially code-switching between Algerian Arabic, French, and English, are critically scarce. This lack of resources hinders the development of automatic speech recognition (ASR) and multilingual NLP systems for low-resource languages and under-represented Arabic dialects. We introduce CAFE, a novel dataset comprising approximately 37 h of spontaneous, in vivo human–human conversations among 100+ speakers across Algeria. The dialogues cover diverse everyday topics such as sports, science, and technology, and exhibit a rich range of natural conversational phenomena, including explicit code-switching, overlapping speech, non-lexical vocalizations (e.g., laughter, fillers, ambient noise), and dialectal variation reflecting Algeria’s sociolinguistic landscape. CAFE is released in two tiers: (1) CAFE-small (2 h 36 min), a fully human-annotated subset with high-quality transcriptions, vocal event labels, and linguistic annotations, supporting ASR evaluation, NLP tasks, and code-switching analysis; and (2) CAFE-large (∼34 h 35 min), the remainder of the corpus, automatically labeled and suitable for pretraining and semi-supervised learning. To support controlled experiments, CAFE-small includes two curated subsets: (i) CAFE-small-clean (2 h 18 min), which contains utterances with no overlapping speech, and (ii) CAFE-small-overlap (17 min), which contains 23 files with overlap segments and timestamps. The dataset also provides rich metadata, including audio chunk IDs, dialect labels, and both raw and linguistically processed transcripts. CAFE offers a valuable resource for advancing ASR, dialect identification, and sociolinguistic analysis in multilingual and low-resource settings.
KW - Automatic speech recognition
KW - Dialect identification
KW - Linguistic annotation
KW - Low-resource languages
KW - Overlapping speech
KW - Pseudo-labeling
KW - Speech segmentation
UR - https://www.scopus.com/pages/publications/105022703526
U2 - 10.1016/j.dib.2025.112150
DO - 10.1016/j.dib.2025.112150
M3 - Article
AN - SCOPUS:105022703526
SN - 2352-3409
VL - 63
JO - Data in Brief
JF - Data in Brief
M1 - 112150
ER -