Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

  • Bashar Talafha*
  • Karima Kadaoui
  • Samar M. Magdy
  • Mariem Habiboullah
  • Chafei Mohamed
  • Ahmed O. El-Shangiti
  • Hiba Zayed
  • Mohamedou Cheikh Tourad
  • Rahaf Alhamouri
  • Rwaa Assi
  • Aisha Alraeesi
  • Hoor Mohamed
  • Fakhraddin Alwajih
  • Abdelrahman Mohamed
  • Abdellah El Mekki
  • El Moatez Billah Nagoudi
  • Saadia Benelhadj
  • Hamzah A. Alsayadi
  • Walid Al-Dhabyani
  • Sara Shatnawi
  • Yasir Ech-Chammakhy
  • Amal Makouar
  • Yousra Berrachedi
  • Mustafa Jarrar
  • Shady Shehata
  • Ismail Berrada
  • Muhammad Abdul-Mageed*
*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: https://www.dlnlp.ai/speech/casablanca.

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 21745-21758
Number of pages: 14
ISBN (Electronic): 9798891761643
Publication status: Published - 2024
Externally published: Yes
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 12/11/24 to 16/11/24
