TY - GEN
T1 - Casablanca
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Talafha, Bashar
AU - Kadaoui, Karima
AU - Magdy, Samar M.
AU - Habiboullah, Mariem
AU - Mohamed, Chafei
AU - El-Shangiti, Ahmed O.
AU - Zayed, Hiba
AU - Tourad, Mohamedou Cheikh
AU - Alhamouri, Rahaf
AU - Assi, Rwaa
AU - Alraeesi, Aisha
AU - Mohamed, Hoor
AU - Alwajih, Fakhraddin
AU - Mohamed, Abdelrahman
AU - El Mekki, Abdellah
AU - Nagoudi, El Moatez Billah
AU - Benelhadj, Saadia
AU - Alsayadi, Hamzah A.
AU - Al-Dhabyani, Walid
AU - Shatnawi, Sara
AU - Ech-Chammakhy, Yasir
AU - Makouar, Amal
AU - Berrachedi, Yousra
AU - Jarrar, Mustafa
AU - Shehata, Shady
AU - Berrada, Ismail
AU - Abdul-Mageed, Muhammad
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: https://www.dlnlp.ai/speech/casablanca.
UR - https://www.scopus.com/pages/publications/85214730620
U2 - 10.18653/v1/2024.emnlp-main.1211
DO - 10.18653/v1/2024.emnlp-main.1211
M3 - Conference contribution
AN - SCOPUS:85214730620
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 21745
EP - 21758
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -