TY - GEN
T1 - Crowdsourcing speech and language data for resource-poor languages
AU - Mubarak, Hamdy
N1 - Publisher Copyright: © 2018, Springer International Publishing AG.
PY - 2018
Y1 - 2018
N2 - In this paper, we present the benefits of using crowdsourcing to build speech and language resources for different annotation tasks for dialectal Arabic as an example of resource-poor languages. We show recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks, such as image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).
AB - In this paper, we present the benefits of using crowdsourcing to build speech and language resources for different annotation tasks for dialectal Arabic as an example of resource-poor languages. We show recommendations for job design and quality control that allow us to build high-quality data for a variety of tasks. Most of these recommendations are language-independent and can be applied to other languages as well. We summarize lessons learned from experiments in data acquisition tasks, such as image annotation (transcription of Arabic historical documents), machine translation (translation from English to Hindi), speech annotation (transcription of dialectal Arabic audio files), text annotation (conversion from dialectal Arabic to Modern Standard Arabic (MSA)), and text classification (annotation of offensive language on Arabic social media, and classification of questions on Arabic medical web forums).
KW - Crowdsourcing
KW - Dialectal Arabic
KW - Low-resource languages
UR - https://www.scopus.com/pages/publications/85029469138
U2 - 10.1007/978-3-319-64861-3_41
DO - 10.1007/978-3-319-64861-3_41
M3 - Conference contribution
AN - SCOPUS:85029469138
SN - 9783319648606
T3 - Advances in Intelligent Systems and Computing
SP - 440
EP - 447
BT - Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017
A2 - Tolba, Mohamed F.
A2 - Gaber, Tarek
A2 - Shaalan, Khaled
A2 - Hassanien, Aboul Ella
PB - Springer Verlag
T2 - 3rd International Conference on Advanced Intelligent Systems and Informatics, AISI 2017
Y2 - 9 September 2017 through 11 September 2017
ER -