TY - GEN
T1 - Active Learning for Multidialectal Arabic POS Tagging
AU - Akra, Diyam
AU - Khalilia, Mohammed
AU - Jarrar, Mustafa
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025/11
Y1 - 2025/11
N2 - Multidialectal Arabic POS tagging is challenging due to the morphological richness and high variability among dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing annotating the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments revealed that annotating approximately 15, 000 tokens is sufficient for high performance. We further demonstrate that using a fine-tuned model from one dialect to guide the selection of initial samples from another dialect accelerates convergence—reducing the annotation requirement by about 2, 000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16, 000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.
AB - Multidialectal Arabic POS tagging is challenging due to the morphological richness and high variability among dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing annotating the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments revealed that annotating approximately 15, 000 tokens is sufficient for high performance. We further demonstrate that using a fine-tuned model from one dialect to guide the selection of initial samples from another dialect accelerates convergence—reducing the annotation requirement by about 2, 000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16, 000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.
UR - https://www.scopus.com/pages/publications/105028993458
U2 - 10.18653/v1/2025.findings-emnlp.1359
DO - 10.18653/v1/2025.findings-emnlp.1359
M3 - Conference contribution
AN - SCOPUS:105028993458
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 24960
EP - 24973
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rose, Carolyn
A2 - Peng, Violet
PB - Association for Computational Linguistics (ACL)
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Y2 - 4 November 2025 through 9 November 2025
ER -