Active Learning for Multidialectal Arabic POS Tagging

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multidialectal Arabic POS tagging is challenging due to the morphological richness and high variability among dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing annotating the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments revealed that annotating approximately 15,000 tokens is sufficient for high performance. We further demonstrate that using a fine-tuned model from one dialect to guide the selection of initial samples from another dialect accelerates convergence, reducing the annotation requirement by about 2,000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16,000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.
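
As a rough illustration of the kind of loop the abstract describes, the following is a minimal sketch of pool-based active learning with a token-budget stopping point. It assumes entropy-based uncertainty sampling, which the abstract does not specify, so it should not be read as the paper's actual selection criterion. The names `select_batch`, `active_learning`, `train`, and `annotate` are hypothetical, and the model is assumed to be any callable that maps a sentence to one tag-probability distribution per token.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]
# Assumption: a "model" is a callable returning one tag distribution per token.
Model = Callable[[Sentence], List[Sequence[float]]]


def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy of a single token's tag distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def sentence_uncertainty(dists: List[Sequence[float]]) -> float:
    """Mean token entropy; higher means the model is less sure."""
    if not dists:
        return 0.0
    return sum(token_entropy(d) for d in dists) / len(dists)


def select_batch(pool: List[Sentence], model: Model, batch_tokens: int) -> List[int]:
    """Pick the most uncertain sentences that fit within batch_tokens."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: sentence_uncertainty(model(pool[i])),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        n = len(pool[i])
        if used + n > batch_tokens:
            continue  # skip sentences that would exceed the batch size
        chosen.append(i)
        used += n
    return chosen


def active_learning(
    pool: List[Sentence],
    train: Callable[[List[Tuple[Sentence, List[str]]]], Model],
    annotate: Callable[[Sentence], List[str]],
    token_budget: int,
    batch_tokens: int,
) -> Tuple[Model, List[Tuple[Sentence, List[str]]], int]:
    """Annotate the most informative sentences until the token budget is reached."""
    labeled: List[Tuple[Sentence, List[str]]] = []
    unlabeled = list(pool)
    annotated_tokens = 0
    model = train(labeled)
    while unlabeled and annotated_tokens < token_budget:
        remaining = token_budget - annotated_tokens
        picked = select_batch(unlabeled, model, min(batch_tokens, remaining))
        if not picked:
            break  # nothing left that fits in the remaining budget
        for i in picked:
            sent = unlabeled[i]
            labeled.append((sent, annotate(sent)))
            annotated_tokens += len(sent)
        picked_set = set(picked)
        unlabeled = [s for j, s in enumerate(unlabeled) if j not in picked_set]
        model = train(labeled)  # retrain on everything annotated so far
    return model, labeled, annotated_tokens
```

Under these assumptions, the cross-dialect warm start mentioned in the abstract would correspond to having `train` return a model already fine-tuned on another dialect when `labeled` is still empty, so that the first batch is chosen by an informed model instead of an untrained one.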

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 24960-24973
Number of pages: 14
ISBN (Electronic): 9798891763357
Publication status: Published - Nov 2025
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 2025 → 9 Nov 2025

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/Territory: China
City: Suzhou
Period: 4/11/25 → 9/11/25
