Arabic Dialect Identification

  • Kamela Al-Mannai

Student thesis: Master's Dissertation

Abstract

With large streams of dialectal Arabic data arriving from social media, researchers have shifted their focus from Modern Standard Arabic (MSA) to dialectal Arabic. Some researchers have left behind the rich library of text-mining tools tailored for MSA and started developing dialect-specific tools from scratch. Others have chosen to invest in the existing MSA tools by extending their applicability to dialects. Regardless of which approach a researcher takes, the first challenge remains the same: how to identify the Arabic variant(s) in which the data is written.

The dialect identification task is classically approached by hiring human annotators. Multiple annotators are commonly assigned to label each sentence in order to maintain good accuracy. The time and cost needed to finish the task are directly proportional to the size of the data. Bearing in mind the large size of online data, the classical method is not very practical. In this paper, a recent machine-based approach is explored. The dataset employed is an open-source dialectal dataset labeled using source information. Features are sub-word tokens extracted with a trained BPE-based segmentation model. A separate n-gram model is trained for each dialect that appears in the dataset. When a new sequence of features is passed to the system, each dialectal language model scores the sequence. The sequence is labeled with the dialect corresponding to the model with the highest probability score.

Incorporating segmentation into the dialect identification framework yielded a 5-point improvement over the baseline results. Similar performance was maintained when the system was applied to out-of-domain test datasets. Therefore, for distinguishing between closely related languages with morphological differences, such as Arabic dialects, segmentation could help in extracting frequent language-specific sub-word features and reducing data sparsity.
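The scoring scheme described in the abstract (one n-gram model per dialect, with the label taken from the highest-scoring model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the bigram order, the add-k smoothing, the class and function names, and the toy tokens and labels. The input sequences are assumed to be already segmented into sub-word tokens; the BPE segmentation step itself is not implemented.

```python
import math
from collections import Counter


class BigramLM:
    """Add-k smoothed bigram language model over token sequences.

    Bigram order and add-k smoothing are illustrative assumptions.
    """

    def __init__(self, k=0.1):
        self.k = k
        self.bigrams = Counter()   # counts of (previous, current) token pairs
        self.contexts = Counter()  # counts of each token used as a context
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            tokens = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, seq, vocab_size):
        """Smoothed log-probability of a token sequence under this model."""
        tokens = ["<s>"] + list(seq) + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + self.k
            den = self.contexts[prev] + self.k * vocab_size
            total += math.log(num / den)
        return total


def train_models(labeled_data, k=0.1):
    """Train one model per dialect; labeled_data maps label -> token sequences."""
    models = {}
    for dialect, sequences in labeled_data.items():
        lm = BigramLM(k)
        lm.train(sequences)
        models[dialect] = lm
    return models


def identify(models, seq):
    """Score seq with every dialect model and return (best label, all scores)."""
    # Shared vocabulary size, plus one slot reserved for unseen tokens.
    vocab_size = len(set().union(*(m.vocab for m in models.values()))) + 1
    scores = {d: m.log_prob(seq, vocab_size) for d, m in models.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy placeholder tokens and labels, not real dialect data.
    toy_data = {
        "dialect_A": [["x", "y", "z"], ["x", "y"]],
        "dialect_B": [["p", "q"], ["p", "q", "r"]],
    }
    models = train_models(toy_data)
    label, scores = identify(models, ["x", "y"])
    print(label, scores)
```

Using a shared vocabulary size across all models keeps the smoothed scores comparable between dialects, which matters because the final decision is a direct comparison of the per-model log-probabilities.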
Date of Award: 2018
Original language: American English
Awarding Institution
  • HBKU College of Science and Engineering

