With large streams of dialectal Arabic text arriving from social media, researchers
have shifted their focus from Modern Standard Arabic (MSA) to dialectal Arabic. Some
researchers have left behind the rich library of text-mining tools tailored for MSA
and started developing dialect-specific tools from scratch. Meanwhile, other researchers
have chosen to invest in the existing MSA tools by extending their validity to dialects.
Regardless of how a researcher decides to deal with the Arabic dialects,
the first challenge always remains the same: how to identify the Arabic variant(s)
the data is written in.
The dialect identification task is classically approached by hiring human annotators.
Multiple annotators are commonly assigned to label each sentence in order to maintain
good accuracy. The time and cost needed to finish the task are directly proportional
to the size of the data. Bearing in mind the large size of online data, the classical method
is not very practical. In this paper, a recent machine-based approach is explored. The
dataset employed is an open-source dialectal dataset that is labeled using source information.
Features are sub-word tokens extracted with a trained BPE-based segmentation
model. A separate n-gram model is trained for each dialect that appears in the dataset.
When a new sequence of features is passed to the system, each dialectal language model
scores the sequence. The sequence is labeled with the dialect that corresponds to the
model with the highest probabilistic score.
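
The scoring step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes the input has already been segmented into sub-word tokens (the BPE step is not shown), and it uses bigrams with add-one smoothing, neither of which is specified in this abstract. The names `BigramLM` and `DialectIdentifier`, the labels `dialect_A` and `dialect_B`, and the toy tokens are hypothetical placeholders, not real dialect data.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed convention)


class BigramLM:
    """Add-one smoothed bigram model over sub-word tokens (illustrative only)."""

    def __init__(self, sequences):
        self.context_counts = Counter()  # how often each token appears as context
        self.bigram_counts = Counter()   # how often each (context, token) pair appears
        self.vocab = {EOS}
        for seq in sequences:
            padded = [BOS] + list(seq) + [EOS]
            self.vocab.update(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, seq, vocab_size):
        """Return the smoothed log-probability of one token sequence."""
        padded = [BOS] + list(seq) + [EOS]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


class DialectIdentifier:
    """One language model per dialect; the best-scoring model gives the label."""

    def __init__(self, labelled_sequences):
        # labelled_sequences maps a dialect label to a list of token sequences.
        self.models = {d: BigramLM(seqs) for d, seqs in labelled_sequences.items()}
        shared_vocab = set()
        for model in self.models.values():
            shared_vocab |= model.vocab
        # Reserve one extra slot so unseen tokens still get non-zero probability.
        self.vocab_size = len(shared_vocab) + 1

    def scores(self, seq):
        return {d: m.log_prob(seq, self.vocab_size) for d, m in self.models.items()}

    def predict(self, seq):
        scored = self.scores(seq)
        return max(scored, key=scored.get)


# Toy, made-up sub-word sequences standing in for segmented training sentences.
train = {
    "dialect_A": [["ka", "ta", "ba"], ["ka", "ta", "la"]],
    "dialect_B": [["mi", "ta", "bo"], ["mi", "so", "bo"]],
}
identifier = DialectIdentifier(train)
print(identifier.predict(["ka", "ta", "ba"]))
```

All models share one vocabulary size in the smoothing denominator so that their scores are comparable when the highest one is selected.
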
Incorporating segmentation in the dialect identification framework yielded an improvement
of 5 points over the baseline results. Similar performance was maintained
when the system was applied to out-of-domain test datasets. Therefore, for distinguishing between
closely related languages with morphological differences, such as Arabic dialects, segmentation
could help in extracting frequent language-specific sub-word features and reducing
data sparsity.
| Date of Award | 2018 |
|---|---|
| Original language | American English |
| Awarding Institution | HBKU College of Science and Engineering |
Arabic Dialect Identification
Al-Mannai, K. (Author). 2018
Student thesis: Master's Dissertation