TY - GEN
T1 - Multi-dialect Arabic POS tagging
T2 - 11th International Conference on Language Resources and Evaluation, LREC 2018
AU - Darwish, Kareem
AU - Mubarak, Hamdy
AU - Eldesouki, Mohamed
AU - Abdelali, Ahmed
AU - Samih, Younes
AU - Alharbi, Randah
AU - Attia, Mohammed
AU - Magdy, Walid
AU - Kallmeyer, Laura
N1 - Publisher Copyright: © LREC 2018 - 11th International Conference on Language Resources and Evaluation. All rights reserved.
PY - 2018
Y1 - 2018
N2 - This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects, along with tagging guidelines. The data, which we are releasing publicly, includes 350 tweets for each of the Egyptian, Levantine, Gulf, and Maghrebi dialects, with appropriate train/test/development splits for 5-fold cross-validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross-dialect and joint-dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that can correctly tag four different dialects with an average accuracy of 89.3%.
KW - Arabic dialects
KW - CRF
KW - POS tagging
UR - https://www.scopus.com/pages/publications/85058117794
M3 - Conference contribution
AN - SCOPUS:85058117794
T3 - LREC 2018 - 11th International Conference on Language Resources and Evaluation
SP - 93
EP - 98
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Hasida, Koiti
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Moreno, Asuncion
A2 - Odijk, Jan
A2 - Piperidis, Stelios
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
Y2 - 7 May 2018 through 12 May 2018
ER -