Arabic Diacritization: Stats, Rules, and Hacks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

61 Citations (Scopus)

Abstract

In this paper, we present a new, fast, state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a word-level Viterbi decoder with back-off to stems, morphological patterns, and transliteration- and sequence-labeling-based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules. We achieve low word-level diacritization error rates of 3.29% and 12.77% without and with case endings, respectively, on a new multi-genre, copyright-free test set. We are making the diacritizer available for free for research purposes.
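The word-level Viterbi decoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate table, the probabilities, the penalty constants, and the function names (`candidates`, `transition`, `viterbi`) are all hypothetical, and the transliterated strings are placeholders rather than verified linguistic analyses. The only back-off shown is a simple one (bigram to unigram, and unknown words left undiacritized); the stem, pattern, and named-entity back-offs described in the abstract are not modeled.

```python
import math

# Hypothetical toy model: candidate diacritized forms for each undiacritized word.
CANDIDATES = {
    "ktb": ["kataba", "kutub"],
    "Alwld": ["Alwalad"],
    "drs": ["darasa", "dars"],
}

# Hypothetical log-probabilities, invented for illustration only.
UNIGRAM = {
    "kataba": math.log(0.6),
    "kutub": math.log(0.4),
    "Alwalad": math.log(1.0),
    "darasa": math.log(0.5),
    "dars": math.log(0.5),
}
BIGRAM = {
    ("kataba", "Alwalad"): math.log(0.9),
    ("kutub", "Alwalad"): math.log(0.1),
    ("Alwalad", "dars"): math.log(0.8),
    ("Alwalad", "darasa"): math.log(0.2),
}

BACKOFF_PENALTY = math.log(0.1)  # assumed cost of falling back to the unigram
UNSEEN = math.log(1e-6)          # assumed floor for forms the model never saw


def candidates(word):
    """Return candidate forms; unknown words are left undiacritized."""
    return CANDIDATES.get(word, [word])


def transition(prev, cur):
    """Bigram log-probability, backing off to a penalized unigram."""
    if (prev, cur) in BIGRAM:
        return BIGRAM[(prev, cur)]
    return BACKOFF_PENALTY + UNIGRAM.get(cur, UNSEEN)


def viterbi(words):
    """Return the highest-scoring sequence of diacritized forms."""
    if not words:
        return []
    # best maps each candidate of the current word to (score, best path ending in it).
    best = {c: (UNIGRAM.get(c, UNSEEN), [c]) for c in candidates(words[0])}
    for word in words[1:]:
        new_best = {}
        for cur in candidates(word):
            score, path = max(
                ((s + transition(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["ktb", "Alwld", "drs"]))
```

With these invented numbers, the context of the neighbouring words decides between the competing forms of "ktb" and "drs"; a word missing from the table passes through unchanged.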

Original language: English
Title of host publication: WANLP 2017, co-located with EACL 2017 - 3rd Arabic Natural Language Processing Workshop, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 9-17
Number of pages: 9
ISBN (Electronic): 9781945626449
Publication status: Published - 2017
Event: 3rd Arabic Natural Language Processing Workshop, WANLP 2017 held at EACL 2017 - Valencia, Spain
Duration: 3 Apr 2017 → …

Conference

Conference: 3rd Arabic Natural Language Processing Workshop, WANLP 2017 held at EACL 2017
Country/Territory: Spain
City: Valencia
Period: 3/04/17 → …
