Abstract
We introduce Alma, an open-source, state-of-the-art lemmatizer, POS tagger, and root tagger for Arabic that is both fast and accurate. Alma relies on a dictionary of morphological solutions ordered by their frequency. This dictionary was developed from the Qabas lexicographic database. Unlike many Arabic lemmatizers, which return a lemma after stripping diacritics, shadda, and hamza (i.e., an ambiguous lemma), Alma retrieves unambiguous lemmas, which we call true lemmatization. Our POS tagger uses a rich tagset of 40 POS tags. Additionally, our root tagger is the first fully featured root tagger, since it uses Qabas, the largest Arabic lexicographic database. We evaluated Alma on the LDC Arabic Treebank (ATB), which contains 339,710 tokens, and achieved an 88% F1 score. We also evaluated Alma on the Salma corpus (34k tokens) and obtained a 90% F1 score. Compared with the Farasa, MADAMIRA, and Camelira lemmatizers and POS taggers, Alma outperformed all of them on both tasks, in both speed and accuracy, handling 339k tokens in 10.00. Alma is open-source and publicly available at https://sina.birzeit.edu/alma.
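
The frequency-ordered dictionary lookup described in the abstract can be pictured with a minimal sketch. This is not Alma's actual code or API: the entry layout, the function names `build_index` and `analyze`, the POS labels, and the frequency counts are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy entries: (surface form, lemma, POS, root, frequency).
# The frequencies are invented for illustration, not taken from Qabas.
ENTRIES = [
    ("كتب", "كِتَاب", "NOUN", "ك ت ب", 45),
    ("كتب", "كَتَبَ", "VERB", "ك ت ب", 120),
]


def build_index(entries):
    """Group solutions by surface form and sort each group by frequency."""
    index = defaultdict(list)
    for surface, lemma, pos, root, freq in entries:
        index[surface].append(
            {"lemma": lemma, "pos": pos, "root": root, "freq": freq}
        )
    for solutions in index.values():
        # Most frequent solution first, so a lookup can take element 0.
        solutions.sort(key=lambda s: s["freq"], reverse=True)
    return dict(index)


def analyze(token, index):
    """Return the most frequent solution for a token, or None if unknown."""
    solutions = index.get(token)
    return solutions[0] if solutions else None


index = build_index(ENTRIES)
best = analyze("كتب", index)
print(best["lemma"], best["pos"], best["root"])
```

Because each list is sorted once when the index is built, tagging a token reduces to a single dictionary lookup, which is one plausible reason a design of this kind can be fast.
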
| Original language | English |
|---|---|
| Pages (from-to) | 378-387 |
| Number of pages | 10 |
| Journal | Procedia Computer Science |
| Volume | 244 |
| Publication status | Published - 2024 |
| Externally published | Yes |
| Event | 6th International Conference on AI in Computational Linguistics, ACLing 2024 (Hybrid, Dubai, United Arab Emirates), 21 Sept 2024 to 22 Sept 2024 |
Keywords
- Arabic
- Arabic Morphology
- Lemma
- Lemmatizer
- Morphology Tagging
- Part of Speech
- POS
- POS Tagger
- Root
- Root Tagger