Alma: Fast Lemmatizer and POS Tagger for Arabic

  • Mustafa Jarrar*
  • Diyam Akra
  • Tymaa Hammouda

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-review

2 Citations (Scopus)

Abstract

We introduce Alma, an open-source, state-of-the-art lemmatizer, POS tagger, and root tagger for Arabic that offers both high speed and high accuracy. Alma relies on a dictionary of morphological solutions ordered by the frequency of these solutions. This dictionary was developed based on the Qabas lexicographic database. Unlike many Arabic lemmatizers that return a lemma after stripping diacritics, shadda, and hamza (i.e., an ambiguous lemma), Alma retrieves unambiguous lemmas (which we call true lemmatization). Our POS tagger uses a rich tagset of 40 POS tags. Additionally, our root tagger is the first fully-featured root tagger, since it uses Qabas, the largest Arabic lexicographic database. We evaluated Alma on the LDC Arabic Treebank (ATB), which contains 339,710 tokens, and achieved an 88% F1 score. We also evaluated Alma on the Salma corpus (34k tokens) and obtained a 90% F1 score. Compared to the Farasa, MADAMIRA, and Camelira lemmatizers and POS taggers, Alma outperformed all of them in both tasks, in both speed and accuracy. Alma demonstrated superior processing speed, handling 339k tokens in 10.00. Alma is open-source and publicly available at https://sina.birzeit.edu/alma.
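
The dictionary-based approach described in the abstract can be illustrated with a minimal sketch. It is not Alma's actual implementation: the `Solution` record, the `build_index` and `analyze` helpers, the toy entries, and the frequency counts are all hypothetical and chosen only for illustration. The sketch assumes that each surface word form maps to a list of candidate morphological solutions and that the most frequent one is returned.

```python
from typing import Dict, List, NamedTuple, Optional


class Solution(NamedTuple):
    """One candidate analysis of a word form (hypothetical record layout)."""
    lemma: str  # diacritized, unambiguous lemma
    pos: str    # part-of-speech tag
    root: str   # root letters
    freq: int   # illustrative frequency count, not real corpus data


def build_index(entries: Dict[str, List[Solution]]) -> Dict[str, List[Solution]]:
    """Order the solutions of every word form by descending frequency."""
    return {
        form: sorted(solutions, key=lambda s: s.freq, reverse=True)
        for form, solutions in entries.items()
    }


def analyze(index: Dict[str, List[Solution]], token: str) -> Optional[Solution]:
    """Return the most frequent solution for a token, or None if unknown."""
    solutions = index.get(token)
    return solutions[0] if solutions else None


# Toy dictionary: one undiacritized form with two possible analyses.
TOY_ENTRIES = {
    "كتب": [
        Solution(lemma="كِتَاب", pos="noun", root="ك ت ب", freq=45),
        Solution(lemma="كَتَبَ", pos="verb", root="ك ت ب", freq=120),
    ],
}

INDEX = build_index(TOY_ENTRIES)
```

Because the solutions are sorted once when the index is built, each lookup is a single dictionary access followed by taking the first element, which is consistent with the emphasis on speed in the abstract. Calling `analyze(INDEX, "كتب")` on the toy data returns the verb analysis, since it has the higher illustrative frequency.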

Original language: English
Pages (from-to): 378-387
Number of pages: 10
Journal: Procedia Computer Science
Volume: 244
Publication status: Published - 2024
Externally published: Yes
Event: 6th International Conference on AI in Computational Linguistics, ACLing 2024 - Hybrid, Dubai, United Arab Emirates
Duration: 21 Sept 2024 to 22 Sept 2024

Keywords

  • Arabic
  • Arabic Morphology
  • Lemma
  • Lemmatizer
  • morphology tagging
  • Part of Speech
  • POS
  • POS Tagger
  • Root
  • Root Tagger
