Curras + Baladi: Towards a Levantine Corpus

  • Karim El Haff
  • , Mustafa Jarrar*
  • , Tymaa Hammouda
  • , Fadi Zaraket
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

23 Citations (Scopus)

Abstract

This paper presents two-fold contributions: a full revision of the Palestinian morphologically annotated corpus (Curras), and a newly annotated Lebanese corpus (Baladi). Both corpora can be used as a more general Levantine corpus. Baladi consists of around 9.6K morphologically annotated tokens. Each token was manually annotated with several morphological features and using LDC's SAMA lemmas and tags. The inter-annotator evaluation on most features illustrates 78.5% Kappa and 90.1% F1-Score. Curras was revised by refining all annotations for accuracy, normalization and unification of POS tags, and linking with SAMA lemmas. This revision was also important to ensure that both corpora are compatible and can help to bridge the nuanced linguistic gaps that exist between the two highly mutually intelligible dialects. Both corpora are publicly available through a web portal.

Original languageEnglish
Title of host publication2022 Language Resources and Evaluation Conference, LREC 2022
EditorsNicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
PublisherEuropean Language Resources Association (ELRA)
Pages769-778
Number of pages10
ISBN (Electronic)9791095546726
Publication statusPublished - 2022
Externally publishedYes
Event13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 20 Jun 202225 Jun 2022

Publication series

Name2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/TerritoryFrance
CityMarseille
Period20/06/2225/06/22

Keywords

  • Annotated Corpus
  • Arabic Dialect
  • Arabic morphology
  • Lebanese Arabic
  • Levantine
  • Palestinian Arabic

Fingerprint

Dive into the research topics of 'Curras + Baladi: Towards a Levantine Corpus'. Together they form a unique fingerprint.

Cite this