Using Twitter to Collect a Multi-Dialectal Corpus of Arabic

Hamdy Mubarak, Kareem Darwish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

77 Citations (Scopus)

Abstract

This paper describes the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets. We mapped user location information to one of the Arab countries and extracted tweets that contain one or more dialectal words. Manual evaluation of the extracted corpus shows that the accuracy of assigning tweets to some countries (such as Saudi Arabia and Egypt) is above 93%, while the accuracy for other countries, such as Algeria and Syria, is below 70%.
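The two steps named in the abstract (mapping user locations to countries, then keeping tweets that contain a dialectal word) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lookup tables `LOCATION_TO_COUNTRY` and `DIALECTAL_WORDS`, the function names, and the `(user_location, text)` input shape are all hypothetical, and the tiny sample entries stand in for whatever resources the paper actually used.

```python
# Illustrative sketch only: the tables below are tiny hypothetical samples,
# not the resources used in the paper.
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical map from substrings of free-text user locations to countries.
LOCATION_TO_COUNTRY: Dict[str, str] = {
    "cairo": "Egypt",
    "القاهرة": "Egypt",
    "riyadh": "Saudi Arabia",
    "الرياض": "Saudi Arabia",
    "damascus": "Syria",
    "دمشق": "Syria",
}

# Hypothetical sample of dialectal words (from any dialect).
DIALECTAL_WORDS = {"ازيك", "شلونك", "برشا", "وايد"}


def map_location(user_location: str) -> Optional[str]:
    """Return the country for a user location string, or None if unknown."""
    text = user_location.strip().lower()
    for place, country in LOCATION_TO_COUNTRY.items():
        if place in text:
            return country
    return None


def has_dialectal_word(tweet: str) -> bool:
    """True if any whitespace-separated token is in the dialectal word list."""
    return any(token in DIALECTAL_WORDS for token in tweet.split())


def collect(tweets: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group tweets by mapped country, keeping only those with a dialectal word."""
    corpus: Dict[str, List[str]] = {}
    for user_location, text in tweets:
        country = map_location(user_location)
        if country is not None and has_dialectal_word(text):
            corpus.setdefault(country, []).append(text)
    return corpus


sample = [
    ("Cairo, Egypt", "ازيك يا صاحبي"),  # known location, dialectal word: kept
    ("Riyadh", "مرحبا بكم"),            # known location, no dialectal word: dropped
    ("Mars", "شلونك"),                  # unknown location: dropped
]
corpus = collect(sample)
```

In this sketch a tweet is kept only when both conditions hold, so location strings that match no entry are discarded rather than guessed, which is one plausible reason per-country accuracy could vary with how users fill in their location.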

Original language: English
Title of host publication: ANLP 2014 - EMNLP 2014 Workshop on Arabic Natural Language Processing, Proceedings
Editors: Nizar Habash, Stephan Vogel
Publisher: Association for Computational Linguistics (ACL)
Pages: 1-7
Number of pages: 7
ISBN (Electronic): 9781937284961
Publication status: Published - 2014
Event: EMNLP 2014 Workshop on Arabic Natural Language Processing, ANLP 2014 - Doha, Qatar
Duration: 25 Oct 2014 → …

Conference

Conference: EMNLP 2014 Workshop on Arabic Natural Language Processing, ANLP 2014
Country/Territory: Qatar
City: Doha
Period: 25/10/14 → …
