Wikidata as a Source of Demographic Information

  • Samir Abdaljalil*
  • Hamdy Mubarak
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Names carry important information about our identities and demographics, such as gender, nationality, and ethnicity. We investigate the use of an individual's name, in both Arabic and English, to predict four attributes: country, region, gender, and language. We extract data from Wikidata and normalize it to build a comprehensive dataset of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach, as well as two Transformer-based approaches: fine-tuning a BERT model and using a Transformers pipeline. Our results indicate that gender, language, and region can be predicted from the name alone with an accuracy of over 0.65. The country attribute can be predicted with less accuracy. The Linear SVM approach outperforms the other approaches for all attributes. The best-performing approach was also evaluated on another dataset of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and it yields similar results. We share the datasets used in our experiments, along with an online interface for testing and API access.

Original language: English
Title of host publication: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference
Editors: Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Publisher: Association for Computational Linguistics (ACL)
Pages: 1-10
Number of pages: 10
ISBN (Electronic): 9798891761322
Publication status: Published - 16 Aug 2024
Event: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024 - Bangkok, Thailand
Duration: 16 Aug 2024 → …

Publication series

Name: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference

Conference

Conference: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024
Country/Territory: Thailand
City: Bangkok
Period: 16/08/24 → …
