TY - GEN
T1 - Wikidata as a Source of Demographic Information
AU - Abdaljalil, Samir
AU - Mubarak, Hamdy
N1 - Publisher Copyright: ©2024 Association for Computational Linguistics.
PY - 2024/8/16
Y1 - 2024/8/16
N2 - Names carry important information about our identities and demographics, such as gender, nationality, and ethnicity. We investigate the use of an individual’s name, in both Arabic and English, to predict important attributes, namely country, region, gender, and language. We extract data from Wikidata and normalize it to build a comprehensive dataset of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach and two Transformer-based approaches: BERT model fine-tuning and a Transformers pipeline. Our results indicate that gender, language, and region can be predicted from the name alone with an accuracy of over 0.65. The country attribute is predicted with lower accuracy. The Linear SVM approach outperforms the other approaches on all attributes. The best-performing approach was also evaluated on a second dataset of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and yields similar results. We share the datasets used in our experiments, along with an online interface for testing and API access.
UR - https://www.scopus.com/pages/publications/85204311212
U2 - 10.18653/v1/2024.arabicnlp-1.1
DO - 10.18653/v1/2024.arabicnlp-1.1
M3 - Conference contribution
AN - SCOPUS:85204311212
T3 - ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference
SP - 1
EP - 10
BT - ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference
A2 - Habash, Nizar
A2 - Bouamor, Houda
A2 - Eskander, Ramy
A2 - Tomeh, Nadi
A2 - Farha, Ibrahim Abu
A2 - Abdelali, Ahmed
A2 - Touileb, Samia
A2 - Hamed, Injy
A2 - Onaizan, Yaser
A2 - Alhafni, Bashar
A2 - Antoun, Wissam
A2 - Khalifa, Salam
A2 - Haddad, Hatem
A2 - Zitouni, Imed
A2 - AlKhamissi, Badr
A2 - Almatham, Rawan
A2 - Mrini, Khalil
PB - Association for Computational Linguistics (ACL)
T2 - 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024
Y2 - 16 August 2024
ER -