AraSafe: Benchmarking Safety in Arabic LLMs

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We introduce AraSafe, the first large-scale native Arabic safety benchmark for large language models (LLMs), addressing the pressing need for culturally and linguistically representative evaluation resources. The dataset comprises 12K naturally occurring, human-written Arabic prompts containing both harmful and non-harmful content across diverse domains, including linguistics, social studies, and science. Each prompt was independently annotated by two experts into one of nine fine-grained safety categories, including’Safe/Not Harmful’, ‘Illegal Activities’, ‘Violence or Harm’, ‘Privacy Violation’, and ‘Hate Speech’. Additionally, to support training classifiers for harmful content and due to the imbalanced representation of harmful content in the natural dataset, we create a synthetic dataset of additional 12K harmful prompts generated by GPT-4o via carefully designed prompt engineering techniques. We benchmark a number of Arabic-centric and multilingual models in the 7 to 13B parameter range, including Jais, AceGPT, Allam, Fanar, Llama-3, Gemma-2, and Qwen3, as well as BERT-based fine-tuned classifier models on detecting harmful prompts. GPT-4o was used as an upper-bound reference baseline. Our evaluation reveals critical safety blind spots in Arabic LLMs and underscores the necessity of localized, culturally grounded benchmarks for building responsible AI systems.1

Original languageEnglish
Title of host publicationEMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
EditorsChristos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
PublisherAssociation for Computational Linguistics (ACL)
Pages9976-9992
Number of pages17
ISBN (Electronic)9798891763357
DOIs
Publication statusPublished - Nov 2025
Event30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 20259 Nov 2025

Publication series

NameEMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Country/TerritoryChina
CitySuzhou
Period4/11/259/11/25

Fingerprint

Dive into the research topics of 'AraSafe: Benchmarking Safety in Arabic LLMs'. Together they form a unique fingerprint.

Cite this