TY - GEN
T1 - SAFE
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
AU - Abdaljalil, Samir
AU - Pallucchini, Filippo
AU - Seveso, Andrea
AU - Kurban, Hasan
AU - Mercorio, Fabio
AU - Serpedin, Erchin
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Despite their state-of-the-art capabilities, Large Language Models (LLMs) often suffer from hallucinations, which can compromise their reliability in critical applications. In this work, we propose SAFE, a novel framework for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across four diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
AB - Despite their state-of-the-art capabilities, Large Language Models (LLMs) often suffer from hallucinations, which can compromise their reliability in critical applications. In this work, we propose SAFE, a novel framework for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across four diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
UR - https://www.scopus.com/pages/publications/105028974377
U2 - 10.18653/v1/2025.findings-emnlp.496
DO - 10.18653/v1/2025.findings-emnlp.496
M3 - Conference contribution
AN - SCOPUS:105028974377
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 9335
EP - 9346
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rose, Carolyn
A2 - Peng, Violet
PB - Association for Computational Linguistics (ACL)
Y2 - 4 November 2025 through 9 November 2025
ER -