TY - CONF
T1 - A Systematic Survey and Critical Review on Evaluating Large Language Models
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Laskar, Md Tahmid Rahman
AU - Alqahtani, Sawsan
AU - Bari, M. Saiful
AU - Rahman, Mizanur
AU - Khan, Mohammad Abdullah Matin
AU - Khan, Haidar
AU - Jahan, Israt
AU - Bhuiyan, Md Amran Hossen
AU - Tan, Chee Wei
AU - Parvez, Md Rizwan
AU - Hoque, Enamul
AU - Joty, Shafiq
AU - Huang, Jimmy Xiangji
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024/11
Y1 - 2024/11
AB - Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
UR - https://www.scopus.com/pages/publications/85215453548
DO - 10.48550/arXiv.2407.04069
M3 - Conference contribution
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 13785
EP - 13816
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -