Evaluation of Question Answering Systems: Complexity of Judging a Natural Language

  • Tampere University

Research output: Contribution to journal › Review article › peer-review

Abstract

Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.

Original language: English
Article number: 3744663
Journal: ACM Computing Surveys
Volume: 58
Issue number: 1
Publication status: Published - 30 Aug 2025
Externally published: Yes

Keywords

  • Natural language processing
  • artificial intelligence
  • deep learning
  • evaluation scores
  • neural networks
  • question answering
