TY - JOUR
T1 - PathVLM-Eval
T2 - Evaluation of open vision language models in histopathology
AU - Gilal, Nauman Ullah
AU - Zegour, Rachida
AU - Al-Thelaya, Khaled
AU - Özer, Erdener
AU - Agus, Marco
AU - Schneider, Jens
AU - Boughorbel, Sabri
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/6/5
Y1 - 2025/6/5
N2 - The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, enabling pathologists to use a complementary tool for faster, more comprehensive reporting and efficient healthcare service. In this work, we are interested in benchmarking VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and support professional development initiatives in histopathology. Utilizing VLMEvalKit, a widely used open-source evaluation framework, we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including the LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97%, outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: https://huggingface.co/spaces/gilalnauman/PathVLMs.
AB - The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, enabling pathologists to use a complementary tool for faster, more comprehensive reporting and efficient healthcare service. In this work, we are interested in benchmarking VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and support professional development initiatives in histopathology. Utilizing VLMEvalKit, a widely used open-source evaluation framework, we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including the LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97%, outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: https://huggingface.co/spaces/gilalnauman/PathVLMs.
KW - LLMs benchmarking
KW - Pathology
KW - VLMs
KW - Zero-shot evaluation
UR - https://www.scopus.com/pages/publications/105010028237
U2 - 10.1016/j.jpi.2025.100455
DO - 10.1016/j.jpi.2025.100455
M3 - Article
AN - SCOPUS:105010028237
SN - 2229-5089
VL - 18
JO - Journal of Pathology Informatics
JF - Journal of Pathology Informatics
M1 - 100455
ER -