TY - GEN
T1 - The Art of Saying “Maybe”
T2 - 19th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2026
AU - Azad, Asif
AU - Hossain, Mohammad Sadat
AU - Sadik Hossain Shanto, M. D.
AU - Saifur Rahman, M.
AU - Parvez, Md Rizwan
N1 - Publisher Copyright: © 2026 Association for Computational Linguistics.
PY - 2026
Y1 - 2026
N2 - Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 18 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. For closed-source models lacking token-level logprob access, we develop and validate instruction-guided likelihood proxies. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don’t know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
UR - https://www.scopus.com/pages/publications/105039040886
U2 - 10.18653/v1/2026.findings-eacl.274
DO - 10.18653/v1/2026.findings-eacl.274
M3 - Conference contribution
AN - SCOPUS:105039040886
T3 - 19th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2026
SP - 5185
EP - 5201
BT - 19th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2026
PB - Association for Computational Linguistics (ACL)
Y2 - 24 March 2026 through 29 March 2026
ER -