Abstract
Performance metrics such as accuracy are widely used to evaluate machine learning models, yet they provide only a limited view of model behavior. Two models may achieve identical accuracy while differing substantially in their reliability, for example, when one model degrades under distribution shifts or produces overconfident errors. As machine learning systems are increasingly deployed in safety-critical domains such as medical imaging, these discrepancies highlight the need to move beyond narrow in-distribution performance metrics and toward systematic assessments of robustness and uncertainty. This thesis addresses this need by developing two evaluation frameworks that enable rigorous analyses of model behavior. First, a framework for uncertainty estimation in semantic segmentation is developed. Here, different uncertainty types are explicitly modeled in the data and metrics, key components that contribute to the performance of uncertainty methods are identified, and evaluation is performed across diverse downstream tasks. Using this framework, two empirical studies reveal that (1) the separation between uncertainty types, while achievable in controlled settings, is not guaranteed for real-world data; (2) theoretical intuitions guide the choice of uncertainty measures, but dataset properties also influence outcomes; and (3) ensembles provide a robust choice across different datasets and tasks. Second, a framework for robustness evaluation of Vision Language Models (VLMs) in medical visual question answering is developed. This framework incorporates realistic distribution shifts, semantically meaningful evaluation metrics, and sanity baselines that contextualize model performance. Applied in an empirical robustness study, the framework shows that (1) no fine-tuning method consistently outperforms others; (2) LoRA provides stable in-distribution performance; and (3) robustness varies more across distribution shifts than across fine-tuning methods.
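The abstract mentions separating uncertainty types with ensembles. A standard way to do this, which the thesis may or may not use in exactly this form, is the entropy decomposition of an ensemble's predictions: total predictive entropy splits into an expected-entropy term (often read as aleatoric, i.e., data uncertainty) and a mutual-information term (often read as epistemic, i.e., model uncertainty). A minimal sketch for a single pixel's class probabilities, with hypothetical shapes and function names:

```python
import numpy as np

def uncertainty_decomposition(probs):
    """Decompose ensemble predictive uncertainty for one pixel.

    probs: array of shape (M, C) -- softmax outputs of M ensemble
    members over C classes (shapes are illustrative, not the thesis API).
    Returns (total, aleatoric, epistemic) in nats.
    """
    eps = 1e-12  # guard against log(0)
    mean_p = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -np.sum(mean_p * np.log(mean_p + eps))
    # Aleatoric estimate: average entropy of each member's prediction.
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Epistemic estimate: mutual information between prediction and member.
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes; when confident members disagree, the total entropy is high while each member's own entropy is low, so the disagreement is attributed to the epistemic term. This matches the abstract's caveat that such a separation is clean in controlled settings but not guaranteed on real-world data.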
Together, these frameworks advance the evaluation of trustworthy machine learning by providing structured methodologies for assessing robustness and uncertainty. Their general design principles and empirical findings offer guidance for practitioners and form a foundation for future research in developing reliable and transparent models for high-stakes applications.
| Document type: | Dissertation |
|---|---|
| Supervisor: | Maier-Hein, Prof. Dr. Klaus |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 24 April 2026 |
| Date Deposited: | 28 Apr 2026 10:50 |
| Date: | 2026 |
| Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Department of Computer Science |
| DDC-classification: | 004 Data processing & computer science |