Abstract
Performance metrics such as accuracy are widely used to evaluate machine learning models, yet they provide only a limited view of model behavior. Two models may achieve identical accuracy while differing substantially in their reliability, for example, when one model degrades under distribution shifts or produces overconfident errors. As machine learning systems are increasingly deployed in safety-critical domains such as medical imaging, these discrepancies highlight the need to move beyond narrow in-distribution performance metrics and toward systematic assessments of robustness and uncertainty. This thesis addresses this need by developing two evaluation frameworks that enable rigorous analyses of model behavior. First, a framework for uncertainty estimation in semantic segmentation is developed. Here, different uncertainty types are explicitly modeled in the data and metrics, key components that contribute to the performance of uncertainty methods are identified, and evaluation is performed across diverse downstream tasks. Using this framework, two empirical studies reveal that (1) the separation between uncertainty types, while achievable in controlled settings, is not guaranteed for real-world data; (2) theoretical intuitions guide the choice of uncertainty measures, but dataset properties also influence outcomes; and (3) ensembles provide a robust choice across different datasets and tasks. Second, a framework for robustness evaluation of Vision Language Models (VLMs) in medical visual question answering is developed. This framework incorporates realistic distribution shifts, semantically meaningful evaluation metrics, and sanity baselines that contextualize model performance. Applied in an empirical robustness study, the framework shows that (1) no fine-tuning method consistently outperforms others; (2) LoRA provides stable in-distribution performance; and (3) robustness varies more across distribution shifts than across fine-tuning methods.
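The abstract mentions separating uncertainty types with ensembles. A standard way to do this, which the thesis may or may not use in exactly this form, is the entropy decomposition of an ensemble's predictions: total predictive entropy splits into an expected-entropy term (often read as aleatoric, i.e., data uncertainty) and a mutual-information term (often read as epistemic, i.e., model uncertainty). A minimal sketch for a single pixel's class probabilities, with hypothetical shapes and function names:

```python
import numpy as np

def uncertainty_decomposition(probs):
    """Decompose ensemble predictive uncertainty for one pixel.

    probs: array of shape (M, C) -- softmax outputs of M ensemble
    members over C classes (shapes are illustrative, not the thesis API).
    Returns (total, aleatoric, epistemic) in nats.
    """
    eps = 1e-12  # guard against log(0)
    mean_p = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -np.sum(mean_p * np.log(mean_p + eps))
    # Aleatoric estimate: average entropy of each member's prediction.
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Epistemic estimate: mutual information between prediction and member.
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes; when confident members disagree, the total entropy is high while each member's own entropy is low, so the disagreement is attributed to the epistemic term. This matches the abstract's caveat that such a separation is clean in controlled settings but not guaranteed on real-world data.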
Together, these frameworks advance the evaluation of trustworthy machine learning by providing structured methodologies for assessing robustness and uncertainty. Their general design principles and empirical findings offer guidance for practitioners and form a foundation for future research in developing reliable and transparent models for high-stakes applications.
| Document type: | Dissertation |
|---|---|
| Supervisor: | Maier-Hein, Prof. Dr. Klaus |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 24 April 2026 |
| Date Deposited: | 28 Apr 2026 10:50 |
| Date: | 2026 |
| Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Department of Computer Science |
| DDC-classification: | 004 Data processing & computer science |