PDF, English | Download (53 MB) | License: Creative Commons Attribution-NonCommercial 4.0
Abstract
Artificial Intelligence (AI) is transforming science, industry, and daily life at an unprecedented pace. Foundation Models (FMs) now shape how we search, create, and interact with information across domains from finance to healthcare. Yet, paradoxically, most AI research produced in academic and industrial labs remains confined to benchmark performance and paper metrics. The rapid growth in computational resources and in the size of the research community has fueled an unprecedented surge in publications, with venues such as the Conference on Neural Information Processing Systems (NeurIPS) growing at an annual rate of approximately 29% since 2017. This apparent progress is reflected in the continual stream of state-of-the-art (SOTA) results reported in the majority of publications. Yet, despite the vast number of proposed methods and the accompanying media attention, only a small fraction of them translates into real-world impact. This discrepancy raises a fundamental question: how can methodological research better align with the goal of practical translation?
This thesis addresses overlooked elements of the model development pipeline that critically influence whether research outputs achieve translation. We investigate tasks situated upstream of model training, such as labeling instructions and internal Quality Assurance (QA), as well as downstream tasks, including user-centric benchmarking. These questions are particularly urgent in healthcare, where methodological weaknesses can translate into risks to patient safety. To illustrate this challenge, we focus on surgery as a representative high-stakes domain. Across these studies, we demonstrate that design choices in these stages can, and often do, exert a greater impact on model performance and applicability than many architectural innovations. Building on our analysis, we propose new methods and best practices for sustainable research and reliable translation.
First, we demonstrate that the majority of biomedical imaging projects do not provide any labeling instructions, even though annotators consistently consider them crucial. In the largest study on labeling instructions to date, to our knowledge, we show that visual examples are key to improving annotation quality, whereas merely extending textual descriptions yields no measurable benefit and can even be detrimental. To address these shortcomings, we provide best practices for involving annotators in both the annotation process and the construction of labeling instructions. Following a comprehensive review by the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) Special Interest Group for Biomedical Image Analysis Challenges, future challenges are now required to adhere to stricter standards, including the mandatory publication of labeling instructions.
Second, we turn our focus to internal QA, which at the time was standard practice in the annotation workflows of professional annotation companies. We demonstrate that this additional step yields only marginal improvements in annotation quality. For resource-constrained scenarios, we further show that investing in labeling instructions provides greater benefit than additional QA. To guide more effective QA, we developed a statistical model and identified generalizable image characteristics for which QA is most beneficial, ultimately enabling users to apply QA selectively to images with a high likelihood of improvement instead of processing every image.
Third, we show that, given the growing number of benchmarks, domain users must carefully select which Vision-Language Model (VLM) to deploy. To support users in domain-specific benchmarking, we developed DomainBench, a framework for turning user-selected image datasets into diverse and scalable VLM benchmarks. To help the community adopt our framework, we released seven new datasets covering a wide range of domains, from kitchen environments to animal scenes. To complement the generation process, we introduced a new metric, Accuracy%(t), which accounts for shared base images across different tasks, a dependency that previous work has neglected.
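The abstract does not define Accuracy%(t), so the following is a minimal, purely illustrative sketch of one plausible reading: for a threshold t, report the fraction of base images for which at least a fraction t of the questions derived from that image are answered correctly, so that many questions sharing one base image are not counted as independent samples. The function name and the aggregation rule below are assumptions for illustration, not the thesis's actual formula.

```python
import numpy as np

def accuracy_pct(results, t):
    """Hypothetical sketch of a threshold-style metric over shared base images.

    results: dict mapping base-image id -> list of per-question booleans
             (True = the VLM answered that question correctly).
    t:       required fraction of correct answers per base image.
    Returns the fraction of base images meeting the threshold. This
    aggregation rule is an assumption, not the definition from the thesis.
    """
    per_image = [np.mean(correct) >= t for correct in results.values()]
    return float(np.mean(per_image))

# Example: three questions share base image "img_01"; naive per-question
# accuracy would weight that image three times, while the per-image
# aggregation counts it once.
results = {
    "img_01": [True, True, False],
    "img_02": [True],
    "img_03": [False, True],
}
print(accuracy_pct(results, t=0.5))  # fraction of images with >= 50% correct
```

Whatever its exact form, grouping by base image before aggregating is what prevents tasks built on the same underlying image from inflating the score.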
Lastly, we apply DomainBench to the surgical domain, where robust translation of AI methods is especially critical, and provide the first comprehensive assessment of VLMs in surgical imaging by benchmarking 15 SOTA VLMs on more than 150,000 question-answer pairs. We demonstrate that current VLMs can handle elementary surgical perception tasks but struggle with questions that require medically informed reasoning. We further provide a critical comparison of medical versus generalist VLMs, revealing that specialized medical VLMs are outperformed by their generalist counterparts. Our findings emphasize the need for future work on knowledge integration strategies that extend beyond conventional finetuning.
Together, these contributions highlight that translation in Machine Learning (ML) depends less on incremental architectural advances than on the careful design of the model development pipeline. By addressing labeling instructions, internal QA, user-centric benchmarking, and domain-specific validation, this thesis provides both methodological insights and practical frameworks that strengthen the reliability and applicability of ML models. In doing so, it contributes to aligning methodological progress with real-world impact across natural and medical imaging, and ultimately to improving scientific standards in the field.
| Document type: | Dissertation |
|---|---|
| Supervisor: | Maier-Hein, Prof. Dr. Lena |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 21 January 2026 |
| Date Deposited: | 10 Mar 2026 10:08 |
| Date: | 2026 |
| Faculties / Institutes: | Fakultät für Ingenieurwissenschaften > Dekanat der Fakultät für Ingenieurwissenschaften |
| DDC-classification: | 000 Generalities, Science; 004 Data processing, Computer science; 500 Natural sciences and mathematics; 610 Medical sciences, Medicine |