Abstract
Radiology is at the forefront of adopting artificial intelligence (AI) solutions in clinical practice because the steadily increasing demand for examinations based on medical imaging exceeds the growth of the human workforce. Semantic segmentation is an important component of image analysis pipelines, with applications in computer-aided diagnosis, radiation therapy planning, and disease monitoring. Today, deep learning (DL) algorithms can automatically segment various anatomical structures when trained on appropriately annotated datasets. However, these algorithms are not perfect and are especially prone to errors on data whose characteristics differ from those of the training data. This discrepancy between training and testing data is called distribution shift and frequently occurs when models are deployed in new hospitals. In this thesis, benchmarks were developed for methods that improve the robustness of segmentation algorithms to such distribution shifts. Two complementary approaches were studied: methods that directly improve out-of-distribution generalization, and methods that know when they are wrong (failure detection).

Generalization methods were benchmarked in this thesis by organizing an international competition, also known as a challenge. Such challenges are the gold standard in medical image analysis for comparing state-of-the-art algorithms because they provide standardized, fair conditions for all participants. While many competitions are organized each year, they usually rely on research datasets that originate from a small set of institutions and scanners. It is therefore unknown how well algorithms generalize to more diverse multicentric data with the distribution shifts that arise in the real world. This thesis introduces the idea of using federated data in the competition setting, which significantly lowers the barrier to contributing data, as the data never leaves the institution where it was acquired. To perform a federated evaluation, the segmentation algorithms are sent to the institutions in the federation, and performance results are communicated back for robustness analysis. This concept of federated evaluation benchmarks was implemented in a competition for the task of brain tumor segmentation, the Federated Tumor Segmentation (FeTS) Challenge. As the first federated challenge, it revealed and partially addressed practical hurdles associated with federated evaluation, notably the high organizational effort, the increased difficulty of annotation quality control compared to conventional challenges, and the constraints on the challenge analysis imposed by the lack of direct access to federated data. It also highlighted the potential of federated benchmarks to increase dataset size and diversity considerably, exemplified by the FeTS test dataset, to which 32 international institutions contributed 2625 cases with multi-parametric magnetic resonance imaging (MRI) scans. Evaluating the 41 segmentation models submitted to the competition on this test data showed that they achieved good average-case generalization but lacked worst-case robustness at 13 of the 32 institutions.
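The federated evaluation loop described above can be summarized in a short sketch. This is a minimal illustration, not the actual FeTS Challenge infrastructure; the `Institution.evaluate` interface and the Dice-based per-case metric are assumptions made for clarity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    institution: str
    case_id: str
    dice: float  # per-case overlap between prediction and reference annotation


def federated_evaluation(model, institutions) -> list[CaseResult]:
    """Ship the model to each institution; only metrics travel back."""
    results: list[CaseResult] = []
    for site in institutions:
        # evaluate() runs inside the institution's own infrastructure,
        # so the imaging data never leaves the site (hypothetical interface).
        results.extend(site.evaluate(model))
    return results


def robustness_summary(results: list[CaseResult]) -> dict:
    """Contrast average-case and worst-case generalization across sites."""
    per_site: dict[str, list[float]] = {}
    for r in results:
        per_site.setdefault(r.institution, []).append(r.dice)
    site_means = {site: mean(scores) for site, scores in per_site.items()}
    return {
        "average_case": mean(site_means.values()),
        "worst_case": min(site_means.values()),
        "worst_site": min(site_means, key=site_means.get),
    }
```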
Failure detection is important for the reliability of segmentation methods in practice and has therefore been studied from many perspectives, including uncertainty estimation, out-of-distribution detection, and segmentation quality estimation. Progress in method development is currently hindered for two reasons: the evaluation protocols of these approaches differ, which makes it difficult to compare methods that pursue the same goal of failure detection, and novel methods have often been evaluated on only a single segmentation task (anatomical region) or without considering distribution shifts, leaving questions about their generalizability unanswered. The second part of this thesis addresses these shortcomings by developing an evaluation protocol based on risk-coverage analysis, which allows all relevant failure detection methods to be compared while avoiding pitfalls of current practice. A benchmark was designed that implements the proposed evaluation and compares several diverse failure detection methods in experiments on multiple public datasets containing realistic distribution shifts. The benchmark results provided insights into how pixel-level uncertainties can be aggregated effectively into image-level uncertainties for failure detection. Moreover, a simple existing method was identified as a strong baseline for future research, as it consistently outperformed more complicated algorithms across datasets. Owing to its flexibility and efficiency, it can easily be adapted to new segmentation tasks and practical applications.
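To make the risk-coverage protocol concrete, the sketch below shows one way to aggregate a pixel-wise uncertainty map into an image-level score and to compute a risk-coverage curve together with its area (AURC). The function names, the mean aggregation, and the use of 1 − Dice as per-case risk are illustrative assumptions, not the benchmark's actual API.

```python
import numpy as np


def image_uncertainty(pixel_uncertainty: np.ndarray) -> float:
    """Aggregate a pixel-wise uncertainty map (e.g., predictive entropy)
    into one image-level score; taking the mean is one simple choice."""
    return float(pixel_uncertainty.mean())


def risk_coverage_curve(risks: np.ndarray, uncertainties: np.ndarray):
    """Sort cases from most to least confident and report the average risk
    among the retained fraction of cases (the coverage)."""
    order = np.argsort(uncertainties)  # most confident first
    sorted_risks = risks[order]
    n = len(risks)
    coverage = np.arange(1, n + 1) / n
    selective_risk = np.cumsum(sorted_risks) / np.arange(1, n + 1)
    return coverage, selective_risk


def aurc(dice_scores, uncertainties) -> float:
    """Area under the risk-coverage curve; lower values mean that high
    uncertainty flags the failing cases more reliably."""
    risks = 1.0 - np.asarray(dice_scores, dtype=float)
    coverage, selective_risk = risk_coverage_curve(
        risks, np.asarray(uncertainties, dtype=float)
    )
    return float(np.trapz(selective_risk, coverage))
```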
In conclusion, large-scale benchmarking studies were conducted in this thesis that test state-of-the-art generalization and failure detection algorithms in scenarios simulating real-world deployment. The experiments demonstrated how multicentric data can be employed, in centralized and federated form, to evaluate robustness to distribution shifts, reveal common failure sources, and identify practical algorithms that generalize to new hospitals and abstain from uncertain predictions. The code for both benchmarks is made available to the community to foster meaningful method comparison and progress in robust medical image segmentation algorithms.

| Document type: | Dissertation |
|---|---|
| Supervisor: | Maier-Hein, Prof. Dr. rer. nat. Klaus |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 26 November 2025 |
| Date Deposited: | 10 Feb 2026 10:03 |
| Date: | 2025 |
| Faculties / Institutes: | Medizinische Fakultät Heidelberg > Dekanat der Medizinischen Fakultät Heidelberg; Service facilities > German Cancer Research Center (DKFZ) |
| DDC-classification: | 004 Data processing, computer science; 600 Technology (Applied sciences) |
| License: | Creative Commons Attribution 4.0 |