Abstract
A vast amount of German clinical data continues to be stored in unstructured doctor's letters. To make these data available for clinical routine and research, this thesis develops and rigorously evaluates on-premise methods for medical information extraction (MIE) from such letters, converting free-text documents into transparent structured data. Extraction systems must be developed and deployed entirely inside the clinical infrastructure and must produce trustworthy outputs. At the start of the project, no distributable German corpus existed and only CPU resources were available. As mid-class GPUs became available, and guided by performance-efficiency trade-offs, our approach progressed from supervised encoders to prompt-tuned encoders and finally to PEFT-optimized LLMs. Throughout this thesis, we address strict real-world clinical constraints: limited domain expertise, staff time, and compute resources, native-language barriers, and strong transparency requirements.
In the first part, we introduce CARDIO:DE, the first distributable German clinical routine corpus, comprising 500 de-identified cardiology doctor's letters with two high-quality annotation layers (paragraph-level section classes and token-level medication information). The corpus was collected and prepared entirely inside the clinical infrastructure; it thus provides a study template for other clinics and supports transparent, reproducible research in German clinical NLP. The corpus, together with strong baselines, serves as the data foundation for all experiments in this thesis.
In the second part, as mid-class GPUs became available, we evaluated prompt-tuned encoders for multi-class section classification on CARDIO:DE using pattern-exploiting training (PET). We systematically compare general-domain German BERTs (110M and 340M parameters) with domain-/task-adapted and clinical variants. Domain- and task-adapted models consistently outperform general-domain and clinical models in few-shot settings, and PET outperforms traditional supervised encoders with only 20 shots. Using a larger encoder and adding context further closes the gap to full-data supervision. We combine PET with efficient prompting and contextualization to reduce the demands on domain expertise and staff time in a native-language clinical environment. Shapley value attributions support training-data selection and error analysis, improving transparency. Under clinical constraints, compact encoders suffice for most section classes, while larger encoders help with complex sections. Further pretraining on local texts benefits general-domain encoders but not clinical ones. Overall, PET is a resource-efficient, interpretable method for native-language section classification.
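The core idea behind PET can be illustrated with a minimal pattern-verbalizer sketch: a classification instance is rewritten as a cloze sentence, and each label is mapped to a verbalizer token whose score at the `[MASK]` position decides the class. The pattern, labels, verbalizers, and toy scorer below are illustrative assumptions, not the configuration used in the thesis; in practice a (further-pretrained) German BERT scores the mask.

```python
# PET-style pattern-verbalizer sketch for section classification.
# Labels and verbalizers are hypothetical; a masked LM would score [MASK].
from typing import Callable, Dict

# Each label is verbalized as a single German token the masked LM can predict.
LABELS: Dict[str, str] = {
    "anamnesis": "Anamnese",
    "medication": "Medikation",
    "summary": "Beurteilung",
}

def pattern(paragraph: str) -> str:
    # Cloze pattern: the model must fill [MASK] with one verbalizer token.
    return f"{paragraph} Dieser Abschnitt ist: [MASK]."

def classify(paragraph: str, mask_scores: Callable[[str, str], float]) -> str:
    # Score each verbalizer token at the [MASK] position; pick the best label.
    cloze = pattern(paragraph)
    return max(LABELS, key=lambda lab: mask_scores(cloze, LABELS[lab]))

# Toy stand-in for a masked LM: keyword overlap (illustration only).
def toy_scores(cloze: str, token: str) -> float:
    keywords = {
        "Anamnese": ["Beschwerden"],
        "Medikation": ["mg", "täglich"],
        "Beurteilung": ["empfehlen"],
    }
    return float(sum(cloze.count(k) for k in keywords[token]))

print(classify("ASS 100 mg täglich.", toy_scores))  # → medication
```

Swapping the toy scorer for a real masked-LM head is the only change needed to turn this sketch into few-shot PET training data generation.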
In the third part, as token-level tasks exceeded the capabilities of prompt-tuned encoders and more advanced GPUs became available, we define medication information extraction as a one-step end-to-end task that extracts medication mentions and links each to further attributes (strength, frequency, reason, etc.). We fine-tune open-source Llama models (8B and 70B) with parameter-efficient methods and format-restricting prompts on English and German (CARDIO:DE) corpora and compare against zero-shot and encoder baselines. A feedback LLM supports the validation of uncertain predictions. Llama 70B achieves a new state of the art on English and provides the first benchmark for German; Llama 8B offers the best performance-efficiency trade-off. PEFT with format restriction reduces hallucinations and malformed outputs and simplifies evaluation. Shapley attributions reveal how the input contributes to the structured output. Overall, our approach minimizes expert and staff time demands, keeps compute demands modest, and improves transparency.
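The format-restriction side of such a pipeline can be sketched as a JSON-lines output contract plus a validator that accepts well-formed medication records and flags malformed or incomplete ones for a feedback pass. The field names and schema below are hypothetical assumptions for illustration, not the schema used in the thesis.

```python
# Sketch of format-restricted output handling: the LLM is prompted to emit
# one JSON object per medication mention; malformed or incomplete lines are
# flagged for validation by a feedback LLM. Schema is illustrative only.
import json

REQUIRED = {"drug"}                              # a record must name the drug
OPTIONAL = {"strength", "frequency", "reason"}   # linked attributes

def parse_prediction(raw: str):
    """Return (records, flagged); flagged lines go to the feedback LLM."""
    records, flagged = [], []
    for line in raw.strip().splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            flagged.append(line)  # malformed JSON
            continue
        if (isinstance(obj, dict)
                and REQUIRED <= obj.keys()
                and obj.keys() <= REQUIRED | OPTIONAL):
            records.append(obj)
        else:
            flagged.append(line)  # missing drug name or unknown field
    return records, flagged

raw = ('{"drug": "Apixaban", "strength": "5 mg", "frequency": "1-0-1"}\n'
       '{"strength": "100 mg"}')
recs, flags = parse_prediction(raw)
```

Restricting the output format this way is what makes evaluation simple: every accepted line maps directly onto a structured database row.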
Finally, we deploy the pipeline on unseen German data in two clinical applications: (i) detecting the expected guideline-driven shift in oral anticoagulation from vitamin K antagonists (VKAs, e.g. phenprocoumon) to direct oral anticoagulants (DOACs, e.g. apixaban) between 2012 and 2021 (DOACs: 16.9% to 59.9%; VKAs: 37.7% to 9.9%), and (ii) quantifying polypharmacy in longitudinal letters (2008-2016) from a 20-patient cohort, where 75% of letters list more than 5 and 44% list more than 10 distinct medications; at the patient level, 80% of patients exceeded 10 distinct medications at some point, often for years. These findings show that our on-premise MIE models generalize to unseen letters and can support downstream clinical analyses.
Under strict on-premise and transparency constraints, we evaluate evolving NLP methods on real-world German and English data and derive a resource-aware guideline for MIE: use prompt-tuned, further-pretrained encoders for native-language section classification, and PEFT-optimized, format-restricted LLMs for complex token-level tasks; combine both with Shapley-based attributions and feedback LLMs to support transparency and evaluation. In a clinical environment, our models generalize to unseen letters, recover guideline-driven anticoagulation shifts, and quantify letter- and patient-level polypharmacy, indicating clinical applicability. We expect the contributions presented in this thesis to foster on-premise, transparent clinical NLP research in lower-resource settings and to support the development of reliable MIE systems.
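The attribution idea underlying the guideline can be illustrated with an exact Shapley computation over input tokens: each token's attribution is its average marginal contribution to a score across all subsets of the other tokens. The tokens and toy value function below are hypothetical; in practice the score comes from the trained model, and because exact computation is exponential in the input length, practical tools approximate it.

```python
# Exact Shapley attributions over input tokens for a toy scoring function.
# Illustrative only: real attributions use the trained model's score and
# sampling-based approximation for longer inputs.
from itertools import combinations
from math import factorial

def shapley(tokens, value):
    """phi[i] = average marginal contribution of token i over all subsets."""
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):  # subset sizes 0 .. n-1
            for S in combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy value function: score 1.0 iff the subset contains the drug-name token.
tokens = ["Patient", "erhält", "Apixaban"]
value = lambda S: 1.0 if 2 in S else 0.0
phi = shapley(tokens, value)
print([round(p, 6) for p in phi])  # → [0.0, 0.0, 1.0]
```

As expected, the full attribution mass lands on the drug-name token, which is the kind of evidence such attributions surface when auditing a structured prediction.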
| Document type: | Dissertation |
|---|---|
| Supervisor: | Frank, Prof. Dr. Anette |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 16 March 2026 |
| Date Deposited: | 04 May 2026 10:29 |
| Date: | 2026 |
| Faculties / Institutes: | Neuphilologische Fakultät > Institut für Computerlinguistik |
| Controlled Keywords: | Natural language processing, Large language models, Interpretability, Medical information extraction, Prompt tuning, Cardiology, Medical corpus, Doctor's letters |