
WeLT: Weighted Loss Trainer for Biomedical Joint Entity and Relation Extraction

Mobasher, Ghadeer

PDF, English (main document), 11 MB
License: Creative Commons Attribution-NonCommercial 4.0


Abstract

The exponential growth of unstructured textual data has emphasised the need for Information Extraction (IE) to transform raw text into actionable knowledge. IE involves automatically identifying and categorising relevant entities, relationships, and events within large text corpora. The ability to extract pertinent information from vast and complex datasets automatically and accurately has profound implications, from advancing personalised medicine and clinical research to enhancing the efficiency of information flow in news and media outlets. Pre-annotations generated by IE systems help alleviate the labour-intensive workload of data annotators by automating the initial labelling of entities, relationships, and events. This automation reduces the need for manual identification, allowing annotators to focus on verifying and refining the pre-annotated data, which significantly speeds up the annotation process.

Supervised learning, one of the primary IE approaches, involves using labelled datasets to train models. Consequently, domain experts devote considerable effort to curating gold-standard datasets. However, real-world data frequently exhibit class imbalance, which remains a significant challenge in IE: more frequent majority classes often overshadow minority classes that represent rare but critical entities. This imbalance leads to degraded performance, particularly in recognising and extracting under-represented classes.

Current literature offers several approaches to mitigate class imbalance, such as undersampling, oversampling, and static weighting loss. However, these methods have notable drawbacks. Oversampling can lead to over-fitting, while undersampling risks discarding valuable data. Fixed weighting loss schemes require extensive manual hyper-parameter tuning, which is time-consuming and often fails to adapt to the unique characteristics of a dataset. These approaches do not address the core issue: the need for the model to adaptively learn from the natural class distribution without biasing its performance towards majority classes.

In response to these limitations, this thesis introduces the Weighted Loss Trainer (WeLT), a novel adaptive loss function designed to address class imbalance. WeLT adjusts class weights based on the relative frequency of each class within the dataset, ensuring that misclassifications of minority classes are penalised more heavily. This approach allows the model to remain sensitive to minority classes without requiring extensive manual tuning or compromising data integrity.
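The abstract does not give WeLT's exact weighting formula, so the sketch below is only illustrative: it derives per-class weights from the relative label frequencies of a training set and plugs them into a standard weighted cross-entropy loss in PyTorch. The "1 minus relative frequency" scheme and all names in the snippet are assumptions made for the example, not the thesis's actual implementation.

```python
# Illustrative sketch of frequency-derived class weighting for token classification.
# The weighting scheme below (1 - relative frequency) is an assumption for this
# example; the thesis may use a different formula.
from collections import Counter
import torch
import torch.nn as nn

def frequency_based_weights(labels, num_classes):
    """Derive per-class weights from the label distribution.

    Rare classes receive larger weights, so their misclassifications
    contribute more to the loss than those of the majority class.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    weights = torch.ones(num_classes)
    for cls in range(num_classes):
        freq = counts.get(cls, 0) / total
        # Hypothetical scheme: down-weight frequent classes, emphasise rare ones.
        weights[cls] = 1.0 - freq
    return weights

# Example: a heavily imbalanced NER-style label set
# (0 = "O", 1 = entity class A, 2 = entity class B).
train_labels = [0] * 950 + [1] * 40 + [2] * 10
class_weights = frequency_based_weights(train_labels, num_classes=3)

# Plug the weights into a standard cross-entropy loss; during fine-tuning this
# weighted loss would replace the unweighted default.
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 3)           # (batch, num_classes) dummy predictions
targets = torch.randint(0, 3, (8,))  # dummy gold labels
loss = criterion(logits, targets)
print(class_weights, loss.item())
```

Because the weights follow directly from the observed label counts, this kind of scheme requires no manual hyper-parameter tuning and adapts automatically to each dataset's class distribution.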

Evaluations conducted on gold-standard datasets, including biomedical and newswire datasets, focused on Named Entity Recognition (NER) and Joint Named Entity Recognition and Relation Extraction (JNERE). Specifically, WeLT was tested on two JNERE paradigms: (a) span-based and (b) table-filling approaches. Additionally, the impact of WeLT NER on Named Entity Linking was compared to vanilla NER methods that neglect class imbalance. Our experiments demonstrate that WeLT effectively addresses class imbalance issues, outperforming traditional fine-tuning approaches and showing advantages over existing weighting loss schemes.

Document type: Dissertation
First referee: Gertz, Prof. Dr. Michael
Place of publication: Heidelberg
Date of oral examination: 23 January 2025
Date created: 28 Jan 2025 06:40
Year of publication: 2025
Institutes/Facilities: Fakultät für Mathematik und Informatik > Institut für Informatik
DDC subject area: 004 Computer science
Controlled keywords: Information Extraction, Class Imbalance