The Generalization Ability of Coreference Resolution Systems

Chai, Haixia



DOI: 10.11588/heidok.00037927
URN: urn:nbn:de:bsz:16-heidok-379272

Abstract

Coreference resolution is an intermediate NLP task for natural language understanding, supporting many real-world downstream tasks such as information extraction, summarization, machine translation, and dialogue. Recent end-to-end systems have improved coreference resolution performance, with a focus on the CoNLL benchmark (Pradhan et al., 2012), a dataset developed more than a decade ago. However, this does not necessarily mean that these systems can be applied effectively to downstream tasks. Text in downstream tasks may come in different genres (e.g., conversation and news), in different languages (e.g., English and German), from different domains (e.g., scientific literature and social media posts), and in different conditions (e.g., standard text and noisy text). It is not entirely clear how well coreference resolvers perform in these complex practical scenarios, or how their resolution ability can be enhanced.

In this thesis, we study the generalization ability of coreference resolution systems from various perspectives, as outlined below.

First, we investigate multilingual coreference resolution by using universal annotations. Our study involves (1) a corpus-based examination of coreference at different linguistic levels, namely the mention, entity, and document levels, across multiple languages, (2) an error analysis of the most challenging cases that multilingual coreference resolution systems fail to resolve, and (3) the integration of linguistic features from universal morphosyntactic annotations into a baseline system to assess their potential benefits for the task. We find that there are indeed commonalities across languages. For example, we observe a common pattern in which the closest antecedent of an overt pronoun mainly occupies the subject or object position. Additionally, a common issue encountered in all languages by multilingual coreference resolution systems is the difficulty of correctly detecting nominal nouns within some two-mention entities.

Second, we present a neural coreference model incorporating discourse structure information derived from centering theory. The model captures the centering transition relationships between sentences. Each sentence is encoded together with all of its neighbouring sentences in a weighted graph. Our approach outperforms the state-of-the-art baseline with an F1 score of 80.9. In particular, it helps resolve pronouns in long documents, text in formal genres, and clusters with scattered mentions.

Finally, we evaluate coreference resolution systems in two ways: (1) through extrinsic and intrinsic evaluation on a community-based question answering (CQA) task, and (2) by creating a new dataset, derived from the CoNLL dataset, that contains noisy text. For the extrinsic evaluations, we use coreference resolvers to decontextualize the individual sentences of candidate answers. For the intrinsic evaluations, we annotate a subset of the CQA data with coreference relations. Our extrinsic evaluations suggest that, although there is a significant gap between the performance of the state-of-the-art coreference resolver and that of the rule-based system on coreference datasets, the rule-based system has a more consistent and positive impact on CQA, while the impact of the state-of-the-art model can vary considerably depending on the domain of the downstream data. Our intrinsic evaluations suggest that there is a discrepancy between the rankings produced by existing coreference resolution evaluation metrics and the rankings resulting from the extrinsic evaluations. This suggests that intrinsic evaluation on CoNLL should be accompanied by extrinsic evaluation to approximate the utility of coreference resolvers for downstream tasks. In the second evaluation method, the created dataset substantially reduces mention overlap across the entire dataset and exposes the limitations of published resolvers in two respects: lexical inference ability and understanding of low-level orthographic noise. Our experiments show that published resolvers fail to link coreferent mentions that involve minor low-level noise and lexical changes.
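
To illustrate what decontextualizing a sentence with coreference output can look like, the following is a minimal sketch and not the procedure used in the thesis. It assumes a resolver has already produced clusters of mentions given as (sentence index, start token, end token) spans. The function name `decontextualize`, the `PRONOUNS` set, and the choice of the first non-pronominal mention as the replacement are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mention format: (sentence_index, start_token, end_token_exclusive).
Mention = Tuple[int, int, int]

# Small illustrative pronoun list (an assumption; possessives are ignored).
PRONOUNS = {"he", "she", "it", "they", "him", "them"}


def mention_text(sentences: List[List[str]], mention: Mention) -> str:
    """Return the surface string of a mention."""
    sent, start, end = mention
    return " ".join(sentences[sent][start:end])


def decontextualize(
    sentences: List[List[str]],
    clusters: List[List[Mention]],
    target: int,
) -> str:
    """Rewrite sentence `target` so that its pronouns are replaced by the
    first non-pronominal mention of their coreference cluster."""
    replacements: Dict[int, Tuple[int, str]] = {}  # start -> (end, new text)
    for cluster in clusters:
        ordered = sorted(cluster)  # document order
        representative: Optional[Mention] = next(
            (m for m in ordered
             if mention_text(sentences, m).lower() not in PRONOUNS),
            None,
        )
        if representative is None:
            continue  # cluster contains only pronouns, nothing to substitute
        rep_text = mention_text(sentences, representative)
        for m in ordered:
            sent, start, end = m
            is_pronoun = mention_text(sentences, m).lower() in PRONOUNS
            if sent == target and is_pronoun:
                replacements[start] = (end, rep_text)

    # Rebuild the target sentence, swapping in the replacement strings.
    tokens = sentences[target]
    output: List[str] = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, text = replacements[i]
            output.append(text)
            i = end
        else:
            output.append(tokens[i])
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    sentences = [
        ["The", "router", "keeps", "rebooting", "."],
        ["Try", "resetting", "it", "to", "factory", "settings", "."],
    ]
    clusters = [[(0, 0, 2), (1, 2, 3)]]
    print(decontextualize(sentences, clusters, target=1))
```

In this toy case the second sentence can be read on its own once "it" is replaced by "The router". Casing and possessive forms are deliberately not handled, and the quality of the rewritten sentence depends entirely on the clusters the resolver supplies, which is why resolver errors can propagate into the downstream task.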

Document type:	Dissertation
Supervisor:	Strube, Prof. Dr. Michael
Place of Publication:	Heidelberg
Date of thesis defense:	28 August 2024
Date Deposited:	06 Mar 2026 08:36
Date:	2026
Faculties / Institutes:	Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification:	400 Linguistics