PDF, English (Coreference Resolution via Hypergraph Partitioning)
Coreference resolution is one of the most fundamental Natural Language Processing tasks, aiming to identify the coreference relation in texts. The task is to group mentions (i.e. phrases of interest) into sets, so that all mentions in one set refer to the same entity (i.e. a real world object). Mentions are conventionally proper names, common nouns and pronouns. Lately, the coreference task has been extended to deal with verb phrases too. However, we only work with noun phrase mentions in this thesis. By linking mentions together in a document, not only entities are recovered but also different fragments of the context are connected. This therefore leads to a better text understanding. Coreference resolution is essentially important to many applications, such as text summarization and information extraction. In this thesis, we propose a novel coreference model based on hypergraph partitioning. Our system is named COPA, standing for Coreference Partitioner. Given a raw document, COPA represents it as a hypergraph, upon which the hypergraph partitioning algorithms are applied to derive coreference sets directly.
<The Coreference Representation.>
The coreference relation is a high-dimensional relation, because it depends on multiple types of basic relations (e.g. string similarities and semantic relatedness). Most of the previous work on the coreference resolution task combines the basic relations between mentions into single ones and derives the coreference sets afterward. Since it is relatively expensive to learn the combination of the basic relations, we propose a novel hypergraph representation model for coreference resolution. In our model, the mentions are taken as vertices in the hypergraph and the relational features derived from the basic relations as hyperedges. The hypergraph allows for multiple edges between vertices, so that it suits the high-dimension property of the coreference relation. Moreover, in a hypergraph one hyperedge can connect more than two vertices. As a result the hypergraph directly represents the relations between sets of mentions as required for the coreference resolution task.
Since the basic relations are incorporated in an overlapping manner, COPA only needs a few training documents to achieve competitive performance. The weakly supervised nature makes COPA a good candidate when applying to different domains or languages, or when only limited training data is available.
<The Coreference Inference.>
The inference of the coreference resolution task deals with sets of mentions. It needs to capture the relations between multiple mentions in order to derive the final coreference sets. Therefore, we consider coreference resolution as a set problem. Most of the previous coreference models address the set problem by dividing the resolution into two steps --- a classification step and a clustering step. The classification step makes decisions for each pair of mentions on whether they are coreferent or not. Upon the pairwise decisions, the clustering step further groups mentions into the final sets. The two-step division makes the classification performance not necessarily positively correlated with the end evaluation numbers. It is difficult to track the error propagation and hard to optimize with respect to the final coreference sets. Moreover, since the coreference decisions are made between pairs of mentions independently, global context information is missing in those models.
In this thesis, we propose a global coreference model via hypergraph partitioning. We design two algorithms based on the spectral clustering technique --- a hierarchical R2 partitioner and a flat k-way flatK partitioner. We also propose extensions to the clustering algorithms of COPA, aiming to include constraints to enforce the cluster-level consistency. The constrained COPA is the first attempt towards a better learning scheme for our system. It solves the cluster-level inconsistency problem and at the same time contributes to research in the constrained graph clustering field.
<The Coreference Evaluation.>
Since COPA is an end-to-end coreference system, the important implementation issues encountered when applying clustering algorithms to practical uses are also addressed in this thesis. For instance, the existing evaluation metrics become problematic when the automatically identified mentions do not align with the ones in the ground truth. In this thesis, we propose variants of the coreference evaluation metrics to tackle this problem.
COPA outperforms several baseline systems in fair settings, using the same features and the same mentions and only comparing the effectiveness of the models themselves. It also performs competitively compared to the state-of-the-art systems across different evaluation metrics, different data sets and different domains.
|Supervisor:||Strube, Prof. Dr. Michael|
|Date of thesis defense:||4 April 2013|
|Date Deposited:||19 Dec 2013 13:19|
|Faculties / Institutes:||Neuphilologische Fakultät > Institut für Computerlinguistik|
|Subjects:||004 Data processing Computer science
|Uncontrolled Keywords:||Coreference Resolution, Graph Partitioning|