Abstract
Text summarization is a vital tool for making large bodies of text easily digestible for human readers. To develop powerful automatic summarization systems, researchers rely on evaluation protocols to assess the progress made by newly proposed summarizers. However, as we will show in this thesis, current protocols are not always sufficient to provide reliable feedback on summarizer performance across a wide range of quality dimensions. In this work, we will thus aim to develop a framework for holistic evaluation of text summarization that covers a broad range of quality dimensions and evaluation settings. In addition to this holistic coverage of quality dimensions and settings, two criteria will guide our investigations: reliability, which ensures evaluations lead to comparable results across different settings, and cost-efficiency, which is critical to ensure evaluations can be run frequently and exhaustively.
We will begin our investigation at the "gold standard" of summarization evaluation, the human evaluation study. Here, we will show weaknesses in current practices that jeopardize their reliability. Our work will formulate concrete proposals to improve current practices and make human studies both more reliable and cost-efficient. Since even cost-efficient human evaluation is still prohibitively expensive for many use cases, we will then turn our attention to automatic evaluation, starting with an assessment of common meta-evaluation practices. We find that current practices are at risk of leading to unreliable conclusions about evaluation metric performance. We will use these insights to conduct an in-depth meta-evaluation of automatic summary coherence measures. In the final two parts of this thesis, we will then focus on automatic evaluation for two important quality dimensions, which have only recently started to receive attention in text summarization: faithfulness and bias. For faithfulness, which is the degree to which a summary correctly reproduces facts from the input, we find that currently proposed metrics are usually computationally expensive. This motivates us to search for a cost-efficient automatic faithfulness metric. Finally, we find that social bias, a frequently studied phenomenon in other NLP tasks, has not yet been systematically investigated for text summarization. We will thus provide both abstract definitions and practical automatic metrics to assess the presence of bias in summarization systems.
As a whole, our work will provide researchers and users who are interested in the performance of summarization systems with a toolbox to cost-efficiently and reliably assess summarizers across key quality dimensions.
| Document type | Dissertation |
|---|---|
| Supervisor | Markert, Prof. Dr. Katja |
| Place of Publication | Heidelberg |
| Date of thesis defense | 25 February 2025 |
| Date Deposited | 12 Jun 2025 07:40 |
| Date | 2025 |
| Faculties / Institutes | Neuphilologische Fakultät > Institut für Computerlinguistik |
| DDC-classification | 004 Data processing, computer science |