
Cross-lingual Single-document Abstractive Summarization for Science Journalism

Fatima, Mehwish

PDF, English - main document
Download (5MB) | License: "Cross-lingual Single-document Abstractive Summarization for Science Journalism" by Fatima, Mehwish is licensed under a Creative Commons Attribution 4.0 license.

Citation of documents: Please do not cite the URL displayed in your browser's location bar; instead, use the DOI, URN or the persistent URL below, as we can guarantee their long-term accessibility.

Abstract

We introduce Cross-lingual Science Journalism, a new task and a use case of cross-lingual single-document abstractive summarization. Cross-lingual Science Journalism aims to generate popular science summaries in a local language from scientific articles in a source language for non-expert readers. These cross-lingual popular summaries have properties distinct from regular scientific texts: they are more concise and readable than the source articles and are written in a different language. Cross-lingual Science Journalism thus aims to bridge the gap between curious local communities and scientific research. A real-world example is Spektrum der Wissenschaft, which converts complex English scientific articles into popular science summaries in German for non-expert audiences.

In this thesis, we focus on (1) curating datasets for summarization in general and science journalism in particular, (2) analyzing the performance of existing summarization models, and (3) developing and evaluating models for Cross-lingual Science Journalism.

For data collection and verification, we create two cross-lingual summarization datasets from online sources by devising systematic methods. Our datasets are collected from Spektrum der Wissenschaft and the Wikipedia Science Portal for the English-German language pair. Part of the Spektrum dataset comes from their private domain, so it is accessible only to authorized users. The Wikipedia dataset, however, is collected from the public domain and is publicly available to the research community. We perform a thorough analysis of different statistical and readability features to investigate the linguistic properties of our datasets.
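
The thesis abstract does not name the specific readability features used; the following is a minimal sketch of the kind of per-document readability profiling described above, assuming the textstat package and placeholder field names (source_text, summary_text) that are not from the thesis.

    # Sketch: readability profiling of a source article vs. its popular summary.
    # The textstat package and the example fields are assumptions for illustration.
    import textstat

    def readability_profile(text: str) -> dict:
        """Compute a few standard readability scores for one document."""
        return {
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
            "avg_sentence_length": textstat.avg_sentence_length(text),
        }

    example = {
        "source_text": "The mitochondrion is the site of oxidative phosphorylation ...",
        "summary_text": "Mitochondria are the tiny power plants of our cells ...",
    }
    for field in ("source_text", "summary_text"):
        print(field, readability_profile(example[field]))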

As the second step, we evaluate the collected datasets for the summarization task. For this purpose, we apply several existing summarization models with different training and evaluation strategies and assess their performance on our datasets with automatic and human evaluation. We also analyze the outputs of existing abstractive models to identify their limitations.
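
As an illustration of the automatic evaluation mentioned above, a minimal ROUGE scoring sketch using the rouge-score package might look as follows; the reference and prediction strings are placeholders, not examples from the datasets.

    # Sketch: automatic evaluation with ROUGE (rouge-score package assumed).
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    reference = "Mitochondria supply the energy that keeps our cells running."
    prediction = "Mitochondria provide the energy our cells need to run."

    scores = scorer.score(reference, prediction)  # target first, prediction second
    for name, score in scores.items():
        print(f"{name}: F1 = {score.fmeasure:.3f}")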

To address the limitations of existing models, we create a pipeline model, Select, Simplify and Rewrite (SSR). SSR combines an extractive summarizer, a simplification model and a cross-lingual abstractive summarizer to generate cross-lingual popular science summaries. We empirically investigate the performance of SSR on our datasets and explore the contribution of each component with three different evaluation metrics. SSR outperforms strong baselines with 99% confidence, a result further supported by human judgment and readability analysis.
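
The abstract specifies only the three-stage structure of SSR, not its concrete components. The sketch below illustrates that select-simplify-rewrite pipeline shape with off-the-shelf Hugging Face models as stand-ins; all model choices are assumptions and differ from the thesis's actual components.

    # Illustrative three-stage pipeline in the spirit of SSR (Select, Simplify, Rewrite).
    # All model choices are placeholders, not the components used in the thesis.
    from transformers import pipeline

    # Stage 1: Select - pick the salient content (a generic summarizer stands in
    # for a true extractive selector here).
    selector = pipeline("summarization", model="facebook/bart-large-cnn")

    # Stage 2: Simplify - rewrite the selected content in plainer English
    # (a dedicated simplification model would be used in practice).
    simplifier = pipeline("summarization", model="facebook/bart-large-cnn")

    # Stage 3: Rewrite - produce the German summary (approximated here by
    # an English-to-German translation model).
    rewriter = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

    def ssr_like(article: str) -> str:
        selected = selector(article, max_length=200, min_length=60)[0]["summary_text"]
        simplified = simplifier(selected, max_length=120, min_length=40)[0]["summary_text"]
        return rewriter(simplified)[0]["translation_text"]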

We further investigate Cross-lingual Science Journalism by developing an end-to-end model, SimCSum, which jointly trains simplification and cross-lingual summarization to improve the quality of the generated summaries. We empirically evaluate SimCSum against several baselines with three evaluation metrics. SimCSum outperforms the baselines with 99% confidence, a result further supported by human evaluation, readability analysis and error analysis.
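
The abstract describes SimCSum only as joint training of simplification and cross-lingual summarization. A generic multi-task sketch of such a joint objective, a shared sequence-to-sequence model trained with a weighted sum of the two task losses, is shown below; the base checkpoint, loss weighting and all names are assumptions, not the thesis's actual architecture.

    # Generic multi-task training step: one shared model, two task losses.
    # Checkpoint, weighting and names are illustrative assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    def task_loss(source: str, target: str) -> torch.Tensor:
        """Cross-entropy loss of one (input, output) pair for one task."""
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        labels = tokenizer(text_target=target, return_tensors="pt", truncation=True).input_ids
        return model(**inputs, labels=labels).loss

    def joint_step(article_en: str, simple_en: str, summary_de: str, alpha: float = 0.5) -> float:
        """One joint update: weighted sum of simplification and summarization losses."""
        loss = (alpha * task_loss(article_en, simple_en)            # simplification (EN -> simple EN)
                + (1 - alpha) * task_loss(article_en, summary_de))  # cross-lingual summarization (EN -> DE)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()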

To conclude, this work provides a foundation for the new Cross-lingual Science Journalism task and can help it flourish. In the future, our Wikipedia dataset can help the community explore and extend research in cross-lingual scientific summarization and Cross-lingual Science Journalism. Moreover, our models provide a basis for developing cross-lingual scientific summarization and journalism models. Since our models are based on generalized methods, they and their derived solutions can be deployed in other domains.

Document type: Dissertation
Supervisor: Strube, Prof. Dr. Michael
Place of Publication: Heidelberg
Date of thesis defense: 5 February 2024
Date Deposited: 18 Sep 2024 06:38
Date: 2024
Faculties / Institutes: Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification: 000 Generalities, Science
004 Data processing Computer science
400 Linguistics
500 Natural sciences and mathematics
Controlled Keywords: Natural Language Processing, Cross-lingual Summarization, Science Journalism, Deep Learning, LLMs