
Cross-lingual Semantic Role Labeling through Translation and Multilingual Learning

Daza Arevalo, Jose Angel

Full text: Daza_Thesis_Revised.pdf — PDF, English (4 MB) | Terms of use

Citation of documents: Please do not cite the URL displayed in your browser's address bar; instead, use the DOI, URN, or persistent URL below, as only these are guaranteed to remain accessible long-term.

Abstract

Understanding an event means being able to answer the question "Who did what to whom?" (and perhaps also how, when, where...). The "did what" in this question refers to the event, which is directly linked to a predicate; the predicate admits event-specific roles for the participants that take part in the event. Semantic Role Labeling (SRL) is the task of assigning semantic argument structures to words or phrases in a sentence, comprising the predicate, its sense, the participants, and the roles they play in the event or state of affairs.
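To make the task concrete, the following sketch shows one possible representation of an SRL analysis for a single sentence. The role labels (A0, A1, A2, AM-LOC) and the predicate sense follow PropBank conventions; the data structure itself is an illustrative assumption, not the format used in the thesis.

```python
# Illustrative sketch: a PropBank-style SRL analysis for one sentence,
# answering "Who did what to whom (and where)?".
sentence = "Mary sold the book to John in Paris"

srl_analysis = {
    "predicate": "sold",
    "sense": "sell.01",           # predicate sense (commercial transaction)
    "roles": {
        "A0": "Mary",             # who?     (agent / seller)
        "A1": "the book",         # what?    (thing sold)
        "A2": "to John",          # to whom? (recipient / buyer)
        "AM-LOC": "in Paris",     # where?   (location modifier)
    },
}

for role, span in srl_analysis["roles"].items():
    print(f"{role}: {span}")
```

The task of an SRL system is to predict such a structure — predicate, sense, and role-labeled argument spans — from the raw sentence alone.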

Nowadays the prevailing method for SRL is supervised learning; the quality of SRL systems therefore depends on annotated training resources. In this thesis we address the problem of improving SRL performance for languages other than English. Because annotating SRL resources is time-consuming, recent improvements in SRL have focused mainly on English. Especially since deep learning became the state of the art (SOTA) in Natural Language Processing (NLP), annotated resources in other languages have been insufficient to keep pace with the progress we witness for English.

Earlier research has tried to address the lack of training resources in specific languages with bilingual annotation projection methods, or with monolingual data augmentation approaches that generate more labeled data for training a labeler. Instead, in this work we explore a novel and flexible Encoder-Decoder architecture for SRL that is robust enough to work with more than two languages at the same time, immediately benefiting from more available training data. We are the first to apply sequence transduction to monolingual and cross-lingual SRL, and we show that the Encoder-Decoder architecture yields performance competitive with sequence-labeling approaches. Moreover, by capitalizing on existing Machine Translation (MT) research, our model learns to translate from English into other target languages and to label predicates and semantic roles on the target side within a single inference step. We show that, similar to multi-source machine translation, the proposed architecture can profit from multiple input languages and from knowledge learned during translation to improve labeling performance on otherwise resource-poor target languages. We see potential for future development of this framework for diverse structured prediction tasks.
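The single-step translate-and-label idea can be sketched as follows: the decoder emits the target-language translation with predicate and role markers inlined in the output sequence, so translation and labeling happen in one pass. The bracket format, the German example output, and the parsing helper below are assumptions for illustration only, not the exact serialization used in the thesis.

```python
import re

# Hypothetical label-augmented decoder output for the English source
# "Mary sold the book": the German translation with inline SRL markers,
# where "(# ... #)" delimits a span and the following token is its label.
source = "Mary sold the book"
target = "(# Mary #) A0 (# verkaufte #) sell.01 (# das Buch #) A1"

def parse_labeled_output(seq):
    """Recover (span, label) pairs from a label-augmented target sequence."""
    return re.findall(r"\(# (.*?) #\) (\S+)", seq)

print(parse_labeled_output(target))
# [('Mary', 'A0'), ('verkaufte', 'sell.01'), ('das Buch', 'A1')]
```

Stripping the markers recovers the plain translation, while the (span, label) pairs recover the SRL annotation on the target side — both obtained from a single decoding step.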

In addition, this work addresses the long-standing problem of SRL annotation incompatibility across languages found in existing corpora; these divergences hinder the development of unified multilingual solutions for this task. To address and alleviate this problem, we define an automatic process for creating a new multilingual SRL corpus which is parallel, contains unified predicate senses and semantic roles across languages, and includes a manually validated test set on source and target sides. We demonstrate that this corpus is better suited than existing ones when used for joint multilingual training with neural models on lower-resource languages. Our work on this corpus is restricted to German, French, and Spanish as target languages; however, we see great potential to extend it to further languages.

In short, we propose the first model that can solve the SRL task in a single language as well as perform cross-lingual SRL via joint translation and semantic argument structure labeling, leveraging high-quality MT. Additionally, our novel annotation projection method allows us to transfer existing annotations into new languages, creating a densely labeled parallel cross-lingual SRL resource with human-validated test data.

Document type: Dissertation
Supervisor: Frank, Prof. Dr. Anette
Place of Publication: Heidelberg
Date of thesis defense: 28 May 2021
Date Deposited: 11 Jul 2022 09:06
Date: 2022
Faculties / Institutes: Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification: 004 Data processing Computer science
400 Linguistics
420 English
490 Other languages