
Improving Neural Sequence-to-Sequence Learning via Data Enhancement

Lam, Tsz Kin

PDF, English - main document
Download (3MB) | License: Improving Neural Sequence-to-Sequence Learning via Data Enhancement by Lam, Tsz Kin is subject to the terms of the Creative Commons Attribution 4.0 license.

Citation of documents: Please do not cite the URL displayed in your browser's address bar; instead, use the DOI, URN, or the persistent URL below, as we can guarantee their long-term accessibility.

Abstract

In recent years, deep learning has revolutionized many areas of life as the driving technology of artificial intelligence. One of the reasons for its success is its use of huge amounts of data and computing resources. However, for many applications, such data are scarce, and straightforward solutions to overcome data scarcity via expert annotation or crowd-sourcing are costly or result in low-quality data. The goal of my thesis is to investigate data enhancement algorithms as automatic and cost-effective alternatives to manual data annotation, with the additional benefit of improved robustness and generalization of models trained on the enhanced data. In particular, we investigate algorithms for data augmentation, data selection, and data correction. Our focus is on neural sequence-to-sequence learning, a fundamental deep learning technique behind a wide range of commercial products such as machine translation and speech recognition, which are essential for breaking language barriers between people of different origins.

In data augmentation, we devise algorithms for reassembling new and effective training data from the given parallel data via segmentation and recombination. These within-corpus augmentation algorithms are simple and effective owing to three properties: they operate 1) on-the-fly, 2) memory-efficiently, and 3) with source-target alignment. We demonstrate their effectiveness on speech recognition and speech-to-text translation.
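A minimal Python sketch of what such within-corpus recombination could look like, assuming word-level source-target alignment points are available per pair; the `recombine` function and the single-cut splicing strategy are illustrative assumptions, not the thesis's exact algorithm:

    import random

    def recombine(pair_a, pair_b, align_a, align_b):
        """Form a new training pair by splicing aligned halves of two pairs.

        pair_x:  (source_tokens, target_tokens)
        align_x: list of (src_idx, tgt_idx) cut points at which the
                 source and target segments stay aligned.
        """
        (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
        # Sample one aligned cut point per pair so the spliced halves
        # still correspond on both sides (the "source-target alignment"
        # property mentioned in the text).
        sa, ta = random.choice(align_a)
        sb, tb = random.choice(align_b)
        return src_a[:sa] + src_b[sb:], tgt_a[:ta] + tgt_b[tb:]

    def augment_batch(batch, alignments):
        """New pairs are built per batch and never stored, which is what
        makes the scheme on-the-fly and memory-efficient."""
        out = list(batch)
        for _ in range(len(batch)):
            i, j = random.sample(range(len(batch)), 2)
            out.append(recombine(batch[i], batch[j],
                                 alignments[i], alignments[j]))
        return out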

In data selection, we aim to remove noisy training data with respect to the targeted data instances. We devise an algorithm for selecting pseudo labels based on translation performance in a cascaded speech-to-text translation system. In addition, we examine the use of influence functions, an attribution technique, in neural machine translation. Influence functions have been shown to be useful in classification tasks such as image recognition and toxic speech detection. We analyze their properties and illustrate the challenges of applying them to neural machine translation.
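A hedged sketch of the pseudo-label selection idea in a cascaded system: keep an ASR transcript (the pseudo label) only if translating it reproduces the reference translation well enough. The `asr` and `mt` callables, the threshold value, and the use of sentence-level BLEU as the quality signal are illustrative assumptions:

    from sacrebleu import sentence_bleu

    def select_pseudo_labels(examples, asr, mt, threshold=30.0):
        """Filter ASR pseudo labels by downstream translation quality.

        examples: iterable of (audio, reference_translation) pairs
        asr:      callable audio -> transcript (the pseudo label)
        mt:       callable transcript -> translation
        """
        kept = []
        for audio, ref_translation in examples:
            transcript = asr(audio)        # pseudo label for the ASR stage
            translation = mt(transcript)   # run it through the MT stage
            score = sentence_bleu(translation, [ref_translation]).score
            if score >= threshold:         # keep only pseudo labels whose
                kept.append((audio, transcript))  # errors don't hurt MT
        return kept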

In data correction, we aim at efficient personalization of a neural machine translation system via human-in-the-loop training. We integrate lightweight feedback such as "keep", "delete", and "substitute" into model training within an active-learning-based interactive process. In simulation, we show that such lightweight feedback can produce a machine translation model competitive with one trained with the standard cross-entropy loss on gold-reference translations.
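A minimal PyTorch sketch of one way such token-level feedback could enter the training loss: "keep" tokens are trained toward themselves, "substitute" tokens toward the user's replacement, and "delete" tokens are masked out. The exact weighting and the active-learning loop in the thesis may differ; this only illustrates the general shape of the signal:

    import torch
    import torch.nn.functional as F

    KEEP, DELETE, SUBSTITUTE = 0, 1, 2

    def feedback_loss(logits, tokens, feedback, substitutions):
        """Cross-entropy over user-endorsed tokens only.

        logits:        (seq_len, vocab) model outputs
        tokens:        (seq_len,) tokens of the shown hypothesis
        feedback:      (seq_len,) KEEP / DELETE / SUBSTITUTE per token
        substitutions: (seq_len,) replacement ids (used where SUBSTITUTE)
        """
        # Train "substitute" positions toward the replacement token and
        # "keep" positions toward the shown token; "delete" contributes
        # nothing to the loss.
        targets = torch.where(feedback == SUBSTITUTE, substitutions, tokens)
        mask = feedback != DELETE
        losses = F.cross_entropy(logits, targets, reduction="none")
        return (losses * mask).sum() / mask.sum().clamp(min=1)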

Document type: Dissertation
Supervisor: Riezler, Prof. Dr. Stefan
Place of Publication: Heidelberg
Date of thesis defense: 12 June 2023
Date Deposited: 07 Jan 2025 11:54
Date: 2024
Faculties / Institutes: Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification: 004 Data processing Computer science