title: Improving Neural Sequence-to-Sequence Learning via Data Enhancement
creator: Lam, Tsz Kin
subject: ddc-004
subject: 004 Data processing Computer science
description: In recent years, deep learning has revolutionized many areas of life as the driving technology of artificial intelligence. One of the reasons for its success is its use of huge amounts of data and computing resources. However, for many applications such data are scarce, and straightforward solutions to overcome data scarcity via expert annotation or crowd-sourcing are costly or yield low-quality data. The goal of my thesis is to investigate data enhancement algorithms as automatic and cost-effective alternatives to manual data annotation, with the additional benefit of improved robustness and generalization of models trained on the enhanced data. In particular, we investigate algorithms for data augmentation, data selection, and data correction. Our focus is on neural sequence-to-sequence learning, a fundamental deep learning technique behind a wide range of commercial products, such as machine translation and speech recognition, that are essential in breaking language barriers between people of different origins. In data augmentation, we devise algorithms for reassembling new and effective training data from the given parallel data via segmentation and recombination. These within-corpus augmentation algorithms are simple and effective because they possess three properties: 1) \textit{on-the-fly}, 2) \textit{memory-efficient} and 3) \textit{source-target alignment}. We demonstrate their effectiveness on speech recognition and speech-to-text translation. In data selection, we aim to remove noisy training data with respect to the targeted data instances. We devise an algorithm for selecting pseudo-labels based on translation performance in a cascaded speech-to-text translation system.
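The within-corpus augmentation described above can be illustrated with a minimal sketch: cut two aligned training examples at a segment boundary and swap their tails, so that the recombined pairs stay source-target aligned by construction and can be generated on the fly inside a data loader rather than stored as an enlarged corpus. The function name, segment representation, and cut strategy below are illustrative assumptions, not the thesis's actual algorithm.

```python
import random

def recombine(pair_a, pair_b, rng=random.Random(0)):
    """Cut two aligned examples at a segment boundary and swap tails.

    Each example is a list of (source_segment, target_segment) tuples,
    so source-target alignment is preserved by construction.
    """
    cut_a = rng.randrange(1, len(pair_a))  # boundary inside example A
    cut_b = rng.randrange(1, len(pair_b))  # boundary inside example B
    new_a = pair_a[:cut_a] + pair_b[cut_b:]
    new_b = pair_b[:cut_b] + pair_a[cut_a:]
    return new_a, new_b

# On-the-fly use: augment examples as a minibatch is assembled,
# without materializing an enlarged corpus (memory-efficient).
ex1 = [("guten", "good"), ("morgen", "morning")]
ex2 = [("vielen", "many"), ("dank", "thanks")]
aug1, aug2 = recombine(ex1, ex2)
```

Because each segment pair travels as a unit, any recombination of segments yields a new example whose source and target sides still correspond.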
In addition, we examine the use of Influence Functions, an attribution technique, in neural machine translation. Influence Functions have been shown to be useful in classification tasks such as image recognition and toxic speech detection. We analyze their properties and illustrate the challenges of applying them to neural machine translation. In data correction, we aim at efficient personalization of a neural machine translation system via human-in-the-loop training. We integrate lightweight feedback such as ``keep'', ``delete'' and ``substitute'' into model training in an active-learning-based interactive process. In our simulations, we show that such lightweight feedback can produce a machine translation model competitive with one trained with the standard cross-entropy loss on gold-reference translations.
date: 2024
type: Dissertation
type: info:eu-repo/semantics/doctoralThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/35838/1/_Tsz_Kin__PhD_thesis.pdf
identifier: DOI:10.11588/heidok.00035838
identifier: urn:nbn:de:bsz:16-heidok-358381
identifier: Lam, Tsz Kin (2024) Improving Neural Sequence-to-Sequence Learning via Data Enhancement. [Dissertation]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/35838/
rights: info:eu-repo/semantics/openAccess
rights: Please see front page of the work (Sorry, Dublin Core plugin does not recognise license id)
language: eng
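For context on the attribution technique named in the abstract: the standard influence-function approximation from the literature (not a formula taken from the thesis itself) estimates the effect of upweighting a training point $z$ on the loss at a test point $z_{\mathrm{test}}$ as

```latex
\mathcal{I}(z, z_{\mathrm{test}})
  = -\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}
    \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),
```

where $\hat{\theta}$ is the trained model's parameters and $H_{\hat{\theta}}$ is the empirical Hessian of the training loss. Inverting or approximating $H_{\hat{\theta}}$ at the scale of sequence-to-sequence models is one source of the practical challenges the abstract alludes to.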