title: Improving Neural Sequence-to-Sequence Learning via Data Enhancement
creator: Lam, Tsz Kin
subject: ddc-004
subject: 004 Data processing Computer science
description: In recent years, deep learning has revolutionized many areas of life as the driving technology of artificial intelligence. One of the reasons for its success is its use of huge amounts of data and computing resources. However, for many applications such data are scarce, and straightforward solutions to overcome data scarcity via expert annotation or crowd-sourcing are costly or yield low-quality data. The goal of my thesis is to investigate data enhancement algorithms as automatic and cost-effective alternatives to manual data annotation, with the additional benefit of improved robustness and generalization of models trained on the enhanced data. In particular, we investigate algorithms for data augmentation, data selection, and data correction. Our focus is on neural sequence-to-sequence learning, a fundamental deep learning technique behind a wide range of commercial products, such as machine translation and speech recognition, that are essential in breaking language barriers between people of different origins. In data augmentation, we devise algorithms for reassembling new and effective training data from the given parallel data via segmentation and recombination. These within-corpus augmentation algorithms are simple and effective because they possess three properties: 1) \textit{on-the-fly}, 2) \textit{memory-efficient} and 3) \textit{source-target alignment}. We demonstrate their effectiveness on speech recognition and speech-to-text translation. In data selection, we aim to remove noisy training data with respect to the targeted data instances. We devise an algorithm for selecting pseudo-labels based on translation performance in a cascaded speech-to-text translation system.
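The within-corpus augmentation described above can be illustrated with a minimal sketch: cut two aligned training examples at a segment boundary and swap their tails, so that the recombined pairs stay source-target aligned by construction and can be generated on the fly inside a data loader rather than stored as an enlarged corpus. The function name, segment representation, and cut strategy below are illustrative assumptions, not the thesis's actual algorithm.

```python
import random

def recombine(pair_a, pair_b, rng=random.Random(0)):
    """Cut two aligned examples at a segment boundary and swap tails.

    Each example is a list of (source_segment, target_segment) tuples,
    so source-target alignment is preserved by construction.
    """
    cut_a = rng.randrange(1, len(pair_a))  # boundary inside example A
    cut_b = rng.randrange(1, len(pair_b))  # boundary inside example B
    new_a = pair_a[:cut_a] + pair_b[cut_b:]
    new_b = pair_b[:cut_b] + pair_a[cut_a:]
    return new_a, new_b

# On-the-fly use: augment examples as a minibatch is assembled,
# without materializing an enlarged corpus (memory-efficient).
ex1 = [("guten", "good"), ("morgen", "morning")]
ex2 = [("vielen", "many"), ("dank", "thanks")]
aug1, aug2 = recombine(ex1, ex2)
```

Because each segment pair travels as a unit, any recombination of segments yields a new example whose source and target sides still correspond.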
In addition, we examine the use of Influence Functions, an attribution technique, in neural machine translation. Influence Functions have been shown to be useful in classification tasks such as image recognition and toxic speech detection. We analyze their properties and illustrate the challenges of applying them to neural machine translation. In data correction, we aim at efficient personalization of a neural machine translation system via human-in-the-loop training. We integrate lightweight feedback such as ``keep'', ``delete'' and ``substitute'' into model training in an active-learning-based interactive process. In our simulations, we show that such lightweight feedback can produce a machine translation model competitive with one trained with the standard cross-entropy loss on gold-reference translations.
date: 2024
type: Dissertation
type: info:eu-repo/semantics/doctoralThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/35838/1/_Tsz_Kin__PhD_thesis.pdf
identifier: DOI:10.11588/heidok.00035838
identifier: urn:nbn:de:bsz:16-heidok-358381
identifier: Lam, Tsz Kin (2024) Improving Neural Sequence-to-Sequence Learning via Data Enhancement. [Dissertation]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/35838/
rights: info:eu-repo/semantics/openAccess
rights: Please see front page of the work (Sorry, Dublin Core plugin does not recognise license id)
language: eng
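For context on the attribution technique named in the abstract: the standard influence-function approximation from the literature (not a formula taken from the thesis itself) estimates the effect of upweighting a training point $z$ on the loss at a test point $z_{\mathrm{test}}$ as

```latex
\mathcal{I}(z, z_{\mathrm{test}})
  = -\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}
    \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),
```

where $\hat{\theta}$ is the trained model's parameters and $H_{\hat{\theta}}$ is the empirical Hessian of the training loss. Inverting or approximating $H_{\hat{\theta}}$ at the scale of sequence-to-sequence models is one source of the practical challenges the abstract alludes to.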