Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP

Lawrence, Carolin

PDF, English - main document

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN, or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00026477
URN: urn:nbn:de:bsz:16-heidok-264771

Abstract

Many applications nowadays, such as a rising number of virtual personal assistants, rely on statistical machine-learnt models. Training such models typically requires large amounts of labelled data, which are expensive and difficult to obtain. In this thesis, we investigate two approaches that alleviate the need for labelled data by leveraging feedback on model outputs instead. Both approaches are applied to two sequence-to-sequence tasks in Natural Language Processing (NLP): machine translation and semantic parsing for question-answering. Additionally, we define a new question-answering task based on the geographical database OpenStreetMap (OSM) and collect a corpus, NLmaps v2, with 28,609 question-parse pairs. With this corpus, we build semantic parsers for subsequent experiments. Furthermore, we are the first to design a natural language interface to OSM, for which we specifically tailor a parser.

The first approach to learning from feedback on model outputs considers a scenario where weak supervision is available by grounding the model in a downstream task for which labelled data has been collected. Feedback obtained from the downstream task is used to improve the model in a response-based on-policy learning setup. We apply this approach to improve a machine translation system, which is grounded in a multilingual semantic parsing task, by employing ramp loss objectives. Next, we improve a neural semantic parser where only gold answers, but not gold parses, are available, by lifting ramp loss objectives to non-linear neural networks.

In the second approach to learning from feedback, instead of collecting expensive labelled data, a model is deployed and user-model interactions are recorded in a log. This log is used to improve a model in a counterfactual off-policy learning setup. We first exemplify this approach on a domain adaptation task for machine translation. Here, we show that counterfactual learning can be applied to tasks with large output spaces and that, in contrast to prevalent theory, deterministic logs can successfully be used on sequence-to-sequence tasks for NLP. Next, we demonstrate on a semantic parsing task that counterfactual learning can also be applied when the underlying model is a neural network and feedback is collected from human users. Applying both approaches to the same semantic parsing task allows us to draw a direct comparison between them. Response-based on-policy learning outperforms counterfactual off-policy learning but requires expensive labelled data for the downstream task, whereas interaction logs for counterfactual learning can be easier to obtain in various scenarios.
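As a rough illustration of the counterfactual setup described in the abstract, the following Python sketch estimates how well a current model would fare using only a deterministic interaction log. It is a minimal sketch under stated assumptions, not the formulation used in the thesis: the log format, the function name `estimate_from_deterministic_log`, and the `reweight` option are hypothetical. The underlying idea is that a deterministic logging system always produces its output with probability 1, so inverse propensity scoring reduces to weighting each logged feedback value by the probability the current model assigns to the logged output.

```python
import math
from typing import List, Tuple

# One log entry (hypothetical format): the log-probability that the CURRENT
# model assigns to the output that was shown to the user, and the feedback
# that output received, scaled to [0, 1].
LogEntry = Tuple[float, float]


def estimate_from_deterministic_log(log: List[LogEntry],
                                    reweight: bool = False) -> float:
    """Estimate the expected feedback of the current model from logged data.

    The logging system is assumed to be deterministic (propensity 1), so
    each feedback value is simply weighted by the current model's
    probability of the logged output.
    """
    if not log:
        raise ValueError("log must not be empty")
    probs = [math.exp(log_prob) for log_prob, _ in log]
    weighted = sum(p * feedback for p, (_, feedback) in zip(probs, log))
    if reweight:
        # Self-normalised variant: divide by the total probability mass
        # placed on logged outputs instead of by the number of entries.
        return weighted / sum(probs)
    return weighted / len(log)


# Two logged interactions: one output with positive feedback, one without.
example_log = [(math.log(0.5), 1.0), (math.log(0.25), 0.0)]
plain = estimate_from_deterministic_log(example_log)
normalised = estimate_from_deterministic_log(example_log, reweight=True)
```

In a learning setup, a quantity of this kind would be maximised with respect to the model parameters. The self-normalised variant is shown because a model could otherwise raise the plain estimate merely by increasing the probability of every logged output, regardless of the feedback it received.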

Translation of abstract (German)

Viele Anwendungen basieren heutzutage auf statistischen, maschinell erlernten Modellen, wie z.B. eine steigende Anzahl von virtuellen persönlichen Assistenten. Um statistische Modelle zu trainieren, sind typischerweise große Mengen an parallelen Daten erforderlich, welche teuer und schwer zu beschaffen sind. In dieser Arbeit werden wir zwei Ansätze untersuchen, die den Bedarf an parallelen Daten verringern, indem stattdessen “Feedback” für Modellausgaben verwendet wird. Beide Szenarien werden auf zwei “Sequence-to-Sequence”-Aufgaben für “Natural Language Processing (NLP)” angewendet: maschinelle Übersetzung und semantisches Parsen für die Beantwortung von Fragen. Zusätzlich definieren wir eine neue Aufgabe für die Beantwortung von Fragen auf Basis der geographischen Datenbank OpenStreetMap (OSM). Hierfür sammeln wir ein Korpus, NLmaps v2, mit 28.609 Frage-Parse-Paaren. Mit dem Korpus bauen wir semantische Parser für spätere Experimente. Darüber hinaus sind wir die Ersten, die eine natürlichsprachliche Schnittstelle zu OSM entwerfen, wofür wir speziell einen Parser anpassen.

Der erste Ansatz, um von “Feedback” für Modellausgaben zu lernen, sieht ein Szenario vor, bei dem ein schwaches Lernsignal vorhanden ist. Das Modell wird in einer nachfolgenden Aufgabe verankert, für die parallele Daten vorhanden sind. Das “Feedback”, welches die nachfolgende Aufgabe vergibt, wird zur Verbesserung des Modells in einem “response-based on-policy learning”-Setup verwendet. Dieser Ansatz wird zunächst verwendet, um ein maschinelles Übersetzungssystem zu verbessern. Dieses ist in einem Problem für multilinguales semantisches Parsen verankert, und es werden “ramp loss objectives” zur Verbesserung des Systems verwendet. Als nächstes verbessern wir einen semantischen Parser, für den nur Gold-Antworten, aber keine Gold-Parses vorhanden sind, indem wir “ramp loss objectives” auf nicht-lineare neuronale Netzwerke anwenden.

Im zweiten Ansatz, um aus “Feedback” zu lernen, wird, anstelle der Sammlung teurer paralleler Daten, ein Modell eingesetzt, um Nutzer-Modell-Interaktionen in einer Logdatei zu sammeln. Diese Logdatei wird verwendet, um ein Modell in einem “counterfactual off-policy learning”-Setup zu verbessern. Wir verwenden diesen Ansatz zunächst, um ein maschinelles Übersetzungssystem an eine neue Domäne anzupassen. Hier zeigen wir, dass dieser Ansatz auf Aufgaben mit großen Ausgabemengen angewendet werden kann und dass, im Gegensatz zur gängigen Theorie, deterministische Logdateien erfolgreich bei “Sequence-to-Sequence”-Aufgaben für “NLP” eingesetzt werden können. Als nächstes demonstrieren wir anhand eines semantischen Parsers, dass der Ansatz auch dann angewendet werden kann, wenn das zugrunde liegende Modell ein neuronales Netzwerk ist und das “Feedback” von menschlichen Nutzern gesammelt wurde. Die Anwendung beider Ansätze auf dasselbe Problem für semantisches Parsen ermöglicht es uns, einen direkten Vergleich zu ziehen. “Response-based on-policy learning” übertrifft “counterfactual off-policy learning”, aber es benötigt teure parallele Daten für die nachfolgende Aufgabe, während Logdateien von Nutzer-System-Interaktionen für “counterfactual off-policy learning” in verschiedenen Szenarien einfacher zu erhalten sind.

Document type:	Dissertation
Supervisor:	Riezler, Prof. Dr. Stefan
Place of Publication:	Heidelberg, Germany
Date of thesis defense:	3 May 2019
Date Deposited:	29 May 2019 06:16
Date:	2019
Faculties / Institutes:	Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification:	000 Generalities, Science; 004 Data processing Computer science; 310 General statistics; 400 Linguistics; 420 English; 490 Other languages; 500 Natural sciences and mathematics
Controlled Keywords:	Maschinelles Lernen, Bestärkendes Lernen <Künstliche Intelligenz>
Uncontrolled Keywords:	artificial intelligence, natural language processing, sequence-to-sequence learning, semantic parsing, question-answering, machine translation, learning from feedback, response-based on-policy learning, grounded learning, counterfactual off-policy learning, off-policy reinforcement learning, bandit learning