Topics in structured nonparametric regression: uncoupled isotonic regression and tree-based learning

Blum, Ricardo

German Title: Themen der strukturierten nichtparametrischen Regression: entkoppelte isotone Regression und baumbasierte Lernverfahren

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00038373
URN: urn:nbn:de:bsz:16-heidok-383733

Abstract

This thesis brings together several contributions to nonparametric regression, ranging from isotonic regression with uncoupled data to the analysis of tree-based methods, from both theoretical and applied viewpoints. On the one hand, we consider the estimation of an isotonic regression function in a setting where the pairing between observed responses and design points is unavailable to the statistician. This makes the problem significantly different from standard regression problems, where observations come as pairs. Under the assumption that the noise variance decays to zero, we consider two estimators that rely on distributional properties of the observed responses, without assuming knowledge of the noise distribution. We derive consistency rates and asymptotic normality under smoothness assumptions. Furthermore, we identify a phase transition driven by the decay rate of the noise variance: the estimation problem transitions from being harder than the standard regression problem to being equally hard.

On the other hand, in the standard regression model with data available as pairs, we consider tree-based estimation using Random Forests and their variants. We establish a general consistency result, valid for a large class of tree-based algorithms, by generalizing a recently introduced sufficient impurity decrease condition. Furthermore, we investigate drawbacks of Random Forests in the case of pure interactions, that is, when there are interactions between two or more covariates without the corresponding main effects. In a large simulation study, we observe that Random Forests perform poorly in such settings, and we show that simple adaptations of the tree-building procedure, one of which is novel, improve performance in such settings without sacrificing predictive power in other scenarios. In addition, we employ our general theory to prove that these adaptations consistently estimate regression functions with pure interactions. Finally, we also observe differences on real data, with the interaction-specific methods outperforming Random Forests in some examples and performing comparably in others.
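
To make the uncoupled setting concrete, the following minimal Python sketch simulates responses whose pairing with the design points has been destroyed and recovers an increasing function by naive rank matching. It is an illustration only and not one of the two estimators studied in the thesis. The function name `rank_matching_estimate`, the regression function m(x) = x^3, the equispaced design and the Gaussian noise are all assumptions made for this example.

```python
import random


def rank_matching_estimate(design_points, unpaired_responses):
    """Pair the k-th smallest design point with the k-th smallest response.

    This naive rule exploits only monotonicity: for an increasing function
    and small noise, the ordering of the responses roughly follows the
    ordering of the design points.
    """
    xs = sorted(design_points)
    ys = sorted(unpaired_responses)
    return list(zip(xs, ys))


def max_error(n=200, noise_sd=0.01, seed=0):
    """Simulate uncoupled data and return the sup-norm error on the design."""
    rng = random.Random(seed)

    def m(x):
        # Hypothetical increasing regression function (assumption).
        return x ** 3

    xs = [i / n for i in range(1, n + 1)]            # equispaced design
    ys = [m(x) + rng.gauss(0.0, noise_sd) for x in xs]
    rng.shuffle(ys)                                   # destroy the pairing

    fitted = rank_matching_estimate(xs, ys)
    return max(abs(y_hat - m(x)) for x, y_hat in fitted)


if __name__ == "__main__":
    for sd in (0.1, 0.01, 0.0):
        print(f"noise sd = {sd}: max error = {max_error(noise_sd=sd):.4f}")
```

Because sorting cannot move a value by more than the largest perturbation, the error of this naive rule is bounded by the largest absolute noise term, which is consistent with the abstract's assumption that the noise variance decays to zero.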

Abstract (English translation of the German version)

This thesis investigates several aspects of nonparametric regression estimation. On the one hand, we consider the estimation of a monotonically increasing regression function under the restriction that it is unknown which observation of the dependent variable belongs to which design point. This distinguishes the problem from classical regression problems, in which observations come as pairs. Under the assumption that the variance of the error variables converges to zero, two estimators are studied that rely on distributional properties of the dependent variable, without assuming knowledge of the error distribution. Under smoothness assumptions, consistency rates and asymptotic normality are derived. In addition, the optimality of the rates is shown, and the estimation problem is compared with that in the corresponding classical regression model: it is harder than the classical problem when the error variance decays slowly, and equally hard when it decays faster. On the other hand, in the classical regression model we consider estimation by Random Forests and related tree-based machine learning methods. Consistency is shown for a large class of such tree-based methods, assuming that the regression function satisfies a generalized "sufficient impurity decrease" condition. Furthermore, estimation by Random Forests is studied in the case where the regression function exhibits certain pure interactions without accompanying main effects. An extensive simulation study shows that Random Forests deliver inadequate estimates in the presence of pure interaction terms. Modifications of the underlying tree algorithm remedy this, however, and their consistency is proved also in the case of pure interaction terms. Finally, the methods are applied to several data sets, where interaction-specific tree methods yield better results than Random Forests in some examples and similar results in the others.
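
The difficulty with pure interactions can be illustrated with a small Python sketch. It evaluates the variance-based impurity decrease of every single axis-aligned split, as a greedy CART-style tree would at its root, on a deterministic grid. This is not the simulation design of the thesis. The grid, the two target functions and the helper names (`best_split_decrease`, `pure_vs_additive`) are assumptions chosen for illustration.

```python
def variance(values):
    """Population variance, used here as the node impurity."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_split_decrease(rows, targets, feature):
    """Largest impurity decrease over all thresholds on one feature."""
    n = len(targets)
    parent = variance(targets)
    thresholds = sorted({row[feature] for row in rows})
    best = 0.0
    for t in thresholds[1:]:  # skipping the smallest value keeps both children non-empty
        left = [y for row, y in zip(rows, targets) if row[feature] < t]
        right = [y for row, y in zip(rows, targets) if row[feature] >= t]
        children = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, parent - children)
    return best


def sign(v):
    return 1.0 if v > 0 else -1.0


def pure_vs_additive():
    """Compare root-split gains for a pure interaction and an additive signal."""
    grid = [(2 * k - 9) / 10 for k in range(10)]      # -0.9, -0.7, ..., 0.9
    rows = [(a, b) for a in grid for b in grid]
    pure = [sign(a) * sign(b) for a, b in rows]       # interaction, no main effects
    additive = [sign(a) + sign(b) for a, b in rows]   # main effects only
    return {
        "pure": max(best_split_decrease(rows, pure, f) for f in (0, 1)),
        "additive": max(best_split_decrease(rows, additive, f) for f in (0, 1)),
    }


if __name__ == "__main__":
    print(pure_vs_additive())
```

On this symmetric grid, no single split reduces the impurity of the pure-interaction target, because the response averages to zero on either side of any threshold. The additive target admits a clearly beneficial first split. A greedy criterion therefore has nothing to guide its first split in the pure-interaction case, which is the kind of situation the interaction-specific adaptations are meant to address.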

Document type:	Dissertation
Supervisor:	Mammen, Prof. Dr. Enno
Place of Publication:	Heidelberg
Date of thesis defense:	26 March 2026
Date Deposited:	31 Mar 2026 11:06
Date:	2026
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Institut für Mathematik
DDC-classification:	500 Natural sciences and mathematics; 510 Mathematics