Discovering structure without labels

Damrich, Sebastian

Preview

PDF, English - main document
Download (72MB) | Terms of use

Citation of documents: Please do not cite the URL that is displayed in your browser location input, instead use the DOI, URN or the persistent URL below, as we can guarantee their long-time accessibility.

DOI: 10.11588/heidok.00033875
URN: urn:nbn:de:bsz:16-heidok-338754
URL: http://www.ub.uni-heidelberg.de/archiv/33875

Abstract

The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction.

The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space.

The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success.

Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global to more discrete and local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both of them with few noise samples.

Translation of abstract (German)

Die Menge an verfügbaren Daten steigt unaufhörlich, aber annotierte Daten bleiben weiterhin selten. Das macht Methoden attraktiv, welche die Struktur von Daten ohne Trainingsbeispiele lernen können. Diese Arbeit leistet Beiträge zu zwei solchen Bereichen: Clusteranalyse und Dimensionsreduktion.

Die Arbeit besteht aus zwei Teilen. Im ersten Teil führen wir die Methode Mod Shift ein. Sie gruppiert Punkte im euklidischen Raum basierend auf deren Distanz. Nahe Punkte ziehen sich an und entfernte Punkte stoßen sich ab, bis sich klare Gruppen bilden. Die Balance beider Kräfte bestimmt die Anzahl der gefundenen Cluster. Wie Mean Shift bewegt auch Mod Shift die Punkte schrittweise hin zu kompakten Gruppen. Seine Zielfunktion ist jedoch vom Multicut Problem inspiriert, sodass wir eine Verbindung zwischen Graphpartitionierung und Punktgruppierung herstellen.

Der zweite Teil behandelt Dimensionsreduktion mit den populären Nachbareinbettungsmethoden UMAP und t-SNE. Wir zeigen auf, dass sich die Zielfunktion, die von UMAPs Implementierung tatsächlich optimiert wird, deutlich von der bisher akzeptierten unterscheidet. Diese Diskrepanz erklärt einige Artefakte in UMAP-Einbettungen, wie die Tendenz besonders für große Datensätze Einbettungen mit linienartigen Bereichen zu produzieren. Im Gegesatz zum bisherigen Verständnis von UMAP ist dessen Ähnlichkeitsmaß zwischen Punkten im hochdimensionalen Raum nicht maßgebend für die Einbettungsqualität.

Basierend auf UMAPs tatsächlicher Zielfunktion beschreiben wir die genaue Verbindung zur Visualisierungsmethode t-SNE. Der Kernpunkt ist unser neues Verständnis wie sich die beiden kontrastiven Lernmethoden negative sampling, verwendet von UMAP, und noise-contrastive estimation, mit der t-SNE approximiert werden kann, zu einander verhalten. Auf Basis dieser Verbindung finden wir heraus, dass UMAP mehr Anziehung auf benachbarte Datenpunkte ausübt als t-SNE und deswegen kompaktere Einbettungen produziert. Andere Attraktionsniveaus sind ebenso möglich und führen zu einem Kontinuum von Nachbareinbettungsmethoden, das UMAP- und t-SNE-artige Verfahren als Spezialfälle umfasst. Höhere Anziehung stellt die kontinuierlichen und globalen Aspekte eines Datensatzen besser heraus, während stärkere Repulsion die lokale Struktur von diskreten Clustern genauer darstellt. Abschließend weisen wir auf die Verbindung von kontrastiven Nachbareinbettungsmethoden und kontrastivem selbstüberwachtem Lernen hin und zeigen, dass beide mit den selben Zielfunktionen und nur wenigen Kontrastbeispielen optimiert werden können.

Document type:	Dissertation
Supervisor:	Hamprecht, Dr. Fred
Date of thesis defense:	7 December 2022
Date Deposited:	11 Oct 2023 10:42
Date:	2023
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science The Faculty of Mathematics and Computer Science > Department of Computer Science
DDC-classification:	004 Data processing Computer science 500 Natural sciences and mathematics 510 Mathematics