Automated Partitioning of CUDA Kernels for Multi-GPU Systems

Braun, Lorenz

German Title: Automatisierte Partitionierung von CUDA Kerneln für Multi-GPU Systeme

PDF, English - main document

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00033824
URN: urn:nbn:de:bsz:16-heidok-338244

Abstract

Supercomputers and powerful workstations with multiple GPUs have become the state of the art. GPUs are favored for their immense computational power, high memory bandwidth, and energy efficiency for highly parallel workloads. The translation of mathematical problems to multi-GPU compute kernels has already been solved in research by using domain-specific languages and libraries or clever analysis and transformation during compilation. Optimizing the partitioning of kernels, by contrast, has received very little attention in research. This work explores the viability of automated partitioning of GPU kernels. The problem is approached by modeling the compute graph of selected applications. The execution of the applications is simulated on a wide range of systems with different interconnects and GPUs. To cut down on simulation time, a simulator was developed for this specific use case. From the simulation results, simple but case-specific models were created, with which we show that application behavior can be predicted well. Results show that the automated partitioning is only 10.17% slower than the optimal partitioning.
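
To illustrate how a "slower than the optimal partitioning" figure can be read, the following minimal sketch compares the runtime of a model-chosen partitioning with the best runtime among all candidates. It rests on assumptions made only for illustration: a partitioning is identified simply by a GPU count, and all runtimes, the function name `slowdown_vs_optimal`, and the variable names are hypothetical and not taken from the dissertation.

```python
def slowdown_vs_optimal(simulated_times, chosen):
    """Relative slowdown of the chosen partitioning against the best one.

    simulated_times: dict mapping a partitioning (assumed here: a GPU count)
                     to its simulated runtime in seconds.
    chosen:          the partitioning selected by a predictive model.
    """
    best = min(simulated_times.values())
    return simulated_times[chosen] / best - 1.0


# Hypothetical simulated runtimes (seconds) for one application on 1, 2, 4, 8 GPUs.
simulated_times = {1: 8.0, 2: 4.4, 4: 2.5, 8: 2.0}

# Hypothetical model predictions, used only to select a partitioning.
predicted_times = {1: 8.2, 2: 4.3, 4: 2.1, 8: 2.3}

# The automated choice is the partitioning with the lowest predicted runtime.
chosen = min(predicted_times, key=predicted_times.get)
slowdown = slowdown_vs_optimal(simulated_times, chosen)

print(f"chosen partitioning: {chosen} GPUs, slowdown vs. optimal: {slowdown:.1%}")
```

In this toy case the model slightly mispredicts and selects a non-optimal GPU count, so the slowdown is nonzero. Averaging such slowdowns over many applications and systems would give an aggregate figure of the kind quoted in the abstract, although the exact aggregation used in the dissertation is not stated here.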

Analyzing and improving this problem for real applications depends on good information about the compute kernel. Thus, this work additionally considers the problem of obtaining such information. We use profiling of CUDA kernels to obtain information on instruction counts. An LLVM-based compiler extension providing instruction counts per kernel on PTX level is proposed and evaluated. This approach has the advantage that profiling is much faster compared to NVIDIA's profiler nvprof. The average overhead could be improved by a factor of 10 to 13.2 times the normal execution time.

The metrics of the new profiling approach are used to develop a methodology for kernel performance prediction. Because the metrics of the profiler are GPU-independent, they only need to be measured once, which is a great advantage. 168 kernels from benchmark suites such as Parboil, Rodinia, Polybench-GPU and SHOC are evaluated on five GPUs. The models are based on random forests and are built for execution time and power consumption prediction. Model prediction performance is evaluated using cross-validation, and the results show that the median average percentage error ranges from 8.86% to 52% for time prediction and from 1.84% to 2.94% for power prediction.
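
As a reading aid for the error figures above, the sketch below computes an absolute percentage error per kernel and then takes the median across kernels. This is an assumption about how such a metric is commonly computed, not a statement of the dissertation's exact procedure. The function name and all measured and predicted values are hypothetical.

```python
from statistics import median


def absolute_percentage_errors(measured, predicted):
    """Per-kernel absolute percentage error between measurement and prediction."""
    return [abs(p - m) / m * 100.0 for m, p in zip(measured, predicted)]


# Hypothetical measured and predicted kernel execution times in milliseconds.
measured_ms = [1.0, 2.0, 4.0, 10.0]
predicted_ms = [1.1, 1.8, 4.0, 13.0]

errors = absolute_percentage_errors(measured_ms, predicted_ms)

# The median is robust against a few badly predicted kernels (here: the last one).
median_error = median(errors)
print(f"median absolute percentage error: {median_error:.2f}%")
```

Using the median rather than the mean keeps a single outlier kernel from dominating the summary, which matters when kernel runtimes span several orders of magnitude.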

Translation of abstract (German)

Supercomputer und leistungsstarke Workstations mit mehreren Grafikkarten sind mittlerweile Stand der Technik. GPUs werden wegen ihrer immensen Rechenleistung, hohen Speicherbandbreite und Energieeffizienz für hochgradig parallele Berechnungen bevorzugt. Die Übersetzung mathematischer Probleme in Multi-GPU Kernel wurde in der Forschung bereits durch die Verwendung domänenspezifischer Sprachen und Bibliotheken oder durch geschickte Analyse und Transformation während der Kompilierung gelöst. Dem Prozess der Optimierung der Partitionierung von Kernels wurde in der Forschung bisher nur wenig Aufmerksamkeit geschenkt. In dieser Arbeit wird die Machbarkeit einer automatischen Partitionierung von GPU-Kerneln untersucht. Das Problem wird durch Modellierung des Berechnungsgraphen ausgewählter Anwendungen untersucht. Die Ausführung der Anwendungen auf einer breiten Auswahl von Systemen mit unterschiedlichen Interconnects und GPUs wird simuliert. Um die Simulationszeit zu reduzieren, wurde für diesen spezifischen Use-Case ein Simulator entwickelt. Mit den Ergebnissen der Simulation, wurden einfache, aber individuelle Modelle erzeugt, mit denen wir zeigen können, dass das Anwendungsverhalten gut vorhergesagt werden kann. Die Ergebnisse zeigen, dass die automatische Partitionierung nur 10,17% langsamer ist als die optimale Partitionierung.

Die Analyse und Verbesserung dieses Problems für reale Anwendungen hängt von guten Informationen über den Kernel ab. Daher befasst sich diese Arbeit zusätzlich mit dem Problem der Beschaffung solcher Informationen. Wir verwenden Profiling von CUDA-Kernel, um Informationen über die Anzahl der ausgeführten Instruktionen zu erhalten. Es wird eine LLVM-basierte Compiler-Erweiterung präsentiert und evaluiert, die die Anzahl der Instruktionen pro Kernel auf PTX-Ebene ermittelt. Dieser Ansatz hat den Vorteil, dass das Profiling im Vergleich zu NVIDIAs Profiler nvprof viel schneller ist. Der durchschnittliche Overhead konnte um den Faktor 10 bis 13,2 gegenüber der normalen Ausführungszeit verbessert werden.

Die Metriken des neuen Profiling-Ansatzes werden verwendet, um eine Methodik für die Vorhersage der Kernel-Performance zu entwickeln. Da die Metriken des Profilers GPU-unabhängig sind, müssen sie nur einmal gemessen werden, was ein großer Vorteil ist. 168 Kernel aus den Benchmark-Suiten wie Parboil, Rodinia, Polybench-GPU und SHOC werden auf fünf GPUs ausgewertet. Die Modelle basieren auf Random Forests und wurden für die Vorhersage der Ausführungszeit und des Stromverbrauchs entwickelt. Die Ergebnisse zeigen, dass der durchschnittliche prozentuale Fehler bei der Zeitvorhersage zwischen 8,86 und 52% und bei der Stromverbrauchsvorhersage zwischen 1,84 und 2,94% liegt.

Document type:	Dissertation
Supervisor:	Fröning, Prof. Dr. Holger
Place of Publication:	Heidelberg
Date of thesis defense:	4 September 2024
Date Deposited:	12 Sep 2024 15:12
Date:	2023
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Department of Computer Science
Controlled Keywords:	Hochleistungsrechnen, Modellierung, Maschinelles Lernen
Uncontrolled Keywords:	Multi-GPU Computing