Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems

Matz, Alexander

Preview

PDF, English - main document
Download (6MB) | Lizenz: ['licenses_description_cc_0' not defined

title: Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems
authors: Matz, Alexander

]

Citation of documents: Please do not cite the URL that is displayed in your browser location input, instead use the DOI, URN or the persistent URL below, as we can guarantee their long-time accessibility.

DOI: 10.11588/heidok.00029213
URN: urn:nbn:de:bsz:16-heidok-292133
URL: http://www.ub.uni-heidelberg.de/archiv/29213

Abstract

Graphics Processing Units (GPUs) are accelerators for computers and provide massive amounts of computational power and bandwidth for amenable applications. While effectively utilizing an individual GPU already requires a high level of skill, effectively utilizing multiple GPUs introduces completely new types of challenges. This work sets out to investigate how the hierarchical execution model of GPUs can be exploited to simplify the utilization of such multi-GPU systems. The investigation starts with an analysis of the memory access patterns exhibited by applications from common GPU benchmark suites. Memory access patterns are collected using custom instrumentation and a simple simulation then analyzes the patterns and identifies implicit communication across the different levels of the execution hierarchy. The analysis reveals that for most GPU applications memory accesses are highly localized and there exists a way to partition the workload so that the communication volume grows slower than the aggregated bandwidth for growing numbers of GPUs. Next, an application model based on Z-polyhedra is derived that formalizes the distribution of work across multiple GPUs and allows the identification of data dependencies. The model is then used to implement a prototype compiler that consumes single-GPU programs and produces executables that distribute GPU workloads across all available GPUs in a system. It uses static analysis to identify memory access patterns and polyhedral code generation in combination with a dynamic tracking system to efficiently resolve data dependencies. The prototype is implemented as an extension to the LLVM/Clang compiler and published in full source. The prototype compiler is then evaluated using a set of benchmark applications. While the prototype is limited in its applicability by technical issues, it provides impressive speedups of up to 12.4x on 16 GPUs for amenable applications. An in-depth analysis of the application runtime reveals that dependency resolution takes up less than 10% of the runtime, often significantly less. A discussion follows and puts the work into context by presenting and differentiating related work, reflecting critically on the work itself and an outlook of the aspects that could be explored as part of this research. The work concludes with a summary and a closing opinion.

Document type:	Dissertation
Supervisor:	Fröning, Prof. Dr. Holger
Place of Publication:	Heidelberg
Date of thesis defense:	20 November 2020
Date Deposited:	15 Dec 2020 13:27
Date:	2020
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Department of Computer Science
DDC-classification:	004 Data processing Computer science
Controlled Keywords:	Grafikprozessor, Compilier, BSP <Computerarchitektur>