TY - GEN
N2 - Supercomputers and powerful workstations with multiple GPUs have become the state of the art. GPUs are favored for their immense computational power, high memory bandwidth, and energy efficiency for highly parallel workloads. The translation of mathematical problems into multi-GPU compute kernels has already been addressed in research by using domain-specific languages and libraries or by clever analysis and transformation during compilation. The optimization of kernel partitioning, however, has received very little attention in research. This work explores the viability of automated partitioning of GPU kernels. The problem is approached by modeling the compute graphs of selected applications. The execution of the applications is simulated on a wide range of systems with different interconnects and GPUs. To cut down on simulation time, a simulator was developed for this specific use case. From the simulation results, simple but per-case models were created, with which we show that application behavior can be predicted well. Results show that the automated partitioning is only 10.17% slower than the optimal partitioning. Analyzing and improving this problem for real applications depends on good information about the compute kernel; thus, this work additionally considers the problem of obtaining such information. We use profiling of CUDA kernels to obtain information on instruction counts. An LLVM-based compiler extension providing instruction counts per kernel at the PTX level is proposed and evaluated. This approach has the advantage that profiling is much faster than NVIDIA's profiler nvprof: the average overhead is reduced by a factor of 10, to 13.2 times the normal execution time. The metrics of the new profiling approach are used to develop a methodology for kernel performance prediction. Because the profiler's metrics are GPU-independent, they only need to be measured once, which is a great advantage. 168 kernels from the benchmark suites Parboil, Rodinia, Polybench-GPU, and SHOC are evaluated on five GPUs. The models are based on random forests and are built for execution time and power consumption prediction. The model prediction performance is evaluated using cross-validation; the results show that the median average percentage error ranges from 8.86% to 52% for time prediction and from 1.84% to 2.94% for power prediction.
A1 - Braun, Lorenz
KW - Multi-GPU Computing
AV - public
TI - Automated Partitioning of CUDA Kernels for Multi-GPU Systems
Y1 - 2023///
ID - heidok33824
CY - Heidelberg
UR - https://archiv.ub.uni-heidelberg.de/volltextserver/33824/
ER -