Code Generation for High Performance PDE Solvers on Modern Architectures

Kempf, Dominic

German Title: Codegenerierung für hochperformante PDE-Löser auf modernen Architekturen.

[thumbnail of dominickempf_dissertation.pdf]

PDF, English - main document
Download (1MB) | Terms of use

Citation of documents: Please do not cite the URL that is displayed in your browser location input, instead use the DOI, URN or the persistent URL below, as we can guarantee their long-time accessibility.

DOI: 10.11588/heidok.00027360
URN: urn:nbn:de:bsz:16-heidok-273605

Abstract

Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance. This is both due to a lack of numerical algorithms suited for the hardware and efficient implementations of these algorithms not being available.

Modern CPUs require a sufficiently high arithmetic intensity in order to unfold their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: The Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm.

An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With the SIMD width increasing in the last years (reaching its current peak at 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance portability issue. This work proposes generative programming as a solution to this issue.

To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs. Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure.

We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating point performance for an application of the DG operator.

Translation of abstract (German)

Die numerische Simulation mit partiellen Differentialgleichungen ist eine wichtige Teildisziplin des Höchstleistungsrechnens. Ihre Anwendungsgebiete umfassen beispielsweise Geowissenschaften, Fluiddynamik, Festkörpermechanik oder Elektromagnetismus. Durch die Entwicklungen der letzten Jahre im Hardwaresektor ist es zunehmend schwer geworden, sehr gute Performance zu erzielen. Gründe hierfür sind sowohl ein Mangel an numerischen Algorithmen, die gut für die Hardware geeignet sind, als auch ein Mangel an effizienten Implementierungen dieser Algorithmen.

In dieser Arbeit verwenden wir ein numerisches Verfahren, welches effizient auf moderner Hardware implementiert werden kann: Das unstetige Galerkinverfahren (Discontinuous Galerkin Finite Element Method) kann auf Hexaedergittern, unter Ausnutzung der Tensorproduktstruktur der Basisfunktionen und Quadraturformeln, mit optimaler Komplexität implementiert werden. Dies wird als Summenfaktorisierung bezeichnet. Die selben Algorithmen können verwendet werden um die Anwendung eines Operators zu implementieren, der in herkömmlichen Finite Elemente Methoden in eine Datenstruktur für dünnbesetzte Matrizen assembliert wird. Im Gegensatz zu assemblierten Matrizen erlaubt dies einen Algorithmus, dessen Performance durch die Rechenleistung des Prozessors und nicht durch seine Speicherbandbreite limitiert ist.

Effiziente Implementierungen dieses Verfahrens auf modernen CPUs müssen für eine bestmögliche Auslastung der SIMD-Einheiten des Prozessors sorgen. Da Standardcompiler für PDE-Probleme keinen zufriedenstellend vektorisierten Code generieren, muss SIMD Vektorisierung explizit im Quellcode vorgenommen werden. Die verfügbare SIMD Breite ist in den letzten Jahren stetig angestiegen (bis hin zu einer Breite von 512 bits in der Intel Skylake Architektur). Da Programmiersprachen jedoch kaum Sprachmittel für explizite SIMD Vektorisierung zur Verfügung stellen, ist es schwierig dies auf hardwareunabhängige Weise zu tun. Diese Arbeit schlägt generative Programmierung als Lösungsansatz für dieses Performanceportabilitätsproblem vor.

Zu diesem Zweck wird im Rahmen dieser Arbeit eine Toolchain entwickelt, welche ein in einer domänenspezifischen Sprache beschriebenes PDE Problem in hardware-spezifischen, optimierten C++ Code übersetzt. Diese Toolchain ist in den Userworkflow des DUNE-Projekts eingebettet, einem quelloffenen C++ Framework zur numerischen Lösung partieller Differentialgleichungen. Hierbei liegt das Hauptaugenmerk auf der Verwendung einer Zwischenrepräsentation, welche performanceorientierte Transformationen erlaubt. Desweiteren führt diese Arbeit eine neue Klasse von SIMD Vektorisierungsstrategien ein, welche Batches von Unterkerneln innerhalb eines Integrationskernels zusammenfasst. Der Codegenerator traversiert die Menge dieser Vektorisierungsstrategien systematisch im Rahmen eines Autotuning-Prozesses.

Die Performance unserer Vektorisierungsstrategien und ihrer Implementierung wird durch Messungen auf den Intel Haswell und Intel Skylake Architekturen belegt. Dabei werden die Diffusions-Reaktions-Gleichung, die Stokes-Gleichungen sowie Maxwell’s Gleichungen als Beispiele herangezogen. Für die Anwendung eines DG-Operators erzielen wir eine Maschinenauslastung von bis zu 40%.

Document type:	Dissertation
Supervisor:	Bastian, Prof. Dr. Peter
Place of Publication:	Heidelberg
Date of thesis defense:	25 October 2019
Date Deposited:	14 Nov 2019 09:38
Date:	2019
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science The Faculty of Mathematics and Computer Science > Department of Computer Science Service facilities > Interdisciplinary Center for Scientific Computing
DDC-classification:	004 Data processing Computer science 510 Mathematics