Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network

Schmidtobreick, Anke Mareike

[thumbnail of 2017-11-20_thesis-Schmidtobreick_publication.pdf]

Preview

PDF, English
Download (2MB) | Terms of use

Citation of documents: Please do not cite the URL that is displayed in your browser location input, instead use the DOI, URN or the persistent URL below, as we can guarantee their long-time accessibility.

DOI: 10.11588/heidok.00023737
URN: urn:nbn:de:bsz:16-heidok-237375

Abstract

Machine learning is an approach to devise algorithms that compute an output without a given rule set but based on a self-learning concept. This approach is of great importance for several fields of applications in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, commonly previous experience is used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network more complex problems can be solved which again rely on a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization. Several parallelization approaches of neural networks have already been considered. Most approaches use special purpose hardware whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, largescale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur for a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of "Partitioned Global Address Spaces" (PGAS) models is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied for the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub blocks of the resulting matrix and computing the sub blocks in subgroups of the processes. The sub blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementations provide a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working at what point in time with which data in order to prevent an unintentional dual use of data. The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub blocks of the matrices exceeds 382. The performance achieved in the test runs did not withstand the expectations the theoretical analysis predicted. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072. The implementation using the GASPI API was found not be straightforward but to provide a good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems to be very promising, especially as nowadays, with the larger amount of data and the increased complexity of the applications, the approaches to parallelization of neural networks are increasingly of interest.

Translation of abstract (German)

Ein maschinelles Lernen Modell ist ein künstliches System, das ohne einen vorgegebenen Regelsatz basierend auf vorherigen Erfahrungen eigenständig lernt, zu unbekannten Eingaben passende Lösungen zu produzieren. Dieser Ansatz ist für mehrere Einsatzgebiete von großer Bedeutung sowohl in der Wissenschaft als auch der Industrie, wenn traditionelle Programmiermethoden nicht ausreichen. In neuronalen Netzwerken, einer beliebten Unterklasse der maschinellen Lernalgorithmen, werden vorgegebene Trainingsdaten bestehend aus Ein- und Ausgabewerten verwendet, um das Netzwerk basierend auf der Differenz zwischen dem vorgegebenen Wert und dem Ausgabewert des Netzwerkes zu trainieren. Durch größere Netze können komplexere Probleme abgebildet werden, die wiederum auf das Vorhandensein einer großen Anzahl von Trainingsdaten angewiesen sind. Die Erhöhung der Komplexität führt so zu höherem Rechenbedarf und höheren Speicheranforderungen, weshalb eine Parallelisierung des Trainings sinnvoll scheint. Es wurden bereits mehrere Parallelisierungsansätze für neuronale Netze in Betracht gezogen. Viele Ansätze verwenden Spezialhardware, während sich andere Arbeiten auf die Verwendung von Standardhardware konzentrieren. In dieser Arbeit wird eine neue Parallelisierungsmethode namens poadSGD vorgestellt, welche sich an vollständig verbundene, sehr große Feedforward-Netzwerke richtet, welche auf einem Rechencluster mit Standardhardware ausgeführt werden. PoadSGD basiert auf dem stochastischen Gradientenverfahren und verwendet eine blockweise Verteilung der Netzwerkschichten auf Gruppen von Prozessen, sowie ein Pipeline-Schema, das jeweils für eine Serie von Trainingsdaten ausgeführt wird. Das Netzwerk wird asynchron aktualisiert, ohne die noch laufenden Berechnungen der nachfolgenden Serien zu unterbrechen. Hierfür wird ein einseitiges Kommunikationsschema verwendet. Ein wesentlicher Teil des Algorithmus sind Matrixmultiplikationen, wobei für einen Teil davon der Fall auftritt, dass die Matrizen auf jeweils unterschiedliche Prozessgruppen verteilt sind. In dieser Arbeit wird daher GASPI, ein paralleles Programmiermodell aus dem Bereich der Partitioned Global Address Spaces (PGAS) Modelle, vorgestellt und mit anderen Modellen dieser Klasse verglichen. Da es hauptsächlich auf einseitiger und asynchroner Kommunikation basiert, ist es ein perfekter Kandidat für die Umsetzung des asynchronen Updates im poadSGD-Algorithmus. Daher ist die Matrixmultiplikation auch auf der Basis von GASPI implementiert. Um eine gute Verteilung der Arbeitslast zu erreichen und möglichst effizient mit der Synchronisation, die durch die zusammenarbeitenden Prozessgruppen entsteht, umzugehen, wird für die Matrizen eine zweidimensionale blockzyklische Datenverteilung angewendet. Basierend auf dieser Verteilung wird der Algorithmus für die Matrix-Multiplikation durch eine diagonale Iteration über die Teilblöcke der Ergebnismatrix und die Berechnung dieser Teilblöcke in Untergruppen der Prozessgruppen umgesetzt. Die Rechenlast der Berechnung der Teilblöcke wird zwischen den Gruppen aufgeteilt und die Kommunikation findet hauptsächlich paarweise oder nur in Untergruppen statt. Die paarweise Kommunikation erfolgt so, dass sie von anderen laufenden Berechnungen überlappt wird. Die Implementierung des Algorithmus stellt eine besondere Herausforderung dar, da die asynchronen Kommunikationsroutinen mit Sorgfalt gehandhabt werden müssen. Es muss zu jedem Zeitpunkt klar sein, welcher Prozessor mit welchen Daten arbeitet oder wohin versendet, um eine unbeabsichtigte doppelte Nutzung von Daten zu verhindern. Die theoretische Analyse zeigt, dass die hier präsentierte Matrixmultiplikation einer naiven Implementierung überlegen ist, wenn die Dimension der Teilblöcke der Matrizen größer als 382 ist. Diese Erwartung hat die in den Testläufen erzielte Performance nicht erfüllt. Der Algorithmus wurde auf bis zu 512 Kerne für Matrizen bis zu einer Größe von 131.072 x 131.072 ausgeführt. Die Implementierung mit der GASPI-API wurde für nicht einfach befunden, wobei die API allerdings ein gutes Potenzial verspricht für die Umsetzung von Anwendungen ohne große Datenabhängigkeiten. Auch wenn die Performance nicht den Erwartung entsprach, so wurde die Matrixmultiplikation erfolgreich umgesetzt und kann für eine Implementierung der poadSGD-Methode verwendet werden. Der Ansatz der poadSGD-Methode ist sehr vielversprechend, zumal heutzutage mit den immer größeren Datensätzen und der erhöhten Komplexität der Anwendungen die Ansätze zur Parallelisierung von neuronalen Netzwerken zunehmend von Interesse sind.

Document type:	Dissertation
Supervisor:	Heuveline, Prof. Dr. Vincent
Date of thesis defense:	27 October 2017
Date Deposited:	23 Nov 2017 10:21
Date:	2017
Faculties / Institutes:	Service facilities > Interdisciplinary Center for Scientific Computing
DDC-classification:	004 Data processing Computer science 510 Mathematics