%0 Generic
%A Schindler, Günther
%C Heidelberg
%D 2021
%F heidok:30166
%R 10.11588/heidok.00030166
%T Compressing and Mapping Deep Neural Networks on Edge Computing Systems
%U https://archiv.ub.uni-heidelberg.de/volltextserver/30166/
%X Deep neural networks (DNNs) are a key technology today and the main driving factor behind many recent advances in Artificial Intelligence (AI) applications, including computer vision, natural language processing and speech recognition. DNNs fit training data extremely well while also generalizing well to unseen data, which is especially effective when large amounts of data and ample hardware resources are available. These hardware requirements in terms of computation and memory are the limiting factor for their deployment on edge computing systems, such as handheld or head-worn devices. Enabling DNNs to be deployed on edge devices is one of the key challenges towards the next generation of AI applications, including augmented reality and enhanced interaction between humans and computers. Three major research directions have to be considered jointly for effective deployment: efficient model design, high-performance hardware and cooperating software frameworks. This work studies these research directions from a holistic point of view and carefully considers the impact of each direction on the others, in order to develop techniques that improve the overall deployment. First, efficient model design through compression in the form of quantization is studied, to reduce the required data representation from single-precision floating point to low-bit formats. Several quantization techniques are evaluated and a library is introduced that enables arbitrary bit combinations on Central Processing Units (CPUs). The potential and implications of mapping quantized DNNs are studied extensively on mobile CPUs as well as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The next part considers the limitations of quantized DNNs and proposes a compression/algorithmic co-design, targeting fast deployment on mobile CPUs while achieving high prediction quality. The proposed compression algorithm is based on an adaptive quantization function that additionally induces sparsity into the DNN. A deployment algorithm is introduced that accelerates computation by exploiting the aggressively low and sparse data formats created by the compression technique. The final parts address the disadvantages of extreme forms of quantization and sparsity on GPUs and propose a framework for structured pruning, to enable compressed deployment on a large variety of massively parallel accelerators. Together with considerations of DNN design principles, a methodology is introduced that targets efficient deployment on virtually any modern hardware/software stack for DNNs. Several design principles for DNNs are discovered using this methodology, enabling the design of more efficient models without explicit compression.