TY - GEN
UR - https://archiv.ub.uni-heidelberg.de/volltextserver/37230/
ID - heidok37230
CY - Heidelberg
Y1 - 2025///
TI - Reducing Global Memory Accesses in DNN Training using Structured Weight Masking
A1 - Bespalov, Sergej
N2 - Training large deep neural networks (DNNs) is often constrained by memory bandwidth, with frequent global memory accesses representing a significant performance bottleneck. This thesis investigates dynamic structured weight masking as a means of alleviating this bottleneck during training, focusing on the ResMLP architecture, a feedforward network composed exclusively of multi-layer perceptrons. A novel framework implementing block-wise masking based on L2-norm magnitude and top-k selection was developed and evaluated on the CIFAR-10 dataset. The study systematically varied block sizes and sparsity ratios and analyzed the impact on classification accuracy, theoretical computational cost (FLOPs), and theoretical memory movement. Results indicate that model accuracy remains robust up to approximately 50% sparsity when the mask is also applied during the backward pass; beyond this threshold, classification accuracy degrades. Notably, larger blocks improve computational efficiency when the backward pass is masked because they yield hardware-friendly memory access patterns, whereas with an unmasked backward pass smaller blocks preserve accuracy better. A key observation is the discrepancy between the substantial reduction in computationally active weights and the limited decrease in estimated memory movement, suggesting that tangible memory savings require hardware-aware implementations that bypass unnecessary data loads. Theoretical FLOPs decrease linearly with increasing sparsity, confirming the potential for computational efficiency gains. Overall, this work contributes an empirical analysis of dynamic structured weight masking in MLP-based architectures, offering insights into the trade-offs between mask ratio, block granularity, and training stability. The findings underscore the importance of co-designing masking patterns to improve both computational cost and memory access while maintaining training stability, and they provide practical guidelines for the efficient training of DNNs on systems with limited memory or computational resources.
AV - public
ER -
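
For readers of the record above, a minimal sketch of the block-wise masking scheme the abstract describes (L2-norm scoring of weight blocks, top-k selection) follows. This is an illustrative PyTorch example, not the thesis code; the function name block_topk_mask, the block size, and the keep ratio are assumptions chosen for the example, not values from the thesis.

# Illustrative sketch (not the thesis implementation) of block-wise weight
# masking by L2-norm magnitude with top-k selection, as described in the
# abstract above. Block size and keep ratio are example values.
import torch

def block_topk_mask(weight: torch.Tensor, block: int, keep_ratio: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the highest-L2-norm (block x block) tiles."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "example assumes divisible shapes"
    # View the matrix as a grid of (block x block) tiles and score each tile by its L2 norm.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.pow(2).sum(dim=(1, 3)).sqrt()          # shape: (rows/block, cols/block)
    # Keep the k highest-scoring tiles, zero out the rest.
    k = max(1, int(round(keep_ratio * scores.numel())))
    threshold = scores.flatten().topk(k).values.min()
    block_mask = (scores >= threshold).to(weight.dtype)
    # Expand the block-level mask back to element level.
    return block_mask.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

# Example: mask a weight matrix at roughly 50% sparsity before the forward pass;
# per the abstract, the same mask can also be applied in the backward pass.
w = torch.randn(64, 64)
mask = block_topk_mask(w, block=8, keep_ratio=0.5)
w_masked = w * mask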