Abstract
This thesis explores how techniques from classical, deterministic, nonlinear optimization can be adapted to the challenges posed by the optimization problems that arise in neural network training. Such problems can be formulated as large-scale, stochastic, and non-convex parameter estimation problems. Three main contributions are presented.

First, we explore a multilevel optimization technique for stochastic optimization problems inspired by Multigrid Optimization (MG/Opt). Stochastic MG/Opt variants are proposed and analyzed, starting with a method tailored to strongly convex quadratic objective functions and continuing with a version suited to non-convex objectives. For strongly convex quadratic problems, transfer operators suited to the Stochastic Gradient Descent (SGD) method are proposed and analyzed theoretically. For non-convex problems, novel convergence results are provided for stochastic bi-level formulations using SGD with both fixed and diminishing step sizes. This analysis highlights both the potential benefits and the limitations of the approach. Practical applications of multilevel structures in neural networks are also discussed; however, not all hierarchical constructions yield practical improvements, and multigrid optimization in highly non-convex, stochastic settings remains only partially understood.

Second, Sensitivity-based Layer Insertion (SensLI), a procedure for adaptively inserting layers during training, is introduced. SensLI formulates network expansion as a constrained sensitivity analysis problem and yields a simple, general selection criterion for promising insertion positions. Numerical experiments on several architectures demonstrate that SensLI can efficiently increase model capacity during training. A theoretical comparison of the computational cost of the layer selection step with that of other adaptive methods demonstrates its efficiency. Additionally, the method is extended to layer widening.

Third, we propose a layer-wise preconditioning framework based on Frobenius-type inner products on the spaces of linear maps. The framework accommodates both predefined inner products based on prior knowledge of the layer spaces and data-driven inner products that adapt during training. We present a covariance-driven construction of the inner products on the layer spaces and the resulting preconditioner, which equips the layer spaces with non-Euclidean structures reflecting the distribution of the layer data. This preconditioner is closely related to the Kronecker-Factored Approximate Curvature (K-FAC) method. Numerical studies critically assess the empirical benefits and limitations of this covariance-based preconditioner in practice.

Overall, the results demonstrate the value of classical optimization ideas for training neural networks, provided they are carefully tailored to the stochastic, high-dimensional, and non-convex nature of modern models. The thesis concludes with recommendations for future research, including a tighter integration of theory and experiments and a deeper understanding of the circumstances under which specific hierarchies, preconditioners, or insertion rules improve training.
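As an illustration of the covariance-driven, K-FAC-like layer-wise preconditioning named in the abstract, the following minimal NumPy sketch preconditions the weight gradient of a single fully connected layer with damped Kronecker factors built from the mini-batch input covariance and output-gradient covariance. The function and variable names, the damping value, and the random data are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def kfac_style_precondition(a, g_out, grad_W, damping=1e-3):
    """Precondition the weight gradient of one fully connected layer.

    a       : (batch, n_in)  layer inputs (activations of the previous layer)
    g_out   : (batch, n_out) gradients of the loss w.r.t. the layer outputs
    grad_W  : (n_in, n_out)  gradient of the loss w.r.t. the weight matrix
    damping : Tikhonov damping added to both covariance factors

    Returns A^{-1} grad_W G^{-1}, where A = E[a a^T] and G = E[g g^T]
    are mini-batch covariance estimates (illustrative, K-FAC-style factors).
    """
    batch = a.shape[0]
    A = a.T @ a / batch                    # input covariance, (n_in, n_in)
    G = g_out.T @ g_out / batch            # output-gradient covariance, (n_out, n_out)
    A += damping * np.eye(A.shape[0])      # damping keeps the factors invertible
    G += damping * np.eye(G.shape[0])
    left = np.linalg.solve(A, grad_W)      # apply A^{-1} on the left
    return np.linalg.solve(G.T, left.T).T  # apply G^{-1} on the right

# Tiny usage example with random data standing in for one training step.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 20))          # batch of 32 inputs, 20 features
g_out = rng.standard_normal((32, 10))      # gradients w.r.t. 10 outputs
grad_W = a.T @ g_out / 32                  # mini-batch weight gradient
step = kfac_style_precondition(a, g_out, grad_W)
print(step.shape)                          # (20, 10)
```

The two linear solves avoid forming explicit inverses; in a training loop the preconditioned gradient `step` would replace `grad_W` in the parameter update.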
| Document type: | Dissertation |
|---|---|
| Supervisor: | Herzog, Prof. Dr. Roland |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 20 March 2026 |
| Date Deposited: | 25 Mar 2026 14:56 |
| Date: | 2026 |
| Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science; Service facilities > Interdisciplinary Center for Scientific Computing |
| DDC-classification: | 510 Mathematics |







