Abstract
This thesis explores how techniques from classical, deterministic, nonlinear optimization can be adapted to the challenges posed by the optimization problems that arise in neural network training. Such problems can be formulated as large-scale, stochastic, and non-convex parameter estimation problems. Three main contributions are presented.

First, we explore a multilevel optimization technique for stochastic optimization problems inspired by Multigrid Optimization (MG/Opt). Stochastic MG/Opt variants are proposed and analyzed, starting with a method tailored to strongly convex quadratic objective functions and continuing with a version suited to non-convex objectives. For strongly convex quadratic problems, transfer operators suited to the Stochastic Gradient Descent (SGD) method are proposed and analyzed theoretically. For non-convex problems, novel convergence results are provided for stochastic bi-level formulations using SGD with both fixed and diminishing step sizes. This analysis highlights both the potential benefits and the limitations of the approach. Practical applications of multilevel structures in neural networks are also discussed; however, not all hierarchical constructions yield practical improvements, and multigrid optimization in highly non-convex, stochastic settings remains only partially understood.

Second, Sensitivity-based Layer Insertion (SensLI), a procedure for adaptively inserting layers during training, is introduced. SensLI formulates network expansion as a constrained sensitivity analysis problem and yields a simple, general selection criterion for promising insertion positions. Numerical experiments on several architectures demonstrate that SensLI can efficiently increase model capacity during training. A theoretical comparison of the computational cost of the layer selection step with that of other adaptive methods demonstrates its efficiency. Additionally, the method is extended to layer widening.

Third, we propose a layer-wise preconditioning framework based on Frobenius-type inner products on the spaces of linear maps. The framework accommodates both predefined inner products based on prior knowledge of the layer spaces and data-driven inner products that adapt during training. We present a covariance-driven construction of the inner products on the layer spaces and the resulting preconditioner, which equips the layer spaces with non-Euclidean structures reflecting the distribution of the layer data. This preconditioner is closely related to the Kronecker-Factored Approximate Curvature (K-FAC) method. Numerical studies critically assess the empirical benefits and limitations of this covariance-based preconditioner in practice.

Overall, the results demonstrate the value of classical optimization ideas for training neural networks, provided they are carefully tailored to the stochastic, high-dimensional, and non-convex nature of modern models. The thesis concludes with recommendations for future research, including a tighter integration of theory and experiments and a deeper understanding of the circumstances under which specific hierarchies, preconditioners, or insertion rules improve training.
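As an illustration of the covariance-driven, K-FAC-like layer-wise preconditioning named in the abstract, the following minimal NumPy sketch preconditions the weight gradient of a single fully connected layer with damped Kronecker factors built from the mini-batch input covariance and output-gradient covariance. The function and variable names, the damping value, and the random data are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def kfac_style_precondition(a, g_out, grad_W, damping=1e-3):
    """Precondition the weight gradient of one fully connected layer.

    a       : (batch, n_in)  layer inputs (activations of the previous layer)
    g_out   : (batch, n_out) gradients of the loss w.r.t. the layer outputs
    grad_W  : (n_in, n_out)  gradient of the loss w.r.t. the weight matrix
    damping : Tikhonov damping added to both covariance factors

    Returns A^{-1} grad_W G^{-1}, where A = E[a a^T] and G = E[g g^T]
    are mini-batch covariance estimates (illustrative, K-FAC-style factors).
    """
    batch = a.shape[0]
    A = a.T @ a / batch                    # input covariance, (n_in, n_in)
    G = g_out.T @ g_out / batch            # output-gradient covariance, (n_out, n_out)
    A += damping * np.eye(A.shape[0])      # damping keeps the factors invertible
    G += damping * np.eye(G.shape[0])
    left = np.linalg.solve(A, grad_W)      # apply A^{-1} on the left
    return np.linalg.solve(G.T, left.T).T  # apply G^{-1} on the right

# Tiny usage example with random data standing in for one training step.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 20))          # batch of 32 inputs, 20 features
g_out = rng.standard_normal((32, 10))      # gradients w.r.t. 10 outputs
grad_W = a.T @ g_out / 32                  # mini-batch weight gradient
step = kfac_style_precondition(a, g_out, grad_W)
print(step.shape)                          # (20, 10)
```

The two linear solves avoid forming explicit inverses; in a training loop the preconditioned gradient `step` would replace `grad_W` in the parameter update.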
| Document type: | Dissertation |
|---|---|
| Supervisor: | Herzog, Prof. Dr. Roland |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 20 March 2026 |
| Date Deposited: | 25 Mar 2026 14:56 |
| Date: | 2026 |
| Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science; Service facilities > Interdisciplinary Center for Scientific Computing |
| DDC-classification: | 510 Mathematics |







