PDF, English
Abstract
The training of modern Large Language Models (LLMs) requires distributed computing across Graphics Processing Unit (GPU) clusters, where network communication efficiency critically affects performance and cost. Existing profiling tools provide either high-level metrics or low-level timing data, but lack the operation-level granularity needed to understand communication patterns during training runs. This thesis presents the NCCL Trace Profiler, a novel methodology for fine-grained analysis of network communication during distributed LLM training. The core contribution is a sequence alignment approach that correlates NVIDIA Collective Communications Library (NCCL) debug logs with NVIDIA Nsight Systems (Nsys) kernel traces, fusing semantic metadata with nanosecond-accurate timing without code instrumentation. The resulting Python tool works with standard profiling outputs, implements automated topology detection, and produces enriched traces for interactive analysis of communication behavior. Using the profiler, this thesis systematically characterizes communication patterns across Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP) in NVIDIA’s Megatron-LM framework. The analysis documents operation type distributions, message size characteristics, and per-rank communication volumes, providing the first detailed, per-operation view of how parallelization strategies manifest as network traffic signatures. Furthermore, the work validates theoretical communication volume models against observed measurements: while the DP model achieves excellent accuracy, the TP and EP models systematically underestimate the observed volume because they do not capture fine-grained synchronization and routing overhead.
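The alignment idea can be illustrated with a small sketch. It is not the thesis's implementation: the record types `NcclCall` and `KernelEvent`, their fields, and the function `align` are hypothetical; parsing the NCCL log and decoding Nsys kernel names are assumed to have happened already; and Python's standard `difflib.SequenceMatcher` stands in for whatever alignment algorithm the profiler actually uses. The sketch pairs the ordered collective calls from one rank's NCCL log with the ordered NCCL kernels from the same rank's Nsys trace, so that each matched pair carries the log's message size together with the trace's timing.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class NcclCall:
    """One collective from an NCCL debug log (hypothetical, simplified schema)."""
    op: str           # collective type, e.g. "AllReduce"
    count: int        # number of elements
    dtype_bytes: int  # size of one element in bytes


@dataclass
class KernelEvent:
    """One NCCL kernel from an Nsys trace (hypothetical, simplified schema)."""
    op: str        # collective type, assumed already decoded from the kernel name
    start_ns: int
    end_ns: int


def align(calls: list[NcclCall], kernels: list[KernelEvent]) -> list[dict]:
    """Fuse log metadata with kernel timing by aligning the op-type sequences.

    SequenceMatcher (a stand-in for the real alignment step) finds matching
    blocks in order, so an entry present in only one source is skipped
    instead of shifting every later pairing.
    """
    matcher = SequenceMatcher(
        a=[c.op for c in calls], b=[k.op for k in kernels], autojunk=False
    )
    fused = []
    # The final matching block always has size 0, so it adds nothing.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            call = calls[block.a + offset]
            kern = kernels[block.b + offset]
            fused.append({
                "op": call.op,
                "bytes": call.count * call.dtype_bytes,  # semantic data from the log
                "start_ns": kern.start_ns,               # timing from the trace
                "duration_ns": kern.end_ns - kern.start_ns,
            })
    return fused


if __name__ == "__main__":
    # Toy data: the AllGather appears in the log but not in the trace.
    calls = [
        NcclCall("AllReduce", 1024, 4),
        NcclCall("AllGather", 512, 2),
        NcclCall("ReduceScatter", 256, 4),
    ]
    kernels = [
        KernelEvent("AllReduce", 100, 250),
        KernelEvent("ReduceScatter", 400, 460),
    ]
    for row in align(calls, kernels):
        print(row)
```

Aligning rather than simply zipping the two sequences keeps the `ReduceScatter` paired with its own kernel even though the `AllGather` has no counterpart. Matching on operation type alone is a simplification; a real profiler would presumably need further keys, such as the communicator or stream, to disambiguate long runs of identical operations.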
| Document type: | Master's thesis |
|---|---|
| Supervisor: | Fröning, Prof. Dr. Holger |
| Place of Publication: | Heidelberg |
| Date of thesis defense: | 2026 |
| Date Deposited: | 30 Apr 2026 10:27 |
| Date: | 2026 |
| Faculties / Institutes: | Service facilities > Institut f. Technische Informatik (ZITI) |
| DDC-classification: | 004 Data processing Computer science |
| Collection: | Institute of Computer Engineering - Selected theses |