
Understanding LLM Communication

Borzyk, Leandro

PDF, English (4MB)


Abstract

The training of modern Large Language Models (LLMs) requires distributed computing across Graphics Processing Unit (GPU) clusters, where the efficiency of network communication critically impacts performance and cost. Existing profiling tools provide either high-level metrics or low-level timing data, but lack the operation-level granularity needed to understand communication patterns during training runs. This thesis presents the NCCL Trace Profiler, a novel methodology for fine-grained analysis of network communication during distributed LLM training. The core contribution is a sequence alignment approach that correlates NVIDIA Collective Communications Library (NCCL) debug logs with NVIDIA Nsight Systems (Nsys) kernel traces, fusing semantic metadata with nanosecond-accurate timing without code instrumentation. The resulting Python tool works with standard profiling outputs, implements automated topology detection, and produces enriched traces for interactive analysis of communication behavior. Using the profiler, this thesis conducts a systematic characterization of communication patterns across Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP) in NVIDIA's Megatron-LM framework. The analysis documents operation-type distributions, message-size characteristics, and per-rank communication volumes, providing the first detailed, per-operation view of how parallelization strategies manifest as network traffic signatures. Furthermore, the work validates theoretical communication-volume models against observed measurements, revealing that while the DP model achieves excellent accuracy, the TP and EP models systematically underestimate traffic due to fine-grained synchronization and routing overhead not captured in idealized models.
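The core idea of correlating the two profiling sources can be illustrated with a minimal sketch. The record layouts, kernel names, and field values below are illustrative assumptions, not the thesis's actual data formats: NCCL debug-log entries carry semantic metadata (operation type, element count) but coarse timing, while Nsys kernel events carry precise timestamps but only kernel names. Aligning the two sequences of operation types pairs each log record with its timed kernel launch, tolerating entries present in only one source.

```python
# Hedged sketch of log/trace sequence alignment. All records, kernel
# names, and timestamps below are made-up assumptions for illustration.
from difflib import SequenceMatcher

# Hypothetical NCCL debug-log records: (operation type, element count).
nccl_ops = [
    ("AllReduce", 1048576),
    ("AllGather", 262144),
    ("AllReduce", 1048576),
    ("ReduceScatter", 262144),
]

# Hypothetical Nsys kernel-trace records: (kernel name, start time in ns).
# Note one log record has no matching kernel event, and vice versa is
# possible too -- the alignment must tolerate such gaps.
kernel_events = [
    ("ncclDevKernel_AllReduce_Sum_f32", 1_000_000),
    ("ncclDevKernel_AllGather", 2_500_000),
    ("ncclDevKernel_ReduceScatter_Sum_f32", 4_000_000),
]

def op_of(kernel_name: str) -> str:
    # Map a kernel name to a collective operation type by substring match.
    for op in ("AllReduce", "AllGather", "ReduceScatter", "Broadcast"):
        if op in kernel_name:
            return op
    return "Unknown"

# Align the two operation-type sequences; each matching block pairs a
# semantic log record with a nanosecond-timed kernel event.
log_seq = [op for op, _ in nccl_ops]
trace_seq = [op_of(name) for name, _ in kernel_events]
matcher = SequenceMatcher(a=log_seq, b=trace_seq, autojunk=False)

enriched = []
for block in matcher.get_matching_blocks():
    for k in range(block.size):
        op, count = nccl_ops[block.a + k]
        name, t_ns = kernel_events[block.b + k]
        enriched.append({"op": op, "count": count,
                         "kernel": name, "start_ns": t_ns})

for rec in enriched:
    print(rec)
```

Here the unmatched second AllReduce in the log is simply skipped rather than mis-paired, which is the main advantage of sequence alignment over naive positional zipping of the two event streams.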

Document type: Master's thesis
Supervisor: Fröning, Prof. Dr. Holger
Place of Publication: Heidelberg
Date of thesis defense: 2026
Date Deposited: 30 Apr 2026 10:27
Date: 2026
Faculties / Institutes: Service facilities > Institut f. Technische Informatik (ZITI)
DDC-classification: 004 Data processing Computer science
Collection: Institute of Computer Engineering - Selected theses