[HN] score: 0.09
PyTorch CUDA Allocator Fragmentation: When and Why OOMs Happen
June 1, 2026
Edward Yang explains the specific allocation patterns that cause fragmentation in PyTorch's CUDA caching allocator: a state in which free memory exists but cannot satisfy new requests. The post is particularly relevant for LLM serving workloads, where engineers push GPU memory to its limits and hit unexpected OOMs.
HOW THIS AFFECTS YOU
● builder: If you're running LLM inference close to GPU memory limits, this post explains why you're seeing unexpected OOMs and what allocator behavior to expect.
● researcher: Worth reading for a precise mental model of CUDA caching allocator fragmentation conditions, useful when designing memory-efficient training or inference systems.
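The fragmentation condition in the summary (free memory exists, yet a request fails) can be illustrated with a toy model. The sketch below is not PyTorch's allocator and is not taken from the post: the 16-slot arena, the tag scheme, and the helper names `largest_free_run` and `can_allocate` are all hypothetical, chosen only to show how total free space and the largest contiguous free block can diverge.

```python
# Toy model of fragmentation. NOT PyTorch's CUDA caching allocator;
# the arena size and helper names are assumptions for illustration.

def largest_free_run(arena):
    """Return the length of the longest contiguous run of free (None) slots."""
    best = run = 0
    for slot in arena:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best

def can_allocate(arena, size):
    """A request fits only if one contiguous free run is large enough."""
    return largest_free_run(arena) >= size

# Hypothetical arena of 16 slots, filled by eight 2-slot allocations.
arena = [None] * 16
for tag in range(8):
    arena[2 * tag] = arena[2 * tag + 1] = tag

# Free every other allocation, leaving 2-slot holes between live blocks.
for tag in range(0, 8, 2):
    arena[2 * tag] = arena[2 * tag + 1] = None

total_free = arena.count(None)      # half of the arena is free
largest = largest_free_run(arena)   # but no hole is wider than one allocation
fits = can_allocate(arena, 4)       # so a 4-slot request cannot be served

print(f"free={total_free} largest_contiguous={largest} fits_4={fits}")
```

On a real GPU, a rough analogue of this check is to compare `torch.cuda.memory_reserved()` with `torch.cuda.memory_allocated()`, or to read `torch.cuda.memory_summary()`: a large gap between reserved and allocated memory at the time of an OOM is one hint that fragmentation, not total usage, is the problem.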