I made a kernel 2.2x faster. It made my training loop 3x slower
June 2, 2026
A fused decode-attention kernel for Qwen2.5-0.5B GRPO training ran 2.2x faster than SDPA in microbenchmarks, but caused a 3x slowdown in Hugging Face's generate because it silently broke the torch.compile paths the baseline relied on. Separately, loop-level optimizations delivered a 4.8x speedup on the rollout phase before any kernel work. The post explains why a microbenchmark win does not transfer when integrating the kernel changes how the surrounding code is compiled.
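The arithmetic behind that result can be sketched with a toy cost model. This is a rough illustration, not the post's measurement method. Only the 2.2x and 3x ratios come from the summary above. The millisecond values, the `StepCost` type, and the helper names are hypothetical, chosen so the ratios work out. The assumption is that each decode step costs kernel time plus everything else, and that losing the compiled path inflates the "everything else" part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepCost:
    """Per-decode-step cost in milliseconds (illustrative values, not measurements)."""
    kernel_ms: float    # time spent inside the attention kernel
    overhead_ms: float  # everything else in the step: other ops, launches, Python


def step_ms(cost: StepCost) -> float:
    """Total time for one decode step."""
    return cost.kernel_ms + cost.overhead_ms


def kernel_speedup(baseline: StepCost, candidate: StepCost) -> float:
    """What a microbenchmark sees: the kernel in isolation."""
    return baseline.kernel_ms / candidate.kernel_ms


def end_to_end_speedup(baseline: StepCost, candidate: StepCost) -> float:
    """What the training loop sees: the whole decode step."""
    return step_ms(baseline) / step_ms(candidate)


# Hypothetical split: the baseline keeps its compiled path, so overhead is low.
baseline = StepCost(kernel_ms=0.22, overhead_ms=0.78)
# Hypothetical split: the fused kernel is faster, but the step around it
# is assumed to have lost compilation, so overhead grows.
candidate = StepCost(kernel_ms=0.10, overhead_ms=2.90)

print(f"kernel speedup:     {kernel_speedup(baseline, candidate):.2f}x")
print(f"end-to-end speedup: {end_to_end_speedup(baseline, candidate):.2f}x")
```

A value below 1 on the second line is a slowdown. The model makes one point: the microbenchmark only measures `kernel_ms`, so any change the kernel causes in `overhead_ms` stays invisible until the full generation loop is timed.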