I made a kernel 2.2x faster. It made my training loop 3x slower

June 2, 2026

A fused decode-attention kernel for Qwen2.5-0.5B GRPO training ran 2.2x faster than SDPA in microbenchmarks, but caused a 3x slowdown in Hugging Face's generate by silently breaking the torch.compile paths the baseline relied on. Separately, loop-level optimizations achieved a 4.8x speedup on the rollout phase before any kernel work. The post details why microbenchmark wins do not transfer when the integration context changes compilation behavior.

SOURCE

https://kyrieblunders.bearblog.dev/making-dr-grpo-go-brrr/
