HACKOBAR_item
[r/LocalLLaMA] score: 0.16

Why do people care more about tokens/s in decoding?

May 6, 2026
A Reddit practitioner surfaces a real-world inference insight: at 15k-token agentic contexts, prefill latency dominates decode speed, with 64k-token prompts taking 10+ minutes on Apple Silicon M-series hardware. Once decode reaches 10+ tokens/second, output already exceeds human reading pace, making time to first token (TTFT) the true UX bottleneck. Discrete-GPU users see the inverse bottleneck thanks to much higher memory bandwidth. Multi-token prediction (MTP) and chunked prefill remain underexplored remedies for local-deployment practitioners weighing Qwen3 30B-A3B MoE against 27B dense tradeoffs.
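The prefill-vs-decode split above can be checked with back-of-envelope arithmetic. The sketch below models response latency as prompt_tokens / prefill_tps plus output_tokens / decode_tps; the throughput numbers are illustrative assumptions for this example, not benchmarks from the post.

```python
# Back-of-envelope latency model: total response time = prefill (process
# the prompt, determines TTFT) + decode (generate the output tokens).
def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float):
    ttft = prompt_tokens / prefill_tps      # time to first token
    decode = output_tokens / decode_tps     # generation time after TTFT
    return ttft, ttft + decode

# Assumed throughputs for illustration only: modest prefill (100 tok/s)
# and a decode rate (15 tok/s) already above human reading pace.
ttft, total = response_time(prompt_tokens=15_000, output_tokens=500,
                            prefill_tps=100, decode_tps=15)
print(f"TTFT: {ttft:.0f}s of {total:.0f}s total")
```

With these assumed rates, a 15k-token prompt spends 150 s in prefill against about 33 s of decoding, so TTFT dominates; scaling the same prefill rate to a 64k prompt gives roughly 640 s, consistent with the "10+ minutes" figure in the summary.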
question | help