[r/LocalLLaMA] score: 0.15
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help
May 20, 2026
RTX 5080 16GB benchmarks show Qwen3.6 35B MoE at Q4_K_XL quantization reaching 56 tok/s generation and 1,584 tok/s prompt processing at 128k context via llama.cpp. Key finding: Multi-Token Prediction (MTP, newly merged in b9190) provides no throughput benefit at long context; runs with and without MTP converge to the same speed, so skip MTP for coding-agent workloads.
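To put the two quoted rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes "128k" means 131,072 tokens, that the quoted rates hold as flat averages over the whole prompt and reply, and that nothing is served from a prompt cache; the 500-token reply length is an arbitrary illustrative value, not something from the post.

```python
# Figures quoted in the post summary.
PROMPT_TOK_PER_S = 1584   # prompt processing rate
GEN_TOK_PER_S = 56        # generation rate

# Assumptions, not from the post: "128k" is taken as 131,072 tokens,
# and the reply length is an arbitrary example value.
CONTEXT_TOKENS = 128 * 1024
REPLY_TOKENS = 500


def turn_latency_s(prompt_tokens: int, reply_tokens: int) -> tuple[float, float]:
    """Return (prefill seconds, generation seconds) at the quoted rates."""
    prefill = prompt_tokens / PROMPT_TOK_PER_S
    generation = reply_tokens / GEN_TOK_PER_S
    return prefill, generation


prefill_s, gen_s = turn_latency_s(CONTEXT_TOKENS, REPLY_TOKENS)
print(f"prefill: {prefill_s:.1f} s, generation: {gen_s:.1f} s")
```

Under those assumptions, a cold full-context prompt takes roughly 83 s to process before the first token appears, while a 500-token reply takes about 9 s to generate.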
discussion