[r/LocalLLaMA] score: 0.19
MTP on Strix Halo with llama.cpp (PR #22673)
May 5, 2026
Multi-Token Prediction lands in llama.cpp via PR #22673, enabling speculative decoding with up to 3 draft tokens via `--spec-type mtp --spec-draft-n-max 3`. On an AMD Ryzen AI Max+ 395 with 128GB of LPDDR5X-8000, Qwen3.6-35BA3B-MTP-GGUF jumps from ~40 to 60-80 tokens/s under Vulkan, nearly doubling throughput with no prefill regression. The 36GB MTP GGUF ships a draft head alongside the base model; since speculation only pays off when drafted tokens are accepted, it speeds up predictable output such as math far more than open-ended generation. Worth testing immediately if you run MoE models on unified-memory APUs.
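For anyone who wants to try it, here is a minimal sketch of an invocation. Only `--spec-type mtp` and `--spec-draft-n-max 3` come from the PR; the model filename and the `-ngl`/`-c` values are illustrative placeholders, not from the post:

```bash
# Sketch, not a verified command: --spec-type / --spec-draft-n-max are the
# flags named in PR #22673; the model path, -ngl, and -c are placeholders.
# The MTP GGUF bundles the draft head with the base weights, so no separate
# draft model (-md) is passed.
llama-server \
  -m Qwen3.6-35BA3B-MTP.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3 \
  -ngl 99 \
  -c 8192
```

Unlike classic two-model speculative decoding (a small draft model via `-md`), the MTP head presumably reuses the base model's hidden state rather than running a second forward pass from scratch, which would explain why prefill doesn't regress.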
discussion