[r/LocalLLaMA] score: 0.19
MTP on Strix Halo with llama.cpp (PR #22673)
May 5, 2026
Multi-Token Prediction lands in llama.cpp via PR #22673, enabling speculative decoding with up to 3 draft tokens via `--spec-type mtp --spec-draft-n-max 3`. On an AMD Ryzen AI Max+ 395 with 128GB of LPDDR5X-8000, Qwen3.6-35BA3B-MTP-GGUF jumps from ~40 to 60-80 tokens/s under Vulkan, nearly doubling throughput with no prefill regression. The 36GB MTP GGUF ships a draft head alongside the base model; since speculation only pays off when drafted tokens are accepted, it speeds up predictable output such as math far more than open-ended generation. Worth testing immediately if you run MoE models on unified-memory APUs.
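For anyone who wants to try it, here is a minimal sketch of an invocation. Only `--spec-type mtp` and `--spec-draft-n-max 3` come from the PR; the model filename and the `-ngl`/`-c` values are illustrative placeholders, not from the post:

```bash
# Sketch, not a verified command: --spec-type / --spec-draft-n-max are the
# flags named in PR #22673; the model path, -ngl, and -c are placeholders.
# The MTP GGUF bundles the draft head with the base weights, so no separate
# draft model (-md) is passed.
llama-server \
  -m Qwen3.6-35BA3B-MTP.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3 \
  -ngl 99 \
  -c 8192
```

Unlike classic two-model speculative decoding (a small draft model via `-md`), the MTP head presumably reuses the base model's hidden state rather than running a second forward pass from scratch, which would explain why prefill doesn't regress.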
discussion