[r/LocalLLaMA]
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
May 6, 2026
Havenoammo on HuggingFace released Qwen3.6-35B-A3B-MTP-GGUF, a GGUF-quantized MoE model with grafted Multi-Token Prediction (MTP) layers targeting speculative-decoding speedups in llama.cpp. Results are architecture-dependent: Q4 yields only a 6% throughput gain and Q8 just 2.5% on a 5090, versus 2 to 2.5x on the 27B dense variant, suggesting MTP efficiency is constrained by the qwen35moe MoE routing implementation. A three-GPU setup of 2x 5070 Ti plus a 3090 pushed Q8 from 110 to 165 tokens per second, indicating multi-GPU configurations benefit disproportionately. Local inference practitioners running MoE architectures should benchmark on their own hardware before expecting dense-model-equivalent MTP gains.
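When comparing your own runs against the post's figures, the speedup factor is just MTP throughput over baseline throughput. A minimal sketch using the multi-GPU Q8 numbers reported above (the helper name is illustrative, not from the post):

```python
def relative_gain(baseline_tps: float, mtp_tps: float) -> float:
    """Speedup factor from enabling MTP: tokens/s with MTP over tokens/s without."""
    return mtp_tps / baseline_tps

# The post's multi-GPU Q8 figures: 110 -> 165 tokens/s
factor = relative_gain(110.0, 165.0)
print(f"{factor:.2f}x ({(factor - 1) * 100:.0f}% faster)")  # -> 1.50x (50% faster)
```

By this measure the multi-GPU result is a 1.5x gain, which makes the contrast with the single-5090 MoE numbers (1.06x at Q4, 1.025x at Q8) concrete.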