[r/LocalLLaMA]
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
May 6, 2026
Havenoammo on HuggingFace released Qwen3.6-35B-A3B-MTP-GGUF, a GGUF-quantized MoE model with grafted Multi-Token Prediction (MTP) layers targeting speculative-decoding speedups in llama.cpp. Results are architecture-dependent: Q4 yields only a 6% throughput gain and Q8 just 2.5% on a 5090, versus 2 to 2.5x on the 27B dense variant, suggesting MTP efficiency is constrained by the qwen35moe MoE routing implementation. A three-GPU setup of 2x 5070 Ti plus a 3090 pushed Q8 from 110 to 165 tokens per second, indicating multi-GPU configurations benefit disproportionately. Local inference practitioners running MoE architectures should benchmark on their own hardware before expecting dense-model-equivalent MTP gains.
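When comparing your own runs against the post's figures, the speedup factor is just MTP throughput over baseline throughput. A minimal sketch using the multi-GPU Q8 numbers reported above (the helper name is illustrative, not from the post):

```python
def relative_gain(baseline_tps: float, mtp_tps: float) -> float:
    """Speedup factor from enabling MTP: tokens/s with MTP over tokens/s without."""
    return mtp_tps / baseline_tps

# The post's multi-GPU Q8 figures: 110 -> 165 tokens/s
factor = relative_gain(110.0, 165.0)
print(f"{factor:.2f}x ({(factor - 1) * 100:.0f}% faster)")  # -> 1.50x (50% faster)
```

By this measure the multi-GPU result is a 1.5x gain, which makes the contrast with the single-5090 MoE numbers (1.06x at Q4, 1.025x at Q8) concrete.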