[r/LocalLLaMA] score: 0.20

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

May 9, 2026
A practitioner achieved 80+ tok/sec with 128K context on a 12GB RTX 4070 Super running Qwen3.6 35B A3B (a 35B-parameter MoE with 3B active parameters) via llama.cpp with the Multi-Token Prediction PR, reporting a draft acceptance rate above 80%. This makes frontier-class MoE inference viable on consumer GPUs, a notable result for local deployment.
tutorial | guide
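The reported 80%+ acceptance rate can be sanity-checked with the standard speculative-decoding estimate: if each of k drafted tokens is accepted independently with probability a, the expected number of tokens committed per target-model forward pass is the geometric sum 1 + a + ... + a^k. A minimal sketch (the 0.8 acceptance rate comes from the post; k = 4 drafts and the independence assumption are illustrative, not from the PR):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass in
    speculative/multi-token-prediction decoding, assuming each
    of the k draft tokens is accepted independently with
    probability a. Equals (1 - a**(k+1)) / (1 - a) for a < 1."""
    # Geometric series: 1 + a + a^2 + ... + a^k
    return sum(a ** i for i in range(k + 1))

if __name__ == "__main__":
    # 0.8 acceptance from the post; 4 drafts per step is an assumed value
    e = expected_tokens_per_step(0.8, 4)
    print(f"~{e:.2f} tokens per target pass")  # ~3.36
```

This is an optimistic estimate, since drafting and verification add their own overhead, but it shows why an 80%+ acceptance rate translates into a multi-x throughput gain over plain autoregressive decoding.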