[r/LocalLLaMA] score: 0.20

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

May 9, 2026
A practitioner achieved 80+ tok/sec with 128K context on a 12GB RTX 4070 Super running Qwen3.6 35B A3B (a 35B-parameter MoE with 3B active parameters) via llama.cpp with the Multi-Token Prediction PR, reporting a draft acceptance rate above 80%. This makes frontier-class MoE inference viable on consumer GPUs, a notable result for local deployment.
tutorial | guide
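The reported 80%+ acceptance rate can be sanity-checked with the standard speculative-decoding estimate: if each of k drafted tokens is accepted independently with probability a, the expected number of tokens committed per target-model forward pass is the geometric sum 1 + a + ... + a^k. A minimal sketch (the 0.8 acceptance rate comes from the post; k = 4 drafts and the independence assumption are illustrative, not from the PR):

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected tokens committed per target forward pass in
    speculative/multi-token-prediction decoding, assuming each
    of the k draft tokens is accepted independently with
    probability a. Equals (1 - a**(k+1)) / (1 - a) for a < 1."""
    # Geometric series: 1 + a + a^2 + ... + a^k
    return sum(a ** i for i in range(k + 1))

if __name__ == "__main__":
    # 0.8 acceptance from the post; 4 drafts per step is an assumed value
    e = expected_tokens_per_step(0.8, 4)
    print(f"~{e:.2f} tokens per target pass")  # ~3.36
```

This is an optimistic estimate, since drafting and verification add their own overhead, but it shows why an 80%+ acceptance rate translates into a multi-x throughput gain over plain autoregressive decoding.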