[r/LocalLLaMA] score: 0.15
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp
May 21, 2026
A Reddit user reports 110 tok/s on Qwen3.6 35B A3B (a mixture-of-experts model with about 3B active parameters per token) using ik_llama.cpp on an RTX 4070 Super 12GB. That compares with ~80 tok/s on llama.cpp, whose performance reportedly degraded after its MTP merge. ik_llama.cpp's CPU offload optimizations appear to be considerably better tuned for hybrid GPU/CPU inference, which matters here because the full model does not fit in 12GB of VRAM. Relevant for local inference enthusiasts running large MoE models on consumer hardware.
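The arithmetic behind these numbers can be sketched roughly as follows. The throughput figures come from the post; the 4.5 bits-per-weight quantization is an assumption for illustration only (the post's actual quant is not stated here), and the helper names `speedup` and `weight_gib` are hypothetical. The estimate covers weights only and ignores KV cache and activations.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Relative throughput gain of new over old, as a fraction."""
    return new_tps / old_tps - 1.0


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (weights only, no KV cache)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


# Throughput figures reported in the post (the 80 tok/s value is approximate).
gain = speedup(110, 80)

# ASSUMPTION: ~4.5 bits per weight, a typical 4-bit-class quantization.
BITS_PER_WEIGHT = 4.5
total = weight_gib(35, BITS_PER_WEIGHT)   # all experts
active = weight_gib(3, BITS_PER_WEIGHT)   # parameters touched per token

print(f"speedup over llama.cpp: {gain:.1%}")
print(f"all weights: {total:.1f} GiB, active per token: {active:.1f} GiB")
print(f"fits in 12 GiB VRAM: {total <= 12}")
```

Under that assumption the full weight set exceeds 12GB while the per-token active set is far smaller, which is why the quality of the CPU offload path can dominate throughput for a MoE model on this class of GPU.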
tutorial | guide