[r/LocalLLaMA] score: 0.15

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

May 21, 2026

A Reddit user reports 110 tok/s on Qwen3.6 35B A3B (a MoE model with 3B active parameters) using ik_llama.cpp on an RTX 4070 Super 12GB, up from ~80 tok/s with llama.cpp, whose performance degraded after its MTP merge. ik_llama.cpp's CPU offload optimizations appear significantly better tuned for hybrid GPU/CPU inference. Relevant for local inference enthusiasts running large MoE models on consumer hardware.
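
To put the two reported figures in perspective, here is a minimal Python sketch of the arithmetic. It assumes the ~80 tok/s baseline is exactly 80, which the post only gives approximately; the function and variable names are illustrative and do not come from the post or from either project.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Return the throughput ratio new/old (e.g. 1.5 means 50% faster)."""
    if old_tps <= 0:
        raise ValueError("old_tps must be positive")
    return new_tps / old_tps


# Figures taken from the summary above; the baseline is approximate ("~80").
baseline_tps = 80.0   # llama.cpp, assumed exact for this calculation
reported_tps = 110.0  # ik_llama.cpp

ratio = speedup(reported_tps, baseline_tps)
percent_gain = (ratio - 1.0) * 100.0

# Per-token latency is the reciprocal of throughput.
baseline_ms_per_token = 1000.0 / baseline_tps
reported_ms_per_token = 1000.0 / reported_tps

print(f"speedup: {ratio:.3f}x ({percent_gain:.1f}% faster)")
print(f"latency: {baseline_ms_per_token:.2f} ms/token -> {reported_ms_per_token:.2f} ms/token")
```

Under that assumption the gain works out to roughly 1.4x, or about 3.4 ms saved per generated token. Because the baseline is approximate, treat the result as a ballpark figure.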

tutorial | guide

SOURCE

https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/
