[r/LocalLLaMA] score: 0.18

Got MTP + TurboQuant running: Qwen3.6-27B at 80+ t/s with 262K context on a single RTX 4090

May 8, 2026
A practitioner achieved 80-87 t/s on Qwen3.6-27B at 262K context on a single RTX 4090 (24 GB) by combining Multi-Token Prediction (MTP) with TurboQuant's TBQ4_0 KV-cache quantization, which is claimed to be lossless at 4.25 bpw. The Q4_K_M model runs with grafted MTP heads at a 73% draft acceptance rate, nearly doubling throughput from a 43 t/s baseline. Long-context inference on consumer hardware just got a serious upgrade worth replicating.
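A rough sanity check of the reported speedup, assuming the standard speculative-decoding acceptance model (each of k drafted tokens accepted independently with probability alpha, draft overhead ignored); the post does not say how many tokens the MTP heads draft per step, so k here is an assumption, not a reported figure:

```python
# Back-of-envelope check of the numbers in the post.
# Assumption (not from the post): classic speculative-decoding model,
# where k drafted tokens are each accepted with probability alpha and
# drafting cost is ignored.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per verification step: 1 + alpha + ... + alpha^k."""
    return sum(alpha ** i for i in range(k + 1))

baseline_tps = 43.0  # reported baseline throughput, t/s
alpha = 0.73         # reported draft acceptance rate

for k in (1, 2, 3):
    speedup = expected_tokens_per_step(alpha, k)
    print(f"k={k}: ~{speedup:.2f}x -> ~{baseline_tps * speedup:.0f} t/s")

# KV-cache size reduction going from 16-bit to 4.25 bpw, as described:
print(f"KV memory reduction: {16 / 4.25:.2f}x")
```

With two or three drafted tokens per step, this idealized model lands in the reported 80-87 t/s range, so the claimed near-doubling is plausible; the ~3.8x KV-cache shrink is what makes 262K context fit alongside the Q4_K_M weights in 24 GB.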