[r/LocalLLaMA] score: 0.20

Get faster Qwen 3.6 27B

May 6, 2026
RDson released Qwen3.6-27B-MTP-Q4_K_M-GGUF on Hugging Face, a Q4_K_M-quantized GGUF of Qwen 3.6 27B with Multi-Token Prediction (MTP) support, reaching 50 tokens per second on a single RTX 3090 at 100K context using the am17an branch of llama.cpp. The setup combines speculative decoding with spec-draft-n-max set to 2, flash attention, Q4_0 KV-cache quantization, and a batch size of 2048. Practitioners running consumer-grade single-GPU inference should note that spec-draft-n-max 3 exceeds the 3090's VRAM headroom at high context lengths, making 2 the practical ceiling. The result is competitive throughput versus standard autoregressive decoding on equivalent hardware, and 100K context covers most real-world workloads before context compaction becomes necessary.
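For reference, an invocation with these settings might look like the sketch below. This is not taken from the post: the --spec-draft-n-max spelling is an assumption based on the setting name given above and would be specific to the am17an branch, and the model filename is illustrative; the remaining flags are standard llama.cpp options.

```bash
# Sketch of a llama-server launch matching the reported settings.
# --spec-draft-n-max is an assumed branch-specific flag; verify against
# the am17an branch's --help output before relying on it.
./llama-server \
  -m Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -ngl 99 \
  -c 102400 \
  -b 2048 \
  --flash-attn \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --spec-draft-n-max 2
# -ngl 99: offload all layers to the single RTX 3090
# -c 102400: ~100K context, as benchmarked in the post
# cache-type-k/v q4_0: the Q4_0 KV-cache quantization described above
```

The Q4_0 KV cache is doing much of the work here: at roughly 4.5 bits per value versus 16 for f16, it cuts KV memory to about a quarter, which is what leaves room for 100K context plus draft tokens on a 24 GB card.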
new model