[r/LocalLLaMA] score: 0.20

BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)

May 9, 2026
A llama.cpp fork called BeeLlama.cpp integrates DFlash attention and TurboQuant quantization with MTP speculative decoding, enabling Qwen 3.6 27B at Q5 precision with a 200k-token context on a single RTX 3090. It reaches a peak of 135 tokens/sec, 2-3x faster than baseline llama.cpp, with vision and reasoning support intact. It is Windows-native and runs without VRAM overflow. Local inference enthusiasts running large models on consumer GPUs should evaluate this immediately.
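The headline claim (a 27B model at Q5 plus a 200k-token cache inside 24 GiB) is easiest to sanity-check with back-of-the-envelope arithmetic. The Python sketch below does only that; the layer count, GQA KV-head count, head dimension, and KV-cache bit width are illustrative assumptions, not published Qwen 3.6 27B specs or BeeLlama.cpp internals.

```python
# Back-of-the-envelope VRAM budget for the headline claim: ~27B weights at
# ~5 bits/weight plus a 200k-token KV cache on a 24 GiB RTX 3090. Every
# architecture number below (layers, GQA KV heads, head_dim, KV-cache bit
# width) is a hypothetical placeholder, not a published Qwen 3.6 27B spec.

GiB = 1024 ** 3

# Model weights at Q5-style quantization (~5 bits per parameter).
params = 27e9
weight_bits = 5.0
weights_bytes = params * weight_bits / 8            # ~15.7 GiB

# KV cache for 200k tokens, assuming grouped-query attention with few KV
# heads and an aggressively quantized (4-bit) cache -- the kind of
# compression a scheme like TurboQuant would need to supply for this to fit.
layers, kv_heads, head_dim = 48, 4, 128             # assumed, not confirmed
context_tokens = 200_000
kv_bits = 4
kv_bytes = 2 * layers * kv_heads * head_dim * context_tokens * kv_bits / 8

total_bytes = weights_bytes + kv_bytes
print(f"weights  ~ {weights_bytes / GiB:5.1f} GiB")
print(f"KV cache ~ {kv_bytes / GiB:5.1f} GiB")
print(f"total    ~ {total_bytes / GiB:5.1f} GiB of 24 GiB on an RTX 3090")
```

Under these unverified assumptions the budget lands around 20 GiB, leaving some headroom for activations and CUDA overhead; with a plain FP16 KV cache the cache alone would be roughly four times larger, which suggests cache quantization, not just weight quantization, is the enabling piece.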
resources