[r/LocalLLaMA] score: 0.20

BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)

May 9, 2026
A llama.cpp fork called BeeLlama.cpp integrates DFlash attention and TurboQuant quantization with MTP speculative decoding, enabling Qwen 3.6 27B at Q5 precision with a 200k-token context on a single RTX 3090. It reaches a peak of 135 tokens/sec, 2-3x faster than baseline llama.cpp, with vision and reasoning support intact. It is Windows-native and runs without VRAM overflow. Local inference enthusiasts running large models on consumer GPUs should evaluate this immediately.
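The headline claim (a 27B model at Q5 plus a 200k-token cache inside 24 GiB) is easiest to sanity-check with back-of-the-envelope arithmetic. The Python sketch below does only that; the layer count, GQA KV-head count, head dimension, and KV-cache bit width are illustrative assumptions, not published Qwen 3.6 27B specs or BeeLlama.cpp internals.

```python
# Back-of-the-envelope VRAM budget for the headline claim: ~27B weights at
# ~5 bits/weight plus a 200k-token KV cache on a 24 GiB RTX 3090. Every
# architecture number below (layers, GQA KV heads, head_dim, KV-cache bit
# width) is a hypothetical placeholder, not a published Qwen 3.6 27B spec.

GiB = 1024 ** 3

# Model weights at Q5-style quantization (~5 bits per parameter).
params = 27e9
weight_bits = 5.0
weights_bytes = params * weight_bits / 8            # ~15.7 GiB

# KV cache for 200k tokens, assuming grouped-query attention with few KV
# heads and an aggressively quantized (4-bit) cache -- the kind of
# compression a scheme like TurboQuant would need to supply for this to fit.
layers, kv_heads, head_dim = 48, 4, 128             # assumed, not confirmed
context_tokens = 200_000
kv_bits = 4
kv_bytes = 2 * layers * kv_heads * head_dim * context_tokens * kv_bits / 8

total_bytes = weights_bytes + kv_bytes
print(f"weights  ~ {weights_bytes / GiB:5.1f} GiB")
print(f"KV cache ~ {kv_bytes / GiB:5.1f} GiB")
print(f"total    ~ {total_bytes / GiB:5.1f} GiB of 24 GiB on an RTX 3090")
```

Under these unverified assumptions the budget lands around 20 GiB, leaving some headroom for activations and CUDA overhead; with a plain FP16 KV cache the cache alone would be roughly four times larger, which suggests cache quantization, not just weight quantization, is the enabling piece.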
resources