[r/LocalLLaMA] score: 0.18
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
May 6, 2026
A community benchmark confirms that Qwen3.6-27B, quantized to NVFP4 via compressed-tensors, runs a 200k-token context on a single 32GB RTX 5090 using a vLLM 0.20.1 dev build with FlashInfer attention and an FP8-E4M3 KV cache. Multi-Token Prediction (MTP) with 3 speculative tokens is enabled, lifting decode throughput above standard autoregressive sampling. This matters for 5090 owners previously limited to GGUF, or to FP8 on 48GB cards: it offers a validated long-context vLLM path on consumer 32GB hardware. There are no production-grade throughput numbers yet, but the configuration stack is fully reproducible.
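A plausible launch command assembling the stack described above. The flag names follow current vLLM conventions, but the exact model repository name and the MTP speculative-config keys are assumptions, not details confirmed by the post:

```shell
# Sketch only: model repo name and speculative-config keys are illustrative.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen3-27B-NVFP4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

The same options can be passed to the `LLM(...)` Python constructor if you prefer offline inference over the OpenAI-compatible server.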
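To see why the FP8-E4M3 KV cache is load-bearing at this context length, here is a back-of-envelope sizing. The layer count, KV-head count, and head dimension below are illustrative guesses for a ~27B GQA model, not the model's published config:

```python
def kv_cache_gib(seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elt: float) -> float:
    """KV-cache size in GiB: 2x (keys and values) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt / 2**30

# Hypothetical GQA dims for a ~27B model (NOT the published Qwen config).
LAYERS, KV_HEADS, HEAD_DIM, CTX = 48, 4, 128, 200_000

fp16 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)  # ~18.3 GiB
fp8 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)   # ~9.2 GiB
print(f"FP16 KV cache: {fp16:.1f} GiB, FP8: {fp8:.1f} GiB")
```

Halving the KV bytes is what leaves room for NVFP4 weights plus a 200k cache inside 32GB; the real figures depend on the model's actual dimensions.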
discussion