[r/LocalLLaMA] score: 0.18
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
May 6, 2026
A community benchmark confirms that Qwen3.6-27B, quantized to NVFP4 via compressed-tensors, runs a 200k-token context on a single 32GB RTX 5090 using a vLLM 0.20.1 dev build with FlashInfer attention and an FP8-E4M3 KV cache. Multi-Token Prediction (MTP) with 3 speculative tokens is enabled, lifting decode throughput above standard autoregressive sampling. This matters for 5090 owners previously limited to GGUF, or to FP8 on 48GB cards: it offers a validated long-context vLLM path on consumer 32GB hardware. There are no production-grade throughput numbers yet, but the configuration stack is fully reproducible.
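A plausible launch command assembling the stack described above. The flag names follow current vLLM conventions, but the exact model repository name and the MTP speculative-config keys are assumptions, not details confirmed by the post:

```shell
# Sketch only: model repo name and speculative-config keys are illustrative.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Qwen/Qwen3-27B-NVFP4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

The same options can be passed to the `LLM(...)` Python constructor if you prefer offline inference over the OpenAI-compatible server.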
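To see why the FP8-E4M3 KV cache is load-bearing at this context length, here is a back-of-envelope sizing. The layer count, KV-head count, and head dimension below are illustrative guesses for a ~27B GQA model, not the model's published config:

```python
def kv_cache_gib(seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elt: float) -> float:
    """KV-cache size in GiB: 2x (keys and values) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt / 2**30

# Hypothetical GQA dims for a ~27B model (NOT the published Qwen config).
LAYERS, KV_HEADS, HEAD_DIM, CTX = 48, 4, 128, 200_000

fp16 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)  # ~18.3 GiB
fp8 = kv_cache_gib(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)   # ~9.2 GiB
print(f"FP16 KV cache: {fp16:.1f} GiB, FP8: {fp8:.1f} GiB")
```

Halving the KV bytes is what leaves room for NVFP4 weights plus a 200k cache inside 32GB; the real figures depend on the model's actual dimensions.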
discussion