[r/LocalLLaMA]

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

April 25, 2026
**Summary:** A community member has demonstrated Qwen3.6-27B running at approximately 80 tokens per second with a 218k context window on a single RTX 5090 GPU using vLLM 0.19.1rc1, enabled by an NVFP4-quantized model with Multi-Token Prediction (MTP) available on Hugging Face. This matters because it shows that a 27B-parameter model can be served at practical inference speeds on consumer/prosumer single-GPU hardware via FP4 quantization, whereas at full precision a model this size would typically require a multi-GPU setup. The same vLLM serving recipe previously used for Qwen3.5-27B transfers directly to the new model, suggesting a reproducible local deployment path for practitioners running RTX 5090 hardware.
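The exact serving recipe is not reproduced in the summary, but a single-GPU vLLM launch for a setup like this generally takes the shape below. The context length comes from the post; the Hugging Face repo name, memory-utilization value, and port are illustrative assumptions, not details from the source.

```shell
# Sketch of a single-GPU vLLM launch for an NVFP4 checkpoint (flags illustrative).
# The repo name below is hypothetical; substitute the actual NVFP4 upload.
# NVFP4 quantization is assumed to be auto-detected from the checkpoint's config.
vllm serve Qwen/Qwen3.6-27B-NVFP4 \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

Once running, vLLM exposes an OpenAI-compatible API at `http://localhost:8000/v1`. The post also credits MTP for part of the speedup; vLLM does expose speculative-decoding options (e.g. `--speculative-config`), but since the summary does not say which settings were used, none are shown here.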