[r/LocalLLaMA]

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

April 25, 2026
**Summary:** A community member has demonstrated Qwen3.6-27B running at approximately 80 tokens per second with a 218k context window on a single RTX 5090 GPU using vLLM 0.19.1rc1, enabled by an NVFP4-quantized model with Multi-Token Prediction (MTP) available on Hugging Face. This matters because it shows that a 27B-parameter model can be served at practical inference speeds on consumer/prosumer single-GPU hardware via FP4 quantization, whereas at full precision a model this size would typically require a multi-GPU setup. The same vLLM serving recipe previously used for Qwen3.5-27B transfers directly to the new model, suggesting a reproducible local deployment path for practitioners running RTX 5090 hardware.
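The exact serving recipe is not reproduced in the summary, but a single-GPU vLLM launch for a setup like this generally takes the shape below. The context length comes from the post; the Hugging Face repo name, memory-utilization value, and port are illustrative assumptions, not details from the source.

```shell
# Sketch of a single-GPU vLLM launch for an NVFP4 checkpoint (flags illustrative).
# The repo name below is hypothetical; substitute the actual NVFP4 upload.
# NVFP4 quantization is assumed to be auto-detected from the checkpoint's config.
vllm serve Qwen/Qwen3.6-27B-NVFP4 \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

Once running, vLLM exposes an OpenAI-compatible API at `http://localhost:8000/v1`. The post also credits MTP for part of the speedup; vLLM does expose speculative-decoding options (e.g. `--speculative-config`), but since the summary does not say which settings were used, none are shown here.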