[r/LocalLLaMA]

Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B, and Gemma 4 models on an H100

April 25, 2026
A Reddit practitioner benchmarked 8 recent small-to-mid-size models on a single H100 80GB using vLLM 0.19.1, measuring throughput (tokens/sec) and TTFT (ms) at concurrency levels of 1, 4, 8, and 16 with 128-token input and output sequences. Gemma 4 E2B-it reached 3,180 TPS at 16 concurrent requests versus 226 TPS for the dense Gemma 4 31B: a ~14x throughput advantage from a model with roughly 1/15th the parameter count. For teams on single-GPU inference budgets, this suggests Gemma 4's sparse MoE E2B variant is the dominant choice for throughput-sensitive workloads, while larger models such as Qwen 3.6 35B A3B occupy a different cost-quality tradeoff whose quality gains may not justify the capacity cost on a single H100.
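The post reports numbers but not the harness. For readers who want to run a similar measurement, here is a minimal sketch, assuming vLLM's OpenAI-compatible server on localhost:8000: TTFT is taken as the time to the first streamed chunk, and throughput as aggregate completion tokens over wall time. The endpoint, model id, and prompt are placeholders, not the poster's actual setup.

```python
# Hypothetical benchmark loop: stream completions from a vLLM
# OpenAI-compatible server and record TTFT plus aggregate tokens/sec.
import asyncio
import time

from openai import AsyncOpenAI

# Placeholder endpoint and model id; vLLM ignores the API key by default.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "google/gemma-4-e2b-it"   # placeholder, not the poster's exact id
PROMPT = "word " * 128            # stand-in for the 128-token input sequences


async def one_request():
    """Stream one completion; return (ttft_seconds, completion_chunks)."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    stream = await client.completions.create(
        model=MODEL, prompt=PROMPT, max_tokens=128, stream=True
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens += 1  # each streamed chunk carries roughly one token
    return ttft, tokens


async def bench(concurrency: int):
    """Fire `concurrency` simultaneous requests; print aggregate stats."""
    start = time.perf_counter()
    results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total_tokens = sum(t for _, t in results)
    mean_ttft_ms = 1000 * sum(t for t, _ in results) / len(results)
    print(f"c={concurrency:2d}  {total_tokens / elapsed:8.1f} tok/s  "
          f"mean TTFT {mean_ttft_ms:7.1f} ms")


async def main():
    for c in (1, 4, 8, 16):  # the concurrency levels reported in the post
        await bench(c)


asyncio.run(main())
```

Counting streamed chunks as tokens is an approximation; for exact counts and warmup handling, vLLM's own benchmark scripts (e.g. benchmark_serving.py in the vLLM repo) are the more rigorous route.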