HACKOBAR_item
[r/LocalLLaMA] score: 0.16

Exaggerated PCI-E bandwidth concerns?

May 6, 2026
QuantTrio releases gemma-4-31B-it-AWQ-6Bit on Hugging Face, a 6-bit AWQ quantization of Google's Gemma 4 31B instruction-tuned model. Real-world benchmarking on dual RTX 5060 Ti 16GB via vLLM with tensor parallelism (TP=2) shows peak PCIe consumption of only 3-4 GB/s during 32k-context prefill, just 40-50% of a PCIe 4.0 x4 link's ~8 GB/s. This challenges the prevailing assumption that asymmetric consumer PCIe topologies bottleneck multi-GPU inference; compute and VRAM bandwidth appear to be the actual constraints. Practitioners building budget multi-GPU rigs should reconsider dismissing chipset-attached slots as unviable for tensor-parallel LLM serving.
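A back-of-envelope check makes the low PCIe numbers plausible. Megatron-style tensor parallelism issues roughly two all-reduces per transformer layer, and a ring all-reduce on two GPUs moves about one tensor's worth of data per GPU. The sketch below estimates total interconnect traffic for a 32k prefill; the hidden size, layer count, and prefill time are illustrative assumptions, not confirmed Gemma 4 31B values.

```python
# Back-of-envelope estimate of interconnect traffic from tensor-parallel
# all-reduces during prefill. Model dimensions below are assumptions for
# illustration, not confirmed Gemma 4 31B values.

def allreduce_gb(seq_len, hidden, layers, tp=2, dtype_bytes=2):
    """GB each GPU sends over the interconnect for one prefill pass.

    Megatron-style TP does ~2 all-reduces per transformer layer
    (after attention and after the MLP). A ring all-reduce moves
    2*(tp-1)/tp of the tensor per GPU.
    """
    tensor_bytes = seq_len * hidden * dtype_bytes  # fp16/bf16 activations
    per_layer = 2 * tensor_bytes * 2 * (tp - 1) / tp
    return layers * per_layer / 1e9

# Assumed dimensions for a ~31B dense model (illustrative only).
total_gb = allreduce_gb(seq_len=32_768, hidden=5_120, layers=60)
prefill_seconds = 10.0  # assumed prefill wall time
print(f"~{total_gb:.0f} GB total, ~{total_gb / prefill_seconds:.1f} GB/s average")
```

Under these assumed dimensions the average lands in the single-digit GB/s range, consistent with the reported 3-4 GB/s peaks and well within a PCIe 4.0 x4 link.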
discussion