[r/LocalLLaMA] score: 0.11

llama.cpp VRAM Tricks: Free Space for Larger Context Windows

June 17, 2026

The post describes running Qwen3-27B at Q5_K_XL with a 150k context on a 3090 eGPU. The main VRAM savings come from --no-mmproj-offload, which saves about 1GB by keeping the vision projector on the CPU, and from lower KV cache precision set with the --cache-type-k and --cache-type-v flags. Combined with --no-mmap and --mlock, these flags can extend usable context without a hardware upgrade.
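To see why KV cache precision matters at long context, here is a rough sizing sketch. It assumes a standard attention layout where every layer caches keys and values for every token; models with sliding-window or hybrid attention will need less. The layer count, KV head count, and head dimension below are hypothetical placeholders, not values from the post or from any model card, and the function name is made up for illustration.

```python
# Approximate bytes per cached element for some llama.cpp cache types.
# q8_0 and q4_0 store blocks of 32 values plus a 2-byte fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical architecture values, NOT taken from the post.
layers, kv_heads, head_dim, ctx = 32, 8, 128, 150_000

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(layers, kv_heads, head_dim, ctx, cache_type)
    print(f"{cache_type}: {size:.1f} GiB")
```

Under these assumptions, q8_0 takes a little over half the memory of f16 for the same context length, which is where most of the extra headroom would come from.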

HOW THIS AFFECTS YOU

● Builder: You can fit larger context windows on consumer GPUs by combining --no-mmproj-offload with quantized KV cache flags in llama.cpp, with no code changes.
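As a rough illustration of how the flags fit together, the sketch below assembles a llama-server command line. The file names, the helper name, and the q8_0 cache type are assumptions for illustration; the post summary does not say which cache type was used. Some llama.cpp builds may also require flash attention to be enabled for a quantized V cache, so check llama-server --help for your version.

```python
import shlex


def build_llama_server_cmd(model_path, mmproj_path, ctx_size, cache_type="q8_0"):
    """Assemble a llama-server command using the flags named in the summary."""
    return [
        "llama-server",
        "--model", model_path,
        "--mmproj", mmproj_path,        # vision projector file
        "--ctx-size", str(ctx_size),
        "--no-mmproj-offload",          # keep the projector on the CPU
        "--cache-type-k", cache_type,   # quantized K cache (assumed q8_0)
        "--cache-type-v", cache_type,   # quantized V cache (assumed q8_0)
        "--no-mmap",
        "--mlock",
    ]


# Placeholder paths; substitute your own GGUF files.
cmd = build_llama_server_cmd("model.gguf", "mmproj.gguf", 150_000)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed directly to subprocess.run.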

