[r/LocalLLaMA]
CUDA: reduce MMQ stream-k overhead by JohannesGaessler · Pull Request #22298 · ggml-org/llama.cpp
April 25, 2026
**llama.cpp PR #22298 reduces stream-k scheduling overhead in the quantized matrix-multiplication (MMQ) CUDA kernels, specifically targeting prompt-processing (prefill) performance for Mixture-of-Experts (MoE) models.**
Benchmark results posted in the PR discussion show measurable prefill-throughput improvements for MoE architectures (e.g., DeepSeek or Mixtral variants) running quantized inference on CUDA GPUs. This matters for practitioners running local inference with llama.cpp on NVIDIA hardware, particularly when processing long prompts with MoE models, where stream-k scheduling overhead was a bottleneck in the quantized matrix-multiply path.