[HN] score: 0.32

Tiny-vLLM: C++ and CUDA LLM Inference Engine With Full Course Included

May 29, 2026

Tiny-vLLM is a from-scratch LLM inference engine written in C++ and CUDA that targets Llama 3.2 1B Instruct. It implements PagedAttention, FlashAttention-like kernels, continuous batching, and a KV cache. It ships with a structured course that walks through each component, a rare combination of production-relevant code and educational material.
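To make the PagedAttention idea named in the summary more concrete, here is a minimal CPU-side sketch of the bookkeeping behind a paged KV cache: fixed-size physical blocks handed out from a free list, plus a per-sequence block table that maps a logical token index to a (block, offset) pair. This is not taken from the tiny-vllm repository. The names BlockAllocator, SequenceBlockTable, and Slot, the block size, and the layout are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sketch: hands out physical KV-cache block ids from a free list.
class BlockAllocator {
public:
    explicit BlockAllocator(std::size_t num_blocks) {
        // Push ids in reverse so that block 0 is handed out first.
        for (std::size_t i = num_blocks; i > 0; --i) free_.push_back(i - 1);
    }
    std::optional<std::size_t> allocate() {
        if (free_.empty()) return std::nullopt;  // cache is full
        std::size_t block = free_.back();
        free_.pop_back();
        return block;
    }
    void release(std::size_t block) { free_.push_back(block); }
    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<std::size_t> free_;
};

// Physical location of one token's K/V entries (hypothetical layout).
struct Slot {
    std::size_t block;
    std::size_t offset;
};

// Per-sequence block table: logical token index -> physical slot.
class SequenceBlockTable {
public:
    SequenceBlockTable(BlockAllocator& alloc, std::size_t block_size)
        : alloc_(alloc), block_size_(block_size) {
        assert(block_size_ > 0);
    }

    // Reserves a slot for one more token. A new physical block is requested
    // only when the last one is full, so blocks need not be contiguous.
    std::optional<Slot> append_token() {
        if (num_tokens_ % block_size_ == 0) {
            std::optional<std::size_t> block = alloc_.allocate();
            if (!block) return std::nullopt;
            blocks_.push_back(*block);
        }
        Slot slot{blocks_.back(), num_tokens_ % block_size_};
        ++num_tokens_;
        return slot;
    }

    // Translates a logical token index into its physical slot.
    Slot lookup(std::size_t token_index) const {
        assert(token_index < num_tokens_);
        return Slot{blocks_[token_index / block_size_], token_index % block_size_};
    }

    // Returns every block to the allocator, e.g. when the sequence finishes.
    void free_all() {
        for (std::size_t block : blocks_) alloc_.release(block);
        blocks_.clear();
        num_tokens_ = 0;
    }

private:
    BlockAllocator& alloc_;
    std::size_t block_size_;
    std::size_t num_tokens_ = 0;
    std::vector<std::size_t> blocks_;
};
```

With two sequences sharing one allocator, their blocks interleave in physical memory while each sequence still sees a contiguous logical range. An attention kernel would follow the block table to locate keys and values; that GPU side is omitted here.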

HOW THIS AFFECTS YOU

● Builder: You can use this as a reference implementation to understand or customize low-level components of the inference stack, particularly CUDA kernel design for attention and batching.

● Researcher: It provides a clean, annotated, from-scratch implementation of PagedAttention and online softmax, useful for understanding or modifying inference-time compute behavior.
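For readers unfamiliar with online softmax, the following is a minimal CPU sketch of the general technique, not code from the tiny-vllm repository. It keeps a running maximum and a running normalizer in a single pass over the scores, rescaling the normalizer whenever the maximum grows, and then normalizes in a second pass. The function name online_softmax is hypothetical, and the sketch assumes a non-empty input of finite values (no masked -inf scores).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper illustrating online softmax on the CPU.
// Assumes scores is non-empty and contains only finite values.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    assert(!scores.empty());

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;  // sum of exp(x - running_max) seen so far

    // Pass 1: track the maximum and the normalizer together.
    for (float x : scores) {
        const float new_max = std::max(running_max, x);
        // Rescale the old sum to the new maximum, then add the new term.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(x - new_max);
        running_max = new_max;
    }

    // Pass 2: normalize each score against the final maximum and sum.
    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - running_max) / running_sum;
    }
    return probs;
}
```

The rescaling step is what lets attention kernels process keys in tiles without first scanning all scores for the maximum. A fused kernel would typically also rescale a running weighted sum of values in the same loop, which this sketch leaves out.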

SOURCE

https://github.com/jmaczan/tiny-vllm
