● builder: You can use this as a practical reference for writing high-performance CUDA kernels targeting current-gen hardware, with worked examples on the ops that matter most for LLM inference.
● researcher: Covers Blackwell-specific memory spaces and execution units that aren't well documented elsewhere; useful for anyone optimizing at the MLSys layer.