[arXiv] score: 0.22
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
April 30, 2026
RaMP is a routing-aware MoE inference framework that selects CUDA kernel configurations at runtime from both the batch size and the live expert routing histogram, closing a 10-70% throughput gap left by static dispatch keyed on batch size alone. A four-parameter wave cost model, fitted in 10-24 minutes of one-time profiling, achieves 0.93% mean regret relative to exhaustive search over 134-268 polymorphic CuTe DSL configurations. The hardware-agnostic performance-region analysis correctly predicts the optimal configuration on all 8 tested GPU architectures, including 3 unseen. RaMP delivers a 1.22x kernel speedup, plus a 1.14x gain on Alpha-MoE that requires zero source modification. MoE serving teams running Mixtral-class or larger sparse models in production should prioritize this immediately.
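To make the idea of histogram-aware dispatch concrete, here is a minimal Python sketch. It is not RaMP's cost model: the summary does not give the functional form of the four-parameter model, so `predicted_cost`, its constants (`launch`, `per_wave_fixed`, `per_row`), the tile shapes in `CONFIGS`, and the SM count are all hypothetical stand-ins built on ordinary wave quantization. The sketch only shows why two batches of the same size can favor different kernel configurations, and how regret against the best configuration might be computed.

```python
import math

# Hypothetical (tile_m, tile_n) kernel configurations; not the paper's CuTe DSL set.
CONFIGS = [(16, 128), (64, 128), (128, 128)]


def predicted_cost(hist, tile_m, tile_n, n_cols=4096, num_sms=132,
                   launch=5.0, per_wave_fixed=2.0, per_row=0.05):
    """Illustrative wave-style cost in arbitrary time units (assumed form).

    hist[e] is the number of tokens routed to expert e. Each expert's tokens
    are padded up to a multiple of tile_m, so many lightly loaded experts
    waste work when the tile is large.
    """
    row_tiles = sum(math.ceil(t / tile_m) for t in hist if t > 0)
    tiles = row_tiles * math.ceil(n_cols / tile_n)
    waves = math.ceil(tiles / num_sms)           # tiles run in waves across SMs
    wave_time = per_wave_fixed + per_row * tile_m
    return launch + waves * wave_time


def select_config(hist):
    """Pick the configuration with the lowest predicted cost for this histogram."""
    return min(CONFIGS, key=lambda cfg: predicted_cost(hist, *cfg))


def regret(chosen_time, best_time):
    """Relative slowdown of the chosen configuration versus the best one."""
    return (chosen_time - best_time) / best_time


# Same batch size (512 routed tokens), different routing histograms.
spread = [8] * 64                  # tokens spread thinly over 64 experts
concentrated = [512] + [0] * 63    # all tokens on one expert

print(select_config(spread))        # (16, 128)
print(select_config(concentrated))  # (128, 128)
```

Under these made-up constants, a dispatcher keyed on batch size alone would have to use one configuration for both histograms, even though the predicted optimum differs between them. This is the kind of gap the summary attributes to static dispatch.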
cs.LG cs.AI cs.DC