Espresso Bypasses CoreML for 4.76x Faster Apple Neural Engine Inference
July 5, 2026
Espresso uses reverse-engineered private APIs to compile MIL (Model Intermediate Language) programs directly for the Neural Engine on Apple Silicon. It reaches 1.08 ms/token on a 6-layer model by using fused multi-layer kernels and zero-copy I/O, which avoids CoreML overhead and per-token recompilation.
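To put the two headline figures side by side, here is a small arithmetic sketch. It assumes the 4.76x figure compares per-token latency against CoreML on the same 6-layer model, which the summary implies but does not state outright; the variable names are illustrative only.

```python
# Figures taken from the summary above.
espresso_ms_per_token = 1.08
speedup_over_coreml = 4.76  # assumption: a ratio of per-token latencies

# Implied CoreML latency under that assumption.
implied_coreml_ms_per_token = espresso_ms_per_token * speedup_over_coreml

# Throughput equivalent of the reported latency.
espresso_tokens_per_second = 1000.0 / espresso_ms_per_token

print(f"Implied CoreML latency: {implied_coreml_ms_per_token:.2f} ms/token")
print(f"Espresso throughput: {espresso_tokens_per_second:.0f} tokens/s")
```

Under that reading, the CoreML baseline would sit near 5.14 ms/token, and 1.08 ms/token corresponds to roughly 926 tokens per second.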
HOW THIS AFFECTS YOU
● Builder: You can achieve significantly lower latency for on-device transformer inference using pure Swift.
● Researcher: The framework supports full training on ANE via forward and backward passes with gradient accumulation.
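For readers unfamiliar with gradient accumulation, the following is a minimal, framework-independent sketch of the general technique. It is not Espresso's API or implementation; the toy model, data, and function names are hypothetical and chosen only to show that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient before a single weight update.

```python
# Hypothetical toy problem: fit y = w * x with loss 0.5 * (w*x - y)^2.
def grad(w, batch):
    """Mean gradient of the loss with respect to w over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[0:2], data[2:4]]  # equal-sized micro-batches
w = 0.0
learning_rate = 0.01

# Run a backward pass per micro-batch, summing gradients instead of
# updating the weight after each one.
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += grad(w, micro_batch)
accumulated /= len(micro_batches)  # average over micro-batches

# Reference value: the gradient computed over the whole batch at once.
full_batch = grad(w, data)

# One optimizer step using the accumulated gradient.
w_new = w - learning_rate * accumulated
print(accumulated, full_batch, w_new)
```

The appeal of this pattern is that only one micro-batch needs to be resident at a time, which matters on hardware with tight memory or fixed kernel shapes, while the update matches what a larger batch would produce.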