[r/LocalLLaMA] score: 0.25
Needle: We Distilled Gemini Tool Calling Into a 26M Model
May 12, 2026
Needle is an open-source, 26M-parameter tool-calling model distilled from Gemini that reaches 6000 tok/s prefill and 1200 tok/s decode on consumer hardware. It reframes function calling as retrieval and assembly rather than reasoning, and uses cross-attention in place of heavy FFN layers. This matters for on-device agentic pipelines where massive LLMs are impractical.
new model
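
To make the "retrieval rather than reasoning" framing concrete, here is a toy sketch of the retrieval half only: query tokens cross-attend over a small tool catalog, and the tool with the most attention mass wins. This is not Needle's code. The tool names, the 4-d embeddings, and the `pick_tool` helper are all made up for illustration, and the post does not say how the real model selects tools or assembles arguments.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: every query row attends over all key rows."""
    scale = math.sqrt(len(keys[0]))
    weights = [softmax([dot(q, k) / scale for k in keys]) for q in queries]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Hypothetical tool catalog with hand-made embeddings (assumption, not from the post).
TOOLS = ["get_weather", "set_alarm", "send_message"]
TOOL_KEYS = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]
TOOL_VALUES = TOOL_KEYS  # values reuse the keys to keep the toy small

def pick_tool(query_tokens):
    # Sum each tool's attention weight across all query tokens, then take the max.
    _, weights = cross_attention(query_tokens, TOOL_KEYS, TOOL_VALUES)
    mass = [sum(row[i] for row in weights) for i in range(len(TOOLS))]
    return TOOLS[mass.index(max(mass))], mass

# Two made-up query token embeddings that lean toward the "alarm" direction.
query = [
    [0.1, 3.0, 0.2, 0.0],
    [0.0, 2.5, 0.1, 0.3],
]
tool, mass = pick_tool(query)
print(tool)  # set_alarm
```

The point of the toy is that tool selection here is a single attention lookup against the catalog, with no multi-step generation. A real model would learn the embeddings and would still need a separate step to fill in the call's arguments.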