● builder: Weights are live on Hugging Face for immediate testing; if scores hold under independent eval, this changes the cost calculus for deploying reasoning models at the edge.
● researcher: Worth investigating whether these scores reflect genuine reasoning gains or benchmark overfitting; the gap between 3B and frontier models on verifiable tasks is the key question.