● researcher: Worth watching because dollar-normalized eval reporting changes how thinking-model benchmarks compare across open and closed systems.
● founder: If you're shipping on open models, reporting eval performance at equivalent dollar inference budgets rather than equal token counts flatters your numbers in GPT/Claude comparisons.
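
To make the dollar-versus-token distinction concrete, here is a minimal sketch of the arithmetic. The prices and the budget are hypothetical placeholders, not real quotes for any model, and `tokens_for_budget` is an illustrative helper rather than part of any eval harness.

```python
def tokens_for_budget(budget_usd: float, usd_per_million_tokens: float) -> int:
    """Return how many output tokens a dollar budget buys at a given price."""
    if usd_per_million_tokens <= 0:
        raise ValueError("price must be positive")
    # round() guards against floating-point results like 4999.999999999999
    return round(budget_usd / usd_per_million_tokens * 1_000_000)


# Hypothetical prices in USD per million output tokens (illustration only).
PRICE_OPEN = 0.50
PRICE_CLOSED = 10.00

# Hypothetical per-problem inference budget in USD.
BUDGET_USD = 1.00

open_tokens = tokens_for_budget(BUDGET_USD, PRICE_OPEN)
closed_tokens = tokens_for_budget(BUDGET_USD, PRICE_CLOSED)

# At equal dollars, the cheaper model gets a proportionally larger
# thinking budget; at equal tokens, both would get the same budget.
token_ratio = open_tokens / closed_tokens

print(f"open model:   {open_tokens:,} tokens")
print(f"closed model: {closed_tokens:,} tokens")
print(f"ratio:        {token_ratio:.0f}x")
```

Under these assumed prices, the cheaper model thinks for many more tokens per problem at the same spend, which is why a dollar-normalized comparison can look better for it than a token-matched one.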