●builder: If you're selecting models based on static benchmark scores, you may be optimizing for training-data overlap rather than actual task performance in production.
●researcher: Worth watching because conflating retrieval performance with intelligence inflates perceived progress on standard evals; dynamic or held-out benchmarks are needed for valid capability measurement.