● researcher: Worth watching as a potential alternative to synthetic evals. Grounding pre-release behavior prediction in real conversation distributions could improve benchmark-to-deployment correlation.
● policy: This changes how pre-deployment safety evaluation can be framed, basing it on empirical usage data rather than constructed red-team scenarios.