● builder: If you're shipping personalization features, this suggests your synthetic-data eval pipeline is likely overestimating real-world quality at every stage.
● researcher: The 550-conversation, 18,969-judgment dataset provides a human-grounded evaluation framework that exposes failure modes invisible in synthetic benchmarks.