● builder: If you're using LLM judge panels for automated eval pipelines, this finding means you're likely overestimating reliability; diversifying by model family alone is insufficient.
● researcher: Judge errors are correlated across model families, which collapses the reliability of multi-judge panels; evaluation frameworks that assume independent judges need to be redesigned.
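
To make the independence point concrete, here is a minimal sketch of how correlation shrinks a panel's effective size. It assumes every pair of judges has the same error correlation `rho` and every judge has the same variance, which is a simplification and not something the finding itself states. The function name `effective_judges` and the sample values are hypothetical.

```python
def effective_judges(n_judges: int, rho: float) -> float:
    """Effective number of independent judges in an equally correlated panel.

    Assumption (illustrative only): all judges have equal variance and every
    pair shares the same correlation rho. Under that assumption the variance
    of the panel mean is sigma^2 * (1 + (n - 1) * rho) / n, so the panel
    behaves like n / (1 + (n - 1) * rho) independent judges.
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1] for this sketch")
    return n_judges / (1 + (n_judges - 1) * rho)


if __name__ == "__main__":
    # Hypothetical correlation levels, chosen only to show the trend.
    for rho in (0.0, 0.5, 0.8):
        print(f"rho={rho:.1f}: 5 judges behave like "
              f"{effective_judges(5, rho):.2f} independent judges")
```

Under these assumptions, a five-judge panel with pairwise correlation 0.5 carries about as much information as 1.67 independent judges, and at 0.8 about 1.19. Adding more judges helps little unless their errors are less correlated.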