● Researcher: You should exercise caution when using LLM-as-a-judge for evaluating models in non-English or low-resource languages.
● Policy: This highlights the need for standardized, human-validated benchmarks for global AI safety and performance monitoring.