Multimodal STEM HLE++: 1,100 PhD-Level Problems, ~20% Pass@1 on SOTA Models
June 1, 2026
Multimodal STEM HLE++ is a new benchmark of 1,100 PhD-level multimodal STEM problems, authored by PhD specialists, that require joint image-text reasoning and have deterministic ground-truth answers. Top frontier models, including Claude Opus 4, score around 20% pass@1, and leading labs already use the benchmark for SOTA model evaluation.
HOW THIS AFFECTS YOU
● Builder: You can use the 50-task public sample on HuggingFace now to stress-test multimodal reasoning in your pipelines against a benchmark that still differentiates frontier models.
● Researcher: With MMLU saturated and HLE approaching saturation, this benchmark provides a meaningful signal for frontier model evaluation, along with real utility for RL training.
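Because the answers are deterministic, pass@1 reduces to the fraction of tasks where a model's single attempt matches the ground truth. The following is a minimal sketch of that computation, not the benchmark's official grader: the `Task` fields, the `normalize` rule (case and whitespace folding), and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical schema; the real dataset's field names may differ.
    task_id: str
    answer: str  # deterministic ground-truth answer


def normalize(text: str) -> str:
    """Fold case and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.strip().lower().split())


def pass_at_1(tasks: list[Task], predictions: dict[str, str]) -> float:
    """Fraction of tasks whose single prediction matches the ground truth.

    A task with no prediction counts as a failure.
    """
    if not tasks:
        raise ValueError("pass@1 is undefined for an empty task list")
    correct = sum(
        normalize(predictions.get(t.task_id, "")) == normalize(t.answer)
        for t in tasks
    )
    return correct / len(tasks)


# Made-up records, not drawn from the benchmark.
tasks = [
    Task("t1", "42"),
    Task("t2", "3.14"),
    Task("t3", "Fe2O3"),
    Task("t4", "x = 2"),
    Task("t5", "0"),
]
predictions = {"t1": "42", "t2": "2.71", "t3": "FeO", "t4": "x = 3", "t5": "1"}

score = pass_at_1(tasks, predictions)
print(f"pass@1 = {score:.0%}")  # one of five correct: pass@1 = 20%
```

Exact match after light normalization is the simplest grading choice; numeric tolerances or symbolic equivalence checks would need a stricter, domain-specific comparison.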