
Multimodal STEM HLE++: 1,100 PhD-Level Problems, ~20% Pass@1 on SOTA Models

June 1, 2026

Multimodal STEM HLE++ is a new benchmark of 1,100 PhD-level multimodal STEM problems, authored by PhD specialists, that require joint image-text reasoning and have deterministic ground-truth answers. Top frontier models, including Claude Opus 4, score around 20% pass@1, and leading labs already use the benchmark for SOTA model evaluation.

HOW THIS AFFECTS YOU

● Builder: You can use the 50-task public sample on HuggingFace now to stress-test multimodal reasoning in your pipelines against a benchmark that still differentiates frontier models.

● Researcher: With MMLU saturated and HLE approaching saturation, this benchmark still provides meaningful signal for frontier model evaluation and has practical utility for RL training.
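For readers who want to reproduce a pass@1 number on the public sample, the sketch below shows one common way to compute it when answers are deterministic. It is only an illustration: the function names `exact_match` and `pass_at_1` are hypothetical, the normalization rule (trim whitespace, ignore case) is an assumption, and the benchmark's own grading procedure may differ.

```python
from typing import Sequence


def exact_match(predicted: str, truth: str) -> bool:
    """Assumed grading rule: answers match after trimming and lowercasing."""
    return predicted.strip().lower() == truth.strip().lower()


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Estimate pass@1 from per-problem attempt outcomes.

    results[i] holds one boolean per sampled attempt on problem i.
    For k = 1 the standard unbiased pass@k estimator reduces to the
    fraction of correct attempts per problem, averaged over problems.
    """
    if not results:
        raise ValueError("no problems given")
    per_problem = []
    for attempts in results:
        if not attempts:
            raise ValueError("each problem needs at least one attempt")
        per_problem.append(sum(attempts) / len(attempts))
    return sum(per_problem) / len(per_problem)


if __name__ == "__main__":
    # Made-up toy data: five problems, one attempt each, one correct.
    graded = [
        [exact_match(" 42 ", "42")],
        [exact_match("3.14", "2.71")],
        [exact_match("B", "C")],
        [exact_match("x^2", "x^3")],
        [exact_match("12 m/s", "15 m/s")],
    ]
    print(f"pass@1 = {pass_at_1(graded):.2f}")
```

With a single attempt per problem, pass@1 is simply the share of problems answered correctly on the first try, so a score near 20% corresponds to roughly one problem in five.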

SOURCE

https://x.com/turingcom/status/2061495021338280233#m
