Worst-Dimension Optimization Targets Weakest Reasoning Step in Multimodal PRMs
June 9, 2026
Current Process Reward Models (PRMs) for multimodal reasoning average scores across dimensions such as visual grounding and logical consistency, so strong dimensions can mask failures in weak ones. This paper proposes optimizing the worst-performing dimension at each reasoning step rather than the mean, with the aim of improving overall reasoning validity.
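To make the masking effect concrete, here is a minimal sketch contrasting mean aggregation with worst-dimension aggregation for a single reasoning step. The dimension names, the score values, and the helper functions `mean_score` and `worst_dimension` are illustrative assumptions, not the paper's implementation or data.

```python
from statistics import mean

# Hypothetical per-step scores in [0, 1]; the values are made up for illustration.
steps = [
    {"visual_grounding": 0.9, "logical_consistency": 0.8},
    {"visual_grounding": 0.2, "logical_consistency": 0.9},  # weak grounding
]


def mean_score(step):
    """Mean aggregation: a strong dimension can offset a weak one."""
    return mean(step.values())


def worst_dimension(step):
    """Worst-dimension aggregation: return the lowest-scoring dimension and its score."""
    name = min(step, key=step.get)
    return name, step[name]


for i, step in enumerate(steps):
    name, score = worst_dimension(step)
    print(f"step {i}: mean={mean_score(step):.2f}, worst={name} ({score:.2f})")
```

In the second step the mean stays above 0.5 even though visual grounding is poor, while the worst-dimension score surfaces the failure directly. A training objective built on the minimum would, under these assumptions, direct the learning signal at that weakest dimension.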
HOW THIS AFFECTS YOU
● Researcher: The worst-dimension objective is a drop-in modification to PRM training that could improve robustness on multimodal benchmarks where aggregate scores hide specific failure modes.