[arXiv] score: 0.24

EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness

May 5, 2026

EmoMM introduces a benchmark and an inference-time fix for Multimodal Large Language Models (MLLMs) that struggle with multimodal emotion recognition (MER) when modalities conflict or are missing. The authors identify Video Contribution Collapse, in which video tokens are marginalized because of redundancy, and counter it with CHASE, a training-free attention steering mechanism that operates at the level of individual attention heads. The work is relevant to MER researchers and to teams deploying audio-visual sentiment systems: CHASE improves robustness without retraining the backbone and outperforms prior modality-fusion baselines on the conflict and missingness subsets.
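To make "head-level attention steering" concrete, here is a minimal sketch of one generic way to do it at inference time. It is not the CHASE algorithm from the paper: the function name `steer_heads`, the scaling factor `alpha`, and the rule of multiplying then renormalizing are all illustrative assumptions. The sketch takes post-softmax attention weights, boosts the weight placed on video-token positions in a chosen subset of heads, and renormalizes each row so it remains a probability distribution. No model parameters are changed.

```python
import numpy as np

def steer_heads(attn, video_mask, heads, alpha=2.0):
    """Upweight attention on video tokens in selected heads (illustrative only).

    attn:       array of shape (num_heads, num_queries, num_keys), rows sum to 1
    video_mask: boolean array of shape (num_keys,), True where the key is a video token
    heads:      list of head indices to steer (hypothetical selection; the paper's
                criterion for choosing heads is not reproduced here)
    alpha:      positive scaling factor applied to video-token attention (assumption)
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    attn = np.asarray(attn, dtype=float)
    video_mask = np.asarray(video_mask, dtype=bool)

    steered = attn.copy()
    # Per-key multiplier: alpha on video keys, 1 elsewhere.
    scale = np.where(video_mask, alpha, 1.0)
    # Rescale only the selected heads, then renormalize each attention row.
    sub = steered[heads] * scale
    sub /= sub.sum(axis=-1, keepdims=True)
    steered[heads] = sub
    return steered

# Toy input: 2 heads, 1 query, 4 keys (first two keys are video tokens).
attn = np.full((2, 1, 4), 0.25)
video_mask = np.array([True, True, False, False])
out = steer_heads(attn, video_mask, heads=[0], alpha=3.0)
print(out[0, 0])  # steered head
print(out[1, 0])  # untouched head
```

Multiplying post-softmax weights by `alpha` and renormalizing is equivalent to adding `log(alpha)` to the corresponding pre-softmax logits, so the same effect could be applied as a logit bias inside an attention implementation.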

cs.CV, cs.AI

SOURCE

https://arxiv.org/abs/2605.01024
