[arXiv] score: 0.12
EvA Dual-Path Architecture Tops Open-Source Audio Perception Benchmarks
May 29, 2026
A dual-path architecture called EvA addresses the evidence bottleneck in large audio language models by using hierarchical aggregation and non-compressive, time-aligned fusion to preserve acoustic evidence before reasoning. Trained on 54K event-ordered captions and 500K evidence-grounded QA pairs, it achieves the best open-source perception scores on MMAU, MMAR, and MMSU under zero-shot conditions.
cs.SD, cs.AI
HOW THIS AFFECTS YOU
● builder: The EvA-Perception dataset and architecture are open-source reference points if you are building audio understanding pipelines that need reliable acoustic event grounding.
● researcher: The evidence bottleneck framing and dual-path fusion method offer a concrete architectural direction for improving perception-limited audio-language models.