[arXiv] score: 0.12

EvA Dual-Path Architecture Tops Open-Source Audio Perception Benchmarks

May 29, 2026

EvA is a dual-path architecture that addresses the evidence bottleneck in large audio language models: it uses hierarchical aggregation and non-compressive, time-aligned fusion to preserve acoustic evidence before reasoning. Trained on 54K event-ordered captions and 500K evidence-grounded QA pairs, it achieves the best perception scores among open-source models on MMAU, MMAR, and MMSU under zero-shot conditions.
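To make the two ideas in the summary concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes that "hierarchical aggregation" means a weighted combination of features from several encoder layers, and that "non-compressive time-aligned fusion" means joining two feature paths on the finer time grid without dropping frames. The function names, shapes, and weights are all hypothetical.

```python
from typing import List

Frame = List[float]      # one feature vector for one time step
Sequence = List[Frame]   # features over time


def hierarchical_aggregate(layers: List[Sequence], weights: List[float]) -> Sequence:
    """Normalized weighted sum of same-shaped sequences from several layers.

    Assumption: this stands in for "hierarchical aggregation".
    """
    if len(layers) != len(weights):
        raise ValueError("one weight per layer is required")
    total = sum(weights)
    num_frames, dim = len(layers[0]), len(layers[0][0])
    out = [[0.0] * dim for _ in range(num_frames)]
    for seq, w in zip(layers, weights):
        for t in range(num_frames):
            for d in range(dim):
                out[t][d] += (w / total) * seq[t][d]
    return out


def time_aligned_fuse(fine: Sequence, coarse: Sequence) -> Sequence:
    """Fuse two paths on the finer time grid, keeping every fine frame.

    Assumption: this stands in for "non-compressive time-aligned fusion".
    """
    if not coarse or len(fine) % len(coarse) != 0:
        raise ValueError("fine length must be a positive multiple of coarse length")
    ratio = len(fine) // len(coarse)
    # Repeat each coarse frame so both paths share the fine time axis,
    # then concatenate the two feature vectors frame by frame.
    return [fine[t] + coarse[t // ratio] for t in range(len(fine))]


if __name__ == "__main__":
    # Two hypothetical encoder layers, 4 frames each, 2 features per frame.
    layers = [[[1.0, 2.0]] * 4, [[3.0, 4.0]] * 4]
    fine = hierarchical_aggregate(layers, [1.0, 1.0])
    # A second, coarser path with 2 frames and 1 feature per frame.
    fused = time_aligned_fuse(fine, [[10.0], [20.0]])
    print(len(fused), fused[0], fused[-1])
```

In this sketch the fused sequence has as many frames as the finer path, so no temporal detail is pooled away before a downstream model reads it. Whether EvA aligns or combines its paths this way is not stated in the summary.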

cs.SD, cs.AI

HOW THIS AFFECTS YOU

● builder: The EvA-Perception dataset and architecture are open-source reference points if you are building audio understanding pipelines that need reliable acoustic event grounding.

● researcher: The evidence bottleneck framing and dual-path fusion method offer a concrete architectural direction for improving perception-limited audio-language models.

SOURCE

https://arxiv.org/abs/2603.27667
