[HUGGINGFACE] score: 0.42

OMTG Benchmark Exposes MLLMs' Failure to Localize Multiple Video Segments per Query

June 3, 2026

One-to-Many Temporal Grounding (OMTG) formalizes the task of localizing multiple disjoint video segments for a single query. It introduces two metrics, Count Accuracy and Effective Temporal F1, alongside a 56K-sample dataset. State-of-the-art MLLMs tuned for single-segment retrieval score near zero on this benchmark because they lack event cardinality perception: a sense of how many separate segments match the query.
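
To make the one-to-many setting concrete, here is a minimal sketch of how a multi-segment prediction might be scored. It does not reproduce the benchmark's metric definitions, which this summary does not give. The function names (`temporal_iou`, `count_matches`, `segment_f1`), the greedy matching, and the 0.5 IoU threshold are all assumptions chosen for illustration: `count_matches` is a guess at the idea behind Count Accuracy, and `segment_f1` is a generic IoU-thresholded F1, not Effective Temporal F1.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed count check: did the model predict the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Generic F1 over segments, using greedy one-to-one matching by IoU.

    Illustrative only; the threshold and matching rule are assumptions.
    """
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    true_positives = 0
    for p in pred:
        if not unmatched:
            break
        # Pick the remaining ground-truth segment that overlaps p the most.
        best = max(unmatched, key=lambda g: temporal_iou(p, g))
        if temporal_iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query whose event occurs three times, answered with a single segment.
ground_truth = [(5.0, 10.0), (30.0, 40.0), (70.0, 75.0)]
single_segment_answer = [(5.0, 10.0)]

print(count_matches(single_segment_answer, ground_truth))
print(segment_f1(single_segment_answer, ground_truth))
```

In this toy case the single-segment answer is perfectly precise but misses two of the three occurrences, so it fails the count check and loses recall. A metric that also conditions on the count being right would penalize it further.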

HOW THIS AFFECTS YOU

● builder: If you're building video search or highlight extraction tools, current MLLMs are demonstrably unfit for multi-event queries, and this benchmark quantifies the gap.

● researcher: The new metrics and dataset provide a rigorous evaluation surface for temporal reasoning in video MLLMs that existing benchmarks completely miss.

read original ↗huggingface.co
