[HUGGINGFACE] score: 0.48

PEEK Distills Frame Relevance Rankings Into Lightweight Video Sampling Model

May 28, 2026

PEEK trains a lightweight temporal model to predict caption-conditioned frame relevance by distilling rankings from a stronger teacher, replacing uniform sampling in video captioning pipelines. It outperforms state-of-the-art adaptive sampling methods on ActivityNet Captions and MSR-VTT while reducing compute.
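As a rough illustration of what "distilling rankings" can look like, here is a minimal sketch of a pairwise ranking loss that penalizes a student for ordering frame pairs differently from a teacher. The summary does not say which loss PEEK uses, so the logistic pairwise form, the function name `pairwise_rank_distill_loss`, and the score arrays are all assumptions made for illustration.

```python
import numpy as np

def pairwise_rank_distill_loss(student_scores, teacher_scores):
    """Logistic pairwise ranking loss (an assumed stand-in, not PEEK's
    confirmed objective): low when the student orders frame pairs the
    same way the teacher does."""
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    # diff[i, j] = score[i] - score[j] for every pair of frames
    s_diff = s[:, None] - s[None, :]
    t_diff = t[:, None] - t[None, :]
    # Keep only pairs where the teacher strictly prefers frame i over frame j
    mask = t_diff > 0
    if not mask.any():
        return 0.0  # teacher expresses no preference, nothing to distill
    # log(1 + exp(-s_diff)) shrinks as the student agrees more strongly
    return float(np.logaddexp(0.0, -s_diff[mask]).mean())

# Hypothetical relevance scores for three frames of one clip
teacher = [0.1, 0.9, 0.5]
agreeing_student = [0.0, 2.0, 1.0]
reversed_student = [2.0, 0.0, 1.0]
print(pairwise_rank_distill_loss(agreeing_student, teacher))
print(pairwise_rank_distill_loss(reversed_student, teacher))
```

Only the ordering the teacher induces matters here, not its absolute score values, which is one reason a ranking-style objective is a natural fit when the teacher and student score on different scales.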

paper

HOW THIS AFFECTS YOU

● builder: You can swap uniform frame sampling for PEEK in video captioning pipelines to improve quality without the compute cost of existing adaptive methods.

● researcher: The distillation approach for temporal relevance ranking is applicable beyond captioning to any video-language task bottlenecked by frame selection.
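For builders, the swap from uniform sampling to relevance-based sampling might look roughly like the sketch below. It assumes per-frame relevance scores are already available from some scoring model and uses a plain top-k selection rule; the summary does not describe PEEK's actual interface or selection rule, so both function names and the scores are hypothetical.

```python
import numpy as np

def uniform_indices(num_frames, k):
    """Baseline: k evenly spaced frame indices across the clip."""
    return np.linspace(0, num_frames - 1, num=k).round().astype(int)

def topk_relevance_indices(scores, k):
    """Assumed selection rule: keep the k highest-scoring frames,
    returned in temporal order so the captioner sees them in sequence."""
    scores = np.asarray(scores, dtype=float)
    k = max(0, min(k, scores.size))
    # Stable sort on negated scores puts the highest scores first
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)

# Hypothetical relevance scores for a 6-frame clip
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
print(topk_relevance_indices(scores, 3))
print(uniform_indices(9, 3))
```

Either function returns indices into the decoded frame list, so the rest of a captioning pipeline would not need to change under this assumption.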

SOURCE

https://huggingface.co/papers/2605.31029
