[HUGGINGFACE] score: 0.89

Gemini Embedding 2 Achieves SOTA Multimodal Retrieval Across Video, Audio, Image, Text

May 25, 2026

Gemini Embedding 2 embeds interleaved video, audio, image, and text in a single unified space using multi-task contrastive training. Reported scores: 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on VATEX, 69.9 on MTEB multilingual, and 84.0 on MTEB Code.
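For readers unfamiliar with the R@1 metric quoted above, here is a minimal sketch of how recall@k is commonly computed for cross-modal retrieval in a shared embedding space. The vectors are hand-written toy values, not outputs of Gemini Embedding 2, and the names `cosine` and `recall_at_k` are illustrative. The paper's exact evaluation protocol may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_vecs, item_vecs, gold, k=1):
    """Fraction of queries whose gold item is among the k most similar items."""
    hits = 0
    for query, gold_idx in zip(query_vecs, gold):
        # Rank all item indices by similarity to the query, best first.
        ranked = sorted(
            range(len(item_vecs)),
            key=lambda i: cosine(query, item_vecs[i]),
            reverse=True,
        )
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy stand-ins: three "text" queries and three "image" items that are
# assumed to live in the same 3-dimensional embedding space.
text_queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.6, 0.7, 0.0]]
image_items = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]
gold = [0, 1, 2]  # gold[i] is the index of the correct item for query i

r_at_1 = recall_at_k(text_queries, image_items, gold, k=1)
r_at_3 = recall_at_k(text_queries, image_items, gold, k=3)
print(f"R@1 = {r_at_1:.3f}, R@3 = {r_at_3:.3f}")
```

In this toy setup the third query is deliberately closer to the wrong items, so R@1 comes out below 1 while R@3 reaches it. Benchmark figures such as 62.9 R@1 are this fraction expressed as a percentage.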

paper

HOW THIS AFFECTS YOU

● builder: You can now use a single embedding API for cross-modal retrieval across video, audio, image, and text without maintaining separate embedding pipelines per modality.

● researcher: State-of-the-art results across unimodal, cross-modal, and multimodal retrieval benchmarks set a new baseline for evaluating unified embedding models.

● founder: Worth watching because a unified multimodal embedding model from Google compresses what previously required multiple specialized models into one API, shifting the build-vs-buy calculus.

SOURCE

https://huggingface.co/papers/2605.27295
