[HUGGINGFACE] score: 0.89
Gemini Embedding 2 Achieves SOTA Multimodal Retrieval Across Video, Audio, Image, Text
May 25, 2026
Gemini Embedding 2 embeds interleaved video, audio, image, and text in a unified space using multi-task contrastive training, scoring 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on VATEX, 69.9 on MTEB multilingual, and 84.0 on MTEB Code.
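For readers unfamiliar with the two named metrics, here is a minimal sketch of how R@1 and NDCG@10 are conventionally computed. The function names and the toy rankings are illustrative assumptions, not taken from the paper or its evaluation code, and the reported figures appear to be on a 0-100 scale while these functions return values in [0, 1].

```python
import math

def recall_at_1(ranked_lists, relevant_sets):
    """Fraction of queries whose top-ranked item is a relevant one."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if ranked and ranked[0] in relevant
    )
    return hits / len(ranked_lists)

def ndcg_at_k(ranked, gains, k=10):
    """NDCG@k for one query; `gains` maps item id -> graded relevance."""
    # Discounted gain of the ranking actually returned.
    dcg = sum(
        gains.get(item, 0.0) / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
    )
    # Discounted gain of the best possible ordering of the relevant items.
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (hypothetical): two queries, one answered correctly at rank 1.
r_at_1 = recall_at_1([["a", "b", "c"], ["d", "e", "f"]], [{"a"}, {"e"}])

# Toy data (hypothetical): the only relevant item lands at rank 2.
ndcg = ndcg_at_k(["b", "a", "c"], {"a": 1.0}, k=10)
```

R@1 only rewards a correct top hit, while NDCG@10 gives partial credit for relevant items lower in the first ten results, which is why the two numbers are not directly comparable across benchmarks.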
HOW THIS AFFECTS YOU
● Builder: You can now use a single embedding API for cross-modal retrieval across video, audio, image, and text without maintaining separate embedding pipelines per modality.
● Researcher: State-of-the-art numbers across unimodal, cross-modal, and multimodal retrieval benchmarks set a new baseline for unified embedding model evaluation.
● Founder: Worth watching because a unified multimodal embedding model from Google compresses what previously required multiple specialized models into one API, shifting the build-vs-buy calculus.
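To make the single-pipeline point concrete, here is a rough sketch of retrieval over one index that mixes modalities. It assumes every item has already been embedded into the same vector space by some model; the vectors, ids, and the `search` helper are hypothetical placeholders, and no real Gemini API call is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=2):
    """Rank every indexed item, regardless of modality, by similarity."""
    scored = [
        (cosine(query_vec, item["vec"]), item["id"], item["modality"])
        for item in index
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]

# Hypothetical, hand-written 3-d vectors standing in for real embeddings.
index = [
    {"id": "clip_01", "modality": "video", "vec": [0.9, 0.1, 0.0]},
    {"id": "photo_07", "modality": "image", "vec": [0.8, 0.3, 0.1]},
    {"id": "podcast_3", "modality": "audio", "vec": [0.0, 0.2, 0.9]},
    {"id": "doc_12", "modality": "text", "vec": [0.1, 0.9, 0.2]},
]

# Hypothetical embedding of a text query; the nearest items are a video and an image.
query = [1.0, 0.2, 0.0]
results = search(query, index)
for score, item_id, modality in results:
    print(f"{item_id} ({modality}): {score:.3f}")
```

Because all items share one space, a single similarity function and a single index serve every modality pair; a production system would typically swap the linear scan for an approximate nearest-neighbor index.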