Major multimodal Google DeepMind

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain

Published
May 26, 2026 — 17:07 UTC

Problem
This paper presents Gemini Embedding 2, addressing the gap in existing multimodal embedding models that typically struggle to unify video, audio, image, and text modalities into a single representation space. The work is a preprint and has not yet undergone peer review, indicating that the findings should be interpreted with caution. The authors aim to enhance the generalization capabilities of embeddings across diverse tasks, which is a critical challenge in the field of multimodal machine learning.

Method
The core technical contribution of Gemini Embedding 2 is its architecture, which integrates a unified representation space for multiple modalities. The model employs large-scale contrastive learning within a multi-task, multi-stage training framework. Specifics regarding the architecture are not disclosed, but the training process involves leveraging interleaved inputs from video, audio, image, and text. The authors utilize a diverse dataset to ensure robust training, although exact compute resources are not specified. The model’s design allows it to produce embeddings that are effective for both unimodal and cross-modal retrieval tasks.

Results
Gemini Embedding 2 achieves state-of-the-art performance on several key benchmarks. Notable results include a Recall@1 (R@1) score of 62.9 on the MSCOCO dataset, an NDCG@10 score of 68.8 on Vatex, and a score of 69.9 on the MTEB multilingual benchmark. Additionally, it scores 84.0 on the MTEB Code benchmark. These results demonstrate significant improvements over specialized models, indicating that Gemini Embedding 2 can effectively handle a wide range of tasks, including retrieval across different modalities.

Limitations
The authors acknowledge that while Gemini Embedding 2 shows strong performance, there may be limitations in its ability to handle highly specialized tasks that require deep domain knowledge. They do not discuss potential issues related to the scalability of the model or the computational efficiency during inference. Furthermore, the reliance on large-scale datasets for training may limit its applicability in scenarios with limited data availability.

Why it matters
The implications of Gemini Embedding 2 are substantial for downstream applications such as retrieval-augmented generation (RAG), recommendation systems, and search functionalities. Its robust zero-shot performance across diverse fields—including astronomy, bioscience, fine arts, and culinary arts—positions it as a versatile tool for various applications, potentially reducing the need for task-specific models. This work could pave the way for more integrated approaches in multimodal machine learning, encouraging further exploration into unified embedding strategies.

Authors: Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka et al.
Source: arXiv:2605.27295
URL: https://arxiv.org/abs/2605.27295v1

Turing Wire

By Turing Wire editorial staff · May 26, 2026 · Editorial standards →

Source: arXiv cs.CV