Gemini Embedding 2
Gemini Embedding 2 maps text, images, video, audio, and documents into one embedding space for multimodal retrieval and classification.
What is Gemini Embedding 2?
Gemini Embedding 2 is Google’s first fully multimodal embedding model built on the Gemini architecture. It maps text, images, video, audio, and documents into a single embedding space, enabling retrieval and classification workflows across multiple media types.
The model is designed to handle semantics across more than 100 languages and can simplify multimodal pipelines by producing a single kind of vector representation regardless of the input medium.
Key Features
- Fully multimodal input coverage (text, images, video, audio, documents): Produces embeddings for multiple media types so applications can search and classify mixed-content data.
- Single, unified embedding space: Text, images, video, audio, and documents are embedded into the same space to support multimodal retrieval and analysis.
- Interleaved multimodal understanding in one request: Accepts multiple modalities together (for example, image + text) to capture relationships between different media.
- High-capacity modality limits: Supports up to 8192 input tokens for text, up to 6 images per request (PNG/JPEG), up to 120 seconds of video (MP4/MOV), and native audio embedding without intermediate transcription.
- Document embeddings from PDFs: Directly embed PDFs up to 6 pages rather than converting content to another format first.
- Flexible embedding output dimensions via Matryoshka Representation Learning (MRL): Embeddings can be scaled down from the default 3072 dimensions; Google recommends 3072, 1536, or 768 for the highest quality (see the truncation sketch after this list).
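Because MRL orders information so that the leading dimensions carry the most signal, a common pattern is to truncate a full-size vector and re-normalize it. The sketch below is a minimal NumPy illustration; it assumes you already have a full 3072-dimensional embedding, and it is not the only option, since the API can also return reduced dimensions directly.

```python
import numpy as np

def truncate_embedding(values, dim=768):
    """Keep the first `dim` MRL dimensions and re-normalize to unit length."""
    v = np.asarray(values, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# `full_embedding` would be a 3072-dimensional vector returned by the API:
# small = truncate_embedding(full_embedding, dim=768)
```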
How to Use Gemini Embedding 2
Gemini Embedding 2 is available in public preview through the Gemini API and Vertex AI. To get started, open the interactive Colab notebooks Google provides for the Gemini API and Vertex AI, then generate embeddings for your own inputs.
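For illustration, here is a minimal sketch using the google-genai Python SDK. The model identifier "gemini-embedding-2" is a placeholder assumption: use the model name given in the preview documentation or the Colab notebooks. The client reads an API key from the environment (for example GEMINI_API_KEY).

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

result = client.models.embed_content(
    model="gemini-embedding-2",  # placeholder: use the actual preview model id
    contents="A short passage to embed for semantic search.",
    config=types.EmbedContentConfig(output_dimensionality=1536),
)

vector = result.embeddings[0].values  # list of floats, 1536 values here
print(len(vector))
```

The same SDK can target Vertex AI instead of the Gemini API by constructing the client with `vertexai=True` plus a project and location.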
For quick experimentation, Google also provides a lightweight multimodal semantic search demo where you can test how the embeddings work for retrieval-style tasks.
Use Cases
- Multimodal semantic search: Retrieve relevant items when users mix query modalities (for example, searching with text against an index that contains images, audio, or documents); see the retrieval sketch after this list.
- Retrieval-Augmented Generation (RAG) across media: Use embeddings to fetch context from heterogeneous sources (documents plus media) and feed the retrieved content into downstream generation workflows.
- Sentiment analysis on mixed content: Embed mixed inputs to support sentiment classification or clustering pipelines where a single item may combine text with images or other modalities.
- Data clustering for heterogeneous datasets: Create a unified representation across media types to group related items even when they come from different formats.
- Document + media understanding for analytics: Embed PDFs (up to 6 pages) and combine them with other modalities in one embedding pipeline to support downstream search and classification.
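As a toy illustration of the retrieval use cases above, the sketch below ranks precomputed embeddings by cosine similarity. It assumes the item embeddings (for images, audio clips, PDFs) and the query embedding were already produced by the model; a real deployment would typically use a vector database rather than an in-memory list.

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, index, top_k=3):
    """index: list of (item_id, embedding) pairs covering mixed media."""
    scored = [(item_id, cosine_similarity(query_embedding, emb)) for item_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical index of mixed-media items embedded into the shared space:
# index = [("diagram.png", emb1), ("meeting.mp3", emb2), ("spec.pdf", emb3)]
# results = search(text_query_embedding, index)
```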
FAQ
Is Gemini Embedding 2 only for text?
No. It is designed as a fully multimodal embedding model that maps text, images, video, audio, and documents into a single embedding space.
What platforms are supported for the public preview?
Google states Gemini Embedding 2 is available in public preview via the Gemini API and Vertex AI.
What input sizes does the model support?
Per-modality limits include 8192 input tokens for text, up to 6 images per request, up to 120 seconds of video (MP4/MOV), and up to 6 pages for PDFs. Audio is ingested natively for embedding, without intermediate transcription.
Can I send multiple modalities together?
Yes. The model natively understands interleaved input, so you can pass multiple modalities (for example, image + text) in a single request.
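The exact request shape for interleaved input is not spelled out here, so the sketch below is only an assumption: it reuses the parts-based payload the google-genai SDK uses elsewhere and a placeholder model id. Treat the preview documentation and notebooks as the authority on the actual format.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Assumption: the preview embedding endpoint accepts interleaved parts (image + text);
# check the official notebooks for the exact request shape.
with open("product_photo.jpg", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

text_part = types.Part.from_text(text="Red trail-running shoe, size 42")

result = client.models.embed_content(
    model="gemini-embedding-2",  # placeholder model id
    contents=[types.Content(parts=[image_part, text_part])],
)
print(len(result.embeddings[0].values))
```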
Can the embedding dimensionality be changed?
Yes. Gemini Embedding 2 uses Matryoshka Representation Learning (MRL) to scale down from the default 3072 dimensions, with Google recommending 3072, 1536, and 768 for highest quality.
Alternatives
- Text-only embedding models: If your application uses only text, a text-only embedding model can be simpler; however, it won’t natively embed images, video, audio, or documents into the same space.
- Separate embeddings per modality: Some workflows use different embedding models for each modality and then combine results at retrieval time; this can be more complex than a single unified multimodal embedding space.
- Other multimodal embedding approaches: Competing multimodal models can also produce embeddings for several media types, but Gemini Embedding 2 specifically emphasizes a single unified embedding space and interleaved multimodal requests.
- Index-and-retrieve pipelines using embedding providers: If you already have an embedding-based vector search setup, you can consider swapping in a multimodal embedding provider/model; the key difference is whether the model supports fully multimodal unified embeddings.