Multimodal Embeddings Explained for Product Managers

Why Text Embeddings Alone Are Not Enough

Text embeddings assign a vector to each piece of text so that semantically similar text ends up with similar vectors. The sentence "I want a pair of running shoes" and the product description "lightweight athletic footwear for distance running" end up close together in vector space, even without any shared keywords. This is how semantic search works and why it outperforms keyword matching for most user queries.
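To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for embeddings. The 4-dimensional vectors are made-up stand-ins (real embeddings have hundreds of dimensions and come from a model), and the variable names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real model output.
query = [0.9, 0.1, 0.0, 0.3]            # "I want a pair of running shoes"
shoe_listing = [0.8, 0.2, 0.1, 0.4]     # "lightweight athletic footwear ..."
blender_listing = [0.0, 0.9, 0.7, 0.1]  # an unrelated product

sim_shoe = cosine_similarity(query, shoe_listing)
sim_blender = cosine_similarity(query, blender_listing)
print(sim_shoe > sim_blender)  # True
```

Semantic search amounts to computing this score between the query vector and every candidate, then returning the highest-scoring items.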

The limitation is that text embeddings only understand text. If a user uploads a photo of a shoe they want to find, a text embedding model has nothing to work with. If your product catalog contains images without rich text descriptions, semantic search over text embeddings misses most of the signal.

Image-only queries

A user uploads a screenshot of a product they saw and wants to find something similar. There is no text query to embed. Text-only retrieval fails completely.

Sparse text descriptions

Many product catalogs, document libraries, and content databases have images with minimal or missing text descriptions. Relying on text embeddings means those items are effectively invisible to search.

Cross-modal grounding

A user asks 'show me examples of what you mean' after a text explanation. Connecting text descriptions to illustrative images requires understanding both modalities in the same space.

Visual quality signals

Ranking search results by visual similarity (color, style, composition) requires encoding visual features, not just text metadata. Text embeddings cannot capture 'same visual aesthetic as this reference image.'

How Multimodal Embeddings Work: CLIP and the Shared Space

The core idea behind multimodal embeddings is a joint embedding space: a shared vector space where both images and text are encoded, such that semantically related content across modalities ends up nearby. A photo of a golden retriever and the text "golden retriever" both map to similar vectors. A photo of Paris at night and the text "Eiffel Tower illuminated" end up close together.

CLIP (Contrastive Language-Image Pre-training), published by OpenAI in 2021, is the foundational model that popularized this approach. CLIP trains two encoders jointly: an image encoder (a Vision Transformer in the most widely used variants, a ResNet in others) and a text encoder. During training, CLIP sees 400 million image-text pairs collected from the internet and learns to pull matching pairs together in vector space while pushing non-matching pairs apart. This contrastive training objective is what creates the alignment between modalities.

The image encoder

A Vision Transformer (ViT) divides the input image into a grid of patches, converts each patch into a token, and processes them through transformer attention layers. The final output is a single vector representing the image's semantic content.
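The patch grid is simple arithmetic, sketched below. The 224-pixel input and the 32- and 16-pixel patch sizes are assumptions chosen because they match common ViT configurations; the function name is illustrative.

```python
def num_patches(image_size, patch_size):
    """Number of square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

print(num_patches(224, 32))  # 49
print(num_patches(224, 16))  # 196
```

Smaller patches mean more tokens per image, which generally captures finer detail at the cost of slower encoding.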

The text encoder

A standard transformer language model processes the text description and produces a vector representing its semantic meaning. CLIP's text encoder is architecturally similar to a small GPT-2-style model.

Contrastive training

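For a batch of N image-text pairs, CLIP computes all N × N pairwise similarity scores. The N correct pairings are treated as the right answers: a softmax cross-entropy loss, applied in both directions (image to text and text to image), raises their similarity relative to the N² − N mismatched combinations. This forces the model to align semantically related content across modalities.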
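The contrastive objective can be sketched in a few lines. This is a simplified illustration, not CLIP's training code: it uses toy 2-dimensional vectors, plain Python instead of a deep learning framework, and a fixed temperature of 0.07 (an assumption here; CLIP learns this value during training). All function names are illustrative.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one row, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric loss: pair i of images is assumed to match pair i of texts."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # N x N matrix of scaled cosine similarities.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick its own column.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick its own row.
    t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < shuffled_loss)  # True
```

The loss is near zero when each image is most similar to its own caption and large when captions are mismatched. Training adjusts both encoders to drive it down.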

Zero-shot generalization

Because CLIP learned from natural language descriptions of images rather than fixed category labels, it generalizes to new categories it was never explicitly trained on. You can query 'a photo of a mango cake' and get meaningful results without ever fine-tuning on mango cakes.
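Zero-shot classification falls out of the shared space: embed each candidate label as text, embed the image, and pick the closest label. The sketch below assumes the vectors have already been produced by the encoders; the 3-dimensional values are invented for illustration and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def zero_shot_label(image_vec, label_vecs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

# Hypothetical text embeddings for three prompts (toy values).
label_vecs = {
    "a photo of a mango cake": [0.9, 0.8, 0.1],
    "a photo of a running shoe": [0.1, 0.2, 0.9],
    "a photo of a golden retriever": [0.7, 0.1, 0.6],
}
# Hypothetical image embedding for an uploaded cake photo.
image_vec = [0.8, 0.9, 0.2]

print(zero_shot_label(image_vec, label_vecs))  # a photo of a mango cake
```

Adding a new category only requires adding a new text prompt. No retraining is involved.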

Since CLIP, the field has produced stronger variants and related models: SigLIP (from Google; replaces the softmax loss with a pairwise sigmoid loss and performs better at smaller batch sizes), ALIGN (also from Google; trained on more than a billion noisy image-text pairs), and BLIP-2 (adds a lightweight adapter between a frozen image encoder and a frozen large language model, enabling instruction-following over images). Several commercial API providers expose CLIP-style models for embedding generation.

The Retrieval Architecture Behind Cross-Modal Features

Understanding the joint embedding space is half the picture. The other half is the retrieval system that uses those embeddings at product scale. The architecture for cross-modal retrieval follows the same pattern as text semantic search, with a few additional considerations.

Step 1: Offline indexing

Encode every item in your corpus (images, documents, products) using the multimodal encoder. Store the resulting vectors in a vector database (Pinecone, Weaviate, pgvector, or Qdrant). This is done once and updated incrementally as new items are added.

PM note: Indexing latency and cost scale with corpus size. 1M images stored as 512-dimensional float32 vectors (4 bytes per value) require roughly 2GB of storage, plus vector database overhead. Budget accordingly.
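The storage estimate in the PM note can be reproduced with back-of-envelope arithmetic. The sketch assumes 4-byte float32 values and counts raw vectors only; index structures and metadata add overhead that varies by database. The function name is illustrative.

```python
def vector_storage_bytes(num_items, dims, bytes_per_value=4):
    """Raw storage for the vectors alone, excluding index overhead."""
    return num_items * dims * bytes_per_value

raw_bytes = vector_storage_bytes(1_000_000, 512)
print(raw_bytes / 1e9)  # 2.048 (GB)

# The same corpus at 1024 dimensions doubles the footprint.
print(vector_storage_bytes(1_000_000, 1024) / 1e9)  # 4.096 (GB)
```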

Step 2: Query encoding

When a user submits a query (text or image), encode it with the same multimodal encoder in real time. This produces a query vector in the same shared space as your indexed items.

PM note: Query encoding latency is typically 20 to 100ms depending on the encoder model and hardware. This is your critical path for search latency.

Step 3: Approximate nearest neighbor (ANN) search

Find the k most similar vectors in your index using cosine or dot-product similarity. ANN algorithms (HNSW, IVF) make this fast even across millions of items by accepting a small loss in recall (occasionally missing a true nearest neighbor) in exchange for much lower compute.

PM note: ANN search returns results in 10 to 50ms for corpora up to tens of millions of items. For larger corpora or stricter latency requirements, pre-filtering and hierarchical indexing are needed.
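For intuition, here is the exact (brute-force) version of the search that ANN algorithms approximate. It scores the query against every indexed vector, which is fine for a demo and too slow at scale; a production system would swap this scan for an HNSW or IVF index in a vector database. The item IDs and vectors are made up.

```python
import heapq
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, index, k):
    """Exact nearest-neighbor search over unit-length vectors.

    Returns (score, item_id) tuples, best first.
    """
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), item_id)
        for item_id, vec in index.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical index: item ID -> normalized embedding.
index = {
    "sku-1": normalize([1.0, 0.0, 0.0]),
    "sku-2": normalize([0.9, 0.1, 0.0]),
    "sku-3": normalize([0.0, 1.0, 0.0]),
    "sku-4": normalize([0.0, 0.0, 1.0]),
}

results = top_k([0.9, 0.0, 0.1], index, k=2)
print([item_id for _, item_id in results])  # ['sku-1', 'sku-2']
```

Because the vectors are normalized, the dot product here equals cosine similarity.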

Step 4: Re-ranking (optional)

The top-k ANN results can be re-ranked using a more expensive but more accurate model (a cross-encoder that jointly processes the query and each candidate). Re-ranking improves precision at the cost of latency.

PM note: Re-ranking adds 50 to 200ms and is typically only worth it when precision on the top 3 to 5 results is critical (e-commerce purchase flows, medical image retrieval) vs. exploratory search.
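Adding up the per-stage figures from the PM notes above gives a rough end-to-end latency budget. This sketch only sums the quoted ranges; it ignores network round trips, queuing, and result rendering, which a real budget would need to include. The stage names are labels chosen for this example.

```python
# (low, high) latency in milliseconds, taken from the PM notes above.
STAGES_MS = {
    "query_encoding": (20, 100),
    "ann_search": (10, 50),
    "reranking": (50, 200),
}

def latency_range(stages):
    """Sum the best-case and worst-case latency across the chosen stages."""
    low = sum(STAGES_MS[name][0] for name in stages)
    high = sum(STAGES_MS[name][1] for name in stages)
    return low, high

print(latency_range(["query_encoding", "ann_search"]))               # (30, 150)
print(latency_range(["query_encoding", "ann_search", "reranking"]))  # (80, 350)
```

In the worst case, re-ranking more than doubles the search budget, which is why it is usually reserved for flows where the top few results matter most.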

Product Applications That Use Multimodal Embeddings

Multimodal embeddings power a wider range of product features than most PMs realize. Any feature that needs to connect images and text, or find visually similar content, is likely using them.

Visual search (image-to-catalog)

A user uploads a photo of a dress they saw on a street. Your system encodes the image, searches your product catalog with that vector, and returns visually similar items. Pinterest Lens, Amazon visual search, and Google Lens all work this way.

Watch out: This only works when your catalog items are encoded with the same model as the query. Vectors from different encoders live in different spaces, so similarity scores between them are not meaningful.

Text-to-image retrieval

A user types 'product showing a family using the app on vacation' and your system returns matching stock photos or UGC. Marketing teams use this to surface content by semantic description rather than keyword-tagged metadata.

Watch out: Quality degrades when the text description is very abstract or includes nuanced cultural or aesthetic concepts the model was not trained on.

Multimodal document understanding

An enterprise RAG system that retrieves from PDFs containing both text and charts. Multimodal embeddings encode the chart images alongside the surrounding text so queries about revenue trends surface the correct chart even without alt-text.

Watch out: Most production RAG systems still use text-only embeddings. Adding vision adds cost and complexity but significantly improves recall for visually dense documents.

Design feedback automation

A PM tool that analyzes competitor screenshots and UI mockups to cluster visual design patterns, find similar components, or flag accessibility issues. It encodes UI screenshots and retrieves them by visual similarity.

Watch out: UI screenshot embeddings often require fine-tuning or domain-specific training because general CLIP models were not optimized on interface design corpora.

Content moderation at scale

Matching uploaded user images against a database of known violating content using embedding similarity rather than exact-hash matching. This catches edited, cropped, or re-encoded variants of violating content that exact hashes, and often perceptual hashes, miss.

Watch out: False positive rates require careful threshold tuning. Embedding similarity is a first-pass filter, not a final decision.
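A first-pass similarity filter might look like the sketch below. The threshold of 0.9 is an arbitrary placeholder, not a recommended value; in practice it would be tuned against labeled examples to balance false positives and misses. Vectors, IDs, and function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def flag_for_review(upload_vec, known_violations, threshold=0.9):
    """Return IDs of known violating items the upload closely resembles.

    A non-empty result means "send to human or secondary review",
    not "remove automatically".
    """
    return sorted(
        item_id
        for item_id, vec in known_violations.items()
        if cosine(upload_vec, vec) >= threshold
    )

# Hypothetical embeddings of previously confirmed violating images.
known_violations = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}

print(flag_for_review([0.95, 0.1], known_violations))  # ['v1']
print(flag_for_review([0.7, 0.7], known_violations))   # []
```

Lowering the threshold catches more altered copies but sends more innocent uploads to review, which is the trade-off the tuning has to settle.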

Cross-modal recommendation

Recommending articles or videos based on what images a user has engaged with, or vice versa. The shared embedding space allows cross-modal affinity modeling without separate recommendation models per content type.

Watch out: Requires unified user history across modalities, which has privacy and data pipeline implications.

Key Implementation Decisions for Product Managers

When your team is evaluating whether to build a feature on multimodal embeddings, these are the decisions that determine cost, quality, and timeline.

Use a pretrained API vs. fine-tune your own encoder

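For most product use cases, start with a pretrained CLIP-family model, either through a managed multimodal embedding API (for example, Google Vertex AI multimodal embeddings) or as an open-source model served via hosts such as Together AI or Replicate. Check that the endpoint you choose actually accepts images: some popular embedding APIs, including OpenAI's embeddings endpoint at the time of writing, are text-only. Fine-tuning is only worth it when your domain has visual concepts underrepresented in web-scraped training data (specialized medical imagery, proprietary industrial equipment, niche fashion categories). Fine-tuning requires labeled image-text pairs and a minimum of 2 to 4 weeks of engineering time.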

Embedding dimensionality

Most CLIP-family models output vectors of 512, 768, or 1024 dimensions. Higher dimensions capture more nuance but cost more to store and search. For retrieval over fewer than 1 million items, 512 dimensions with HNSW indexing is sufficient. At 10M+ items or sub-10ms latency requirements, consider dimensionality reduction (PCA to 256 dimensions) or product quantization.

Hybrid retrieval (multimodal + text metadata)

Pure embedding retrieval sometimes misses items that are semantically relevant but visually different. Combining multimodal embedding similarity with structured metadata filters (price range, category, availability) typically improves end-to-end precision by 15 to 25% in e-commerce settings. Build a re-ranking layer that blends both signals.
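One way to combine the two signals is to apply the structured filters as hard constraints and then blend scores for ranking. This is a sketch under assumptions: the `Candidate` fields, the `metadata_score` signal (imagine a 0-to-1 keyword match on the product title), and the 0.7 weight are all hypothetical choices, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    similarity: float      # embedding similarity from the vector search
    price: float
    category: str
    in_stock: bool
    metadata_score: float  # hypothetical 0-1 text/metadata relevance signal

def hybrid_rank(candidates, category, max_price, weight=0.7):
    """Filter on hard constraints, then rank by a weighted blend of signals."""
    eligible = [
        c for c in candidates
        if c.in_stock and c.category == category and c.price <= max_price
    ]
    return sorted(
        eligible,
        key=lambda c: weight * c.similarity + (1 - weight) * c.metadata_score,
        reverse=True,
    )

candidates = [
    Candidate("a", 0.92, 80.0, "shoes", True, 0.2),
    Candidate("b", 0.85, 60.0, "shoes", True, 0.9),
    Candidate("c", 0.95, 200.0, "shoes", True, 0.5),   # over budget
    Candidate("d", 0.90, 70.0, "shoes", False, 0.8),   # out of stock
]

ranked = hybrid_rank(candidates, category="shoes", max_price=100.0)
print([c.item_id for c in ranked])  # ['b', 'a']
```

Item "c" has the highest visual similarity but never reaches the user because it violates the price filter, and "b" outranks "a" once the metadata signal is blended in.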

Evaluation and quality metrics

Evaluate multimodal retrieval with recall@k (what fraction of relevant items appear in the top k results) and mean reciprocal rank (how highly ranked the first relevant result is). Human evaluation is still necessary for qualitative judgment on visual similarity. Run evals across diverse query types: text-to-image, image-to-text, and same-modality retrieval.
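Both metrics are short enough to compute by hand or in a few lines. The sketch below assumes each query comes with a ranked list of returned item IDs and a set of IDs judged relevant; the function names and sample data are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first relevant result across queries.

    `queries` is a list of (ranked_ids, relevant_ids) pairs. A query with
    no relevant result in its ranking contributes 0.
    """
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, item_id in enumerate(ranked_ids, start=1):
            if item_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries)

print(recall_at_k(["a", "b", "c", "d"], {"b", "d"}, k=2))  # 0.5
print(recall_at_k(["a", "b", "c", "d"], {"b", "d"}, k=4))  # 1.0

eval_set = [
    (["a", "b", "c"], {"b"}),  # first relevant result at rank 2
    (["x", "y", "z"], {"x"}),  # first relevant result at rank 1
]
print(mean_reciprocal_rank(eval_set))  # 0.75
```

Computing the metrics separately for text-to-image, image-to-text, and same-modality queries shows whether one direction is lagging.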

What Is Coming Next in Multimodal Embeddings

The multimodal embedding space is moving faster than most other areas of applied AI. Four trends are shaping where this capability will be in 12 to 24 months.

Video and temporal embeddings

Current CLIP-family models treat images as static snapshots. The next generation encodes video clips as temporal sequences, enabling queries like 'find scenes where a user looks frustrated' or 'retrieve product demos showing the checkout flow.' Models like VideoCLIP and InternVideo are early versions.

Audio-visual embeddings

Models that jointly encode audio, video, and text are enabling features like finding video segments by spoken content combined with visual context. Relevant for any product in media, education, or enterprise knowledge management.

Instruction-tuned multimodal embeddings

Standard CLIP-family models embed everything the same way regardless of the downstream task. Instruction-tuned variants (like E5-V) adapt the encoding based on the task description, producing better embeddings for classification versus retrieval versus clustering without separate models.

On-device multimodal embedding models

Quantized CLIP variants are small enough to run inference on mobile devices and edge hardware. This enables offline visual search without API calls, which matters for latency-sensitive or connectivity-constrained use cases.