Embedding models are the unsung heroes of Retrieval Augmented Generation (RAG), and picking the wrong one can turn your sophisticated pipeline into a glorified keyword search.
Let’s see a RAG pipeline in action, specifically how different embedding models affect retrieval. Imagine we have a small knowledge base:
- Document A: "The quick brown fox jumps over the lazy dog. This is a classic pangram used for testing typefaces."
- Document B: "Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals."
- Document C: "Large Language Models (LLMs) are a type of artificial intelligence algorithm trained on vast amounts of text data to understand and generate human-like language."
Now, let’s say our query is: "What are AI models that understand language?"
To see the difference, we'll embed the same documents and query with two models: text-embedding-ada-002, an older general-purpose model that was OpenAI's popular choice for a while, and all-MiniLM-L6-v2, a small open-source SentenceTransformer model.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Assume OpenAI client and models are set up
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
def get_embedding_openai(text):
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    return response.data[0].embedding
# Let's use a SentenceTransformer model for comparison
# This one is specifically trained for semantic search
model_st = SentenceTransformer('all-MiniLM-L6-v2')
def get_embedding_st(text):
    return model_st.encode(text)
# Documents and Query
docs = {
    "A": "The quick brown fox jumps over the lazy dog. This is a classic pangram used for testing typefaces.",
    "B": "Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals.",
    "C": "Large Language Models (LLMs) are a type of artificial intelligence algorithm trained on vast amounts of text data to understand and generate human-like language.",
}
query = "What are AI models that understand language?"
# Get embeddings
query_embedding_ada = get_embedding_openai(query)
doc_embeddings_ada = {k: get_embedding_openai(v) for k, v in docs.items()}
query_embedding_st = get_embedding_st(query)
doc_embeddings_st = {k: get_embedding_st(v) for k, v in docs.items()}
# Calculate cosine similarity separately for each model.
# ada-002 vectors have 1536 dimensions and all-MiniLM-L6-v2 vectors have 384,
# so a query vector is only ever compared against document vectors produced
# by the same model. Scores from different models are not directly comparable.
# OpenAI ada-002 example
similarities_ada = {k: cosine_similarity([query_embedding_ada], [v])[0][0] for k, v in doc_embeddings_ada.items()}
# Let's say the results show: A: 0.2, B: 0.7, C: 0.65
# SentenceTransformer all-MiniLM-L6-v2 example
similarities_st = {k: cosine_similarity([query_embedding_st], [v])[0][0] for k, v in doc_embeddings_st.items()}
# Let's say the results show: A: 0.1, B: 0.6, C: 0.8
print("--- OpenAI ada-002 Similarities ---")
for doc, sim in sorted(similarities_ada.items(), key=lambda item: item[1], reverse=True):
    print(f"Doc {doc}: {sim:.2f}")
print("\n--- SentenceTransformer all-MiniLM-L6-v2 Similarities ---")
for doc, sim in sorted(similarities_st.items(), key=lambda item: item[1], reverse=True):
    print(f"Doc {doc}: {sim:.2f}")
In this hypothetical output, ada-002 ranks Document B (the general AI definition) above Document C (LLMs), perhaps because "AI" is a stronger keyword signal in Document B. all-MiniLM-L6-v2, which is trained for semantic similarity, instead ranks Document C first. That is the better match for the query "AI models that understand language", since Document C focuses on LLMs and their language capabilities. The LLM would then generate its answer from whichever document ranked first, so in the second case it would work from Document C.
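The hand-off from retrieval to generation can be sketched as follows. This is a minimal illustration, not part of the pipeline above: `top_k` and `build_prompt` are hypothetical helper names, the prompt wording is an assumption, the document texts are shortened, and the scores are the illustrative all-MiniLM-L6-v2 numbers from the example rather than measured values.

```python
def top_k(similarities, k=1):
    """Return the k document ids with the highest similarity, best first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def build_prompt(query, docs, doc_ids):
    """Join the retrieved documents into a context block for the LLM."""
    context = "\n\n".join(f"[Doc {doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Illustrative scores (hypothetical, copied from the example above).
example_similarities = {"A": 0.1, "B": 0.6, "C": 0.8}
# Shortened stand-ins for the three documents.
example_docs = {
    "A": "The quick brown fox jumps over the lazy dog.",
    "B": "Artificial intelligence is intelligence demonstrated by machines.",
    "C": "Large Language Models are trained on text to understand and generate language.",
}

retrieved = top_k(example_similarities, k=1)
prompt = build_prompt("What are AI models that understand language?", example_docs, retrieved)
print(prompt)
```

Raising `k` passes more documents to the LLM at the cost of a longer prompt; the right value depends on how focused your chunks are.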
The core problem RAG solves is grounding LLM responses in specific, up-to-date, or proprietary data, which reduces hallucinations and improves accuracy. It works by first retrieving relevant documents from a knowledge base and then providing these documents as context to the LLM for generation. The embedding model is the crucial first step: it converts both your query and your documents into dense numerical vectors in a high-dimensional space. The similarity between these vectors (usually measured by cosine similarity) determines how "relevant" a document is to a query.
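To make the similarity step concrete, here is a minimal sketch of cosine similarity in NumPy. The three-dimensional vectors are made-up toy values chosen so the arithmetic is easy to follow; real embeddings have hundreds or thousands of dimensions, and `cosine_sim` is a hypothetical helper name.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (assumed values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "same_direction": [2.0, 0.0, 2.0],  # points the same way as the query
    "partial": [1.0, 1.0, 0.0],         # shares one component with the query
    "orthogonal": [0.0, 1.0, 0.0],      # shares nothing with the query
}

scores = {name: cosine_sim(query_vec, vec) for name, vec in doc_vecs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Because cosine similarity depends only on direction, `same_direction` scores about 1.0 even though it is twice as long as the query vector, while `orthogonal` scores 0.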
The key levers you control are:
- The Embedding Model: This is the most impactful choice. Different models are trained on different datasets and with different objectives, leading to varying strengths in capturing semantic meaning, nuance, or specific domains.
- The Knowledge Base: The quality, format, and chunking strategy of your documents directly affect what the embedding model can find.
- The Retrieval Strategy: Beyond simple similarity, techniques like Maximum Marginal Relevance (MMR) or hybrid search (combining keyword and vector search) can refine which documents are passed to the LLM.
- The LLM: The generative model itself, and how well it can utilize the provided context.
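
As an illustration of the retrieval-strategy lever, here is a minimal sketch of greedy Maximum Marginal Relevance over cosine similarities. The function name `mmr`, the parameter name `lambda_mult`, and the two-dimensional toy vectors are assumptions for this example; vector stores that offer MMR may differ in details.

```python
import numpy as np


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of selected documents in pick order.

    Each step picks the candidate maximizing
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already selected).
    """
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    relevance = d @ q   # cosine similarity of each document to the query
    pairwise = d @ d.T  # cosine similarity between documents

    selected = []
    candidates = list(range(len(d)))
    while candidates and len(selected) < k:
        if selected:
            # Highest similarity of each candidate to anything already picked.
            redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lambda_mult * relevance[candidates] - (1 - lambda_mult) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is less
# relevant to the query but covers a different direction.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.10], [1.0, 0.12], [0.8, -0.6]]

print("MMR:", mmr(toy_query, toy_docs, k=2, lambda_mult=0.5))
print("Similarity only:", mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the function falls back to plain similarity ranking, which returns both near-duplicates. Lower values trade some relevance for diversity.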
The choice of embedding model boils down to a trade-off between generality, performance, cost, and specialization.
- General-Purpose Models (e.g., text-embedding-ada-002, all-MiniLM-L6-v2, e5-large-v2): These are good starting points. ada-002 is known for its broad understanding but can be expensive at scale. all-MiniLM-L6-v2 is a fast, open-source option that performs remarkably well for its size. They excel at capturing common semantic relationships.
- Domain-Specific Models (e.g., models fine-tuned on scientific papers or legal documents): If your RAG pipeline deals with highly specialized content (medical, legal, scientific research), a model fine-tuned on similar data will likely yield superior retrieval. These models understand the jargon and nuances of that domain much better.
- Multilingual Models (e.g., paraphrase-multilingual-mpnet-base-v2): If your knowledge base or user queries span multiple languages, these models are essential. They map text from different languages into a shared embedding space.
- Models with Varying Dimensions (e.g., 384, 768, 1024, 1536): Higher dimensions can sometimes capture more nuance but also increase computational cost and storage requirements. The optimal dimension often depends on the model’s architecture and training data.
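
The storage side of the dimension trade-off is simple arithmetic. The sketch below assumes vectors stored as uncompressed float32 (4 bytes per value) and a hypothetical corpus of one million chunks; it counts raw vector data only and ignores index structures, metadata, and any quantization a vector store might apply.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of the stored vectors, assuming one fixed-width value per dimension."""
    return num_vectors * dims * bytes_per_value


NUM_CHUNKS = 1_000_000  # hypothetical corpus size

for dims in (384, 768, 1024, 1536):
    size_gb = index_size_bytes(NUM_CHUNKS, dims) / 1e9
    print(f"{dims:>5} dims: {size_gb:.2f} GB")
```

Under these assumptions, moving from 384 to 1536 dimensions quadruples the raw footprint, from roughly 1.5 GB to roughly 6.1 GB, and similarity computation scales with dimension in the same way.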
When you’re evaluating embedding models, don’t just look at benchmarks on generic datasets. Test them against your actual documents and typical queries. A model that scores 90% on a standard benchmark might perform worse than an 85% model on your specific, niche data if the benchmark doesn’t reflect your domain’s language. Furthermore, consider the inference speed and cost. A slightly less accurate but much faster and cheaper model might be the better practical choice for a high-volume application.
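One way to run that kind of test is a small hit-rate harness: label a handful of real queries with the document that should come back, then measure how often each model retrieves it in the top k. The sketch below is a minimal version under stated assumptions. `hit_rate_at_k` and `toy_embed` are hypothetical names, and `toy_embed` is a word-count stand-in used only so the example runs without downloading a model.

```python
import numpy as np


def hit_rate_at_k(embed, docs, labeled_queries, k=1):
    """Fraction of queries whose expected document appears in the top-k results.

    embed: callable mapping a list of strings to a 2D array (one row per text).
    docs: dict of doc_id -> text.
    labeled_queries: list of (query_text, expected_doc_id) pairs.
    """
    doc_ids = list(docs)
    doc_vecs = np.asarray(embed([docs[d] for d in doc_ids]), dtype=float)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    hits = 0
    for query, expected in labeled_queries:
        q = np.asarray(embed([query]), dtype=float)[0]
        q = q / np.linalg.norm(q)
        top = np.argsort(doc_vecs @ q)[::-1][:k]  # indices of the k best matches
        if expected in {doc_ids[i] for i in top}:
            hits += 1
    return hits / len(labeled_queries)


# Stand-in embedding: counts of a fixed vocabulary (not a real model).
VOCAB = ["fox", "dog", "ai", "machines", "language", "models"]


def toy_embed(texts):
    rows = []
    for text in texts:
        words = text.lower().split()
        rows.append([words.count(term) for term in VOCAB])
    return np.array(rows, dtype=float)


toy_docs = {"A": "fox dog", "B": "ai machines", "C": "ai language models"}
toy_queries = [("language models", "C"), ("fox", "A"), ("machines", "B")]

print(f"hit rate@1: {hit_rate_at_k(toy_embed, toy_docs, toy_queries, k=1):.2f}")
```

To compare real models, pass a different `embed` callable for each one. For example, `SentenceTransformer.encode` accepts a list of strings and returns one vector per string, so `model_st.encode` can be passed directly. Keep the documents and labeled queries fixed so the hit rates are comparable.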
The most surprising thing about embedding models is how much their internal representation of "meaning" can diverge based on their training data and architecture, leading to vastly different retrieval outcomes for seemingly similar queries or documents. For instance, a model trained heavily on news articles might struggle to find relevant information in a technical manual, even if the underlying concepts are related, because its learned semantic space is skewed towards journalistic language.
The next problem you’ll encounter is efficiently managing and updating your vector embeddings as your knowledge base grows or changes.