The most surprising thing about embedding models for RAG is how little they depend on exact sentence structure: what dominates the resulting vector is the overall semantic content of the words in a given chunk.

Let’s see nomic-embed-text-v1.5 in action, generating embeddings for a few lines of text. We’ll use Ollama’s API directly to get a feel for the raw output.

curl http://localhost:11434/api/embeddings \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "The quick brown fox jumps over the lazy dog."
  }'

This will return a JSON object with an embedding field, which is a list of floating-point numbers. Each number is the value along one dimension of the model’s learned vector space. For nomic-embed-text-v1.5, this vector is 768 dimensions long.

{
  "embedding": [
    -0.018310546875,
    0.0244140625,
    -0.0146484375,
    // ... 765 more dimensions
  ]
}

Now, let’s try mxbai-embed-large.

curl http://localhost:11434/api/embeddings \
  -d '{
    "model": "mxbai-embed-large",
    "prompt": "The quick brown fox jumps over the lazy dog."
  }'

This model, mxbai-embed-large, produces a 1024-dimensional embedding vector. The values will be different from nomic-embed-text-v1.5, reflecting the different training data and architecture.

{
  "embedding": [
    -0.0321044921875,
    0.045654296875,
    -0.02874755859375,
    // ... 1021 more dimensions
  ]
}

The core problem these models solve in RAG is bridging the gap between natural language queries and the text stored in a knowledge base. When a user asks a question, say "What are the symptoms of a cold?", a traditional keyword search might miss documents that describe "runny nose," "sore throat," or "cough" without using the exact phrase "symptoms of a cold." Embedding models represent both the query and the document chunks as dense numerical vectors. The magic happens when we calculate the similarity between these vectors, typically using cosine similarity. Documents whose embeddings are "close" to the query embedding in this high-dimensional space are considered semantically relevant, even if they don’t share keywords.
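To make "close" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are toy values invented for illustration, standing in for the 768- or 1024-dimensional embeddings shown earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, not real model output).
query = [1.0, 0.0, 1.0]
similar_chunk = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated_chunk = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, similar_chunk))    # approximately 1.0
print(cosine_similarity(query, unrelated_chunk))  # 0.0
```

Because the score depends only on direction, a chunk vector that is twice as long as the query vector still scores as a near-perfect match. The length check also guards against comparing vectors from two different models, whose dimensions (768 versus 1024 here) do not line up.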

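Internally, these models are large neural networks, usually transformers, trained on massive datasets of text. During training, they learn to map text to vectors such that inputs with similar meanings land at nearby points in the vector space. For RAG, we typically use "bi-encoder" architectures. This means the query and each document chunk are passed through the same model separately, and each gets its own embedding. The model never sees a query and a chunk together, so chunk embeddings can be computed once, ahead of time. Some models do distinguish the two roles through a text prefix: the nomic-embed-text-v1.5 model card, for example, asks for "search_query: " before queries and "search_document: " before chunks, and the mxbai-embed-large card recommends an instruction prompt on queries only. Check the card for the model you use, since retrieval quality can suffer when an expected prefix is missing. The similarity calculation happens after the embeddings are generated.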
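As a rough sketch of the bi-encoder flow, the helper below sends one text at a time to the same Ollama endpoint used in the curl examples. The function names build_payload and embed are hypothetical, and the optional prefix argument is an assumption for models whose cards ask for a role marker such as "search_query: " (verify the exact strings against the model card before relying on them).

```python
import json
import urllib.request

# Same local endpoint as the curl examples; assumes Ollama is running on the default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_payload(model, text, prefix=""):
    """Build the request body. `prefix` is an optional role marker prepended to the text."""
    return {"model": model, "prompt": prefix + text}

def embed(model, text, prefix=""):
    """POST one text to Ollama and return its embedding as a list of floats."""
    body = json.dumps(build_payload(model, text, prefix)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]

# Typical use (needs a running Ollama server, so it is left commented out):
# chunk_vec = embed("nomic-embed-text", "A cold often causes a runny nose.", "search_document: ")
# query_vec = embed("nomic-embed-text", "What are the symptoms of a cold?", "search_query: ")
```

The query and the chunk never appear in the same request. Each is embedded independently, which is what allows chunk vectors to be computed and stored before any query arrives.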

The main levers you control are the choice of embedding model and how you chunk your documents. Different models have different strengths and weaknesses. nomic-embed-text-v1.5 is known for its speed and efficiency, making it a good choice when you need to embed large volumes of data quickly or have limited computational resources. mxbai-embed-large, on the other hand, often provides higher accuracy and better semantic understanding due to its larger size and potentially more diverse training data, but at the cost of higher latency and resource usage. Chunking strategy is critical: too small, and you lose context; too large, and the embedding might become too diluted with irrelevant information. Chunks must also fit the model’s context window: mxbai-embed-large accepts at most 512 tokens per input, while nomic-embed-text-v1.5 is documented to handle up to 8192, and text beyond the limit is typically truncated or rejected. A common starting point is a chunk size of a few hundred tokens, up to roughly 1000 where the model allows it, often with some overlap between chunks to ensure no information is lost at the boundaries.
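A minimal sketch of fixed-size chunking with overlap is shown below. The function name chunk_tokens is hypothetical, and the example treats whitespace-separated words as tokens, which is only an approximation: the model’s own tokenizer usually produces more tokens than there are words.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks of `chunk_size`, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks

# Whitespace splitting stands in for real tokenization (an assumption for illustration).
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

The last word of each chunk reappears as the first word of the next, so a sentence that straddles a boundary is still seen in full by at least one chunk when the overlap is large enough.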

When you embed a chunk, the model is essentially summarizing its semantic essence into a fixed-size vector. What this means mechanically is that if you have a document with 10,000 tokens, and you chunk it into 10 pieces of 1000 tokens each (assuming the model’s context window allows chunks that long), the model produces 10 distinct vectors, each representing the semantic content of its respective chunk. In a RAG index these vectors are usually stored and searched individually rather than averaged into one document-level vector, because retrieval needs to return the specific chunks that match the query. Within each chunk, though, the model pools its token-level representations into a single vector. This pooling inherently smooths out nuances; the model isn’t designed to preserve the exact ordering or combinatorial logic of words across large distances within a chunk, but rather to capture the predominant themes.
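In a typical RAG setup, each chunk vector is kept as its own entry and the query is compared against every entry. The sketch below illustrates that with a tiny in-memory index. The chunk identifiers, the vectors, and the top_k helper are all invented for illustration; a real system would use embeddings returned by the model and usually a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# One entry per chunk, not per document (toy 3-dimensional vectors).
index = [
    ("doc1#chunk0", [0.9, 0.1, 0.0]),
    ("doc1#chunk1", [0.0, 1.0, 0.0]),
    ("doc2#chunk0", [0.7, 0.0, 0.7]),
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, index, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

Here the two chunks of doc1 score very differently against the same query, which is the detail that would be lost if their vectors were merged into a single document-level embedding.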

The next concept you’ll likely explore is retrieval strategies beyond simple cosine similarity, such as hybrid search or re-ranking.
