OpenAI Embeddings can make your search results worse if you don’t understand how they work.

Let’s build a semantic search engine for a small set of documents. This means we’ll be able to find documents that are conceptually similar to a query, not just those that contain the exact same words.

Here’s our sample data. Imagine these are blog posts:

[
  {
    "id": 1,
    "title": "The Future of AI in Healthcare",
    "content": "Artificial intelligence is revolutionizing how we diagnose diseases, develop new drugs, and personalize patient treatments. Machine learning algorithms can analyze vast amounts of medical data to identify patterns invisible to the human eye. This leads to earlier detection and more effective interventions."
  },
  {
    "id": 2,
    "title": "Machine Learning for Beginners",
    "content": "Machine learning is a subset of artificial intelligence that allows systems to learn from data without explicit programming. It involves training models on datasets to recognize patterns and make predictions. Common applications include image recognition and natural language processing."
  },
  {
    "id": 3,
    "title": "Ethical Considerations in AI Development",
    "content": "As artificial intelligence becomes more powerful, ethical considerations are paramount. We must address issues of bias in algorithms, data privacy, job displacement, and the potential for misuse. Responsible AI development requires careful oversight and public discourse."
  },
  {
    "id": 4,
    "title": "Advancements in Medical Diagnostics",
    "content": "New technologies are transforming medical diagnostics. From AI-powered image analysis to genetic sequencing, healthcare professionals have more tools than ever to understand patient conditions. Early and accurate diagnosis is crucial for successful treatment outcomes."
  }
]

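To enable semantic search, we need to convert these text documents into numerical representations that capture their meaning. This is where OpenAI Embeddings come in. We’ll use the text-embedding-ada-002 model throughout. OpenAI has since released newer embedding models (text-embedding-3-small and text-embedding-3-large), so check the current model list and pricing before committing to one; apart from the model name, the API calls are the same.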

First, you’ll need the openai Python library (version 1.0 or later, since the code uses the openai.OpenAI() client) and numpy. Install them:

pip install "openai>=1.0" numpy

Then, set your API key as an environment variable:

export OPENAI_API_KEY='your-api-key'

Now, let’s generate embeddings for our documents. With text-embedding-ada-002, each embedding is a vector of 1536 floating-point numbers.

import openai
import numpy as np

# openai.OpenAI() reads the OPENAI_API_KEY environment variable by default,
# so no explicit key handling is needed here.

client = openai.OpenAI()

documents = [
  {
    "id": 1,
    "title": "The Future of AI in Healthcare",
    "content": "Artificial intelligence is revolutionizing how we diagnose diseases, develop new drugs, and personalize patient treatments. Machine learning algorithms can analyze vast amounts of medical data to identify patterns invisible to the human eye. This leads to earlier detection and more effective interventions."
  },
  {
    "id": 2,
    "title": "Machine Learning for Beginners",
    "content": "Machine learning is a subset of artificial intelligence that allows systems to learn from data without explicit programming. It involves training models on datasets to recognize patterns and make predictions. Common applications include image recognition and natural language processing."
  },
  {
    "id": 3,
    "title": "Ethical Considerations in AI Development",
    "content": "As artificial intelligence becomes more powerful, ethical considerations are paramount. We must address issues of bias in algorithms, data privacy, job displacement, and the potential for misuse. Responsible AI development requires careful oversight and public discourse."
  },
  {
    "id": 4,
    "title": "Advancements in Medical Diagnostics",
    "content": "New technologies are transforming medical diagnostics. From AI-powered image analysis to genetic sequencing, healthcare professionals have more tools than ever to understand patient conditions. Early and accurate diagnosis is crucial for successful treatment outcomes."
  }
]

document_embeddings = {}
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=f"{doc['title']}: {doc['content']}" # Include title for richer context
    )
    document_embeddings[doc['id']] = response.data[0].embedding
    print(f"Generated embedding for document ID: {doc['id']}")

# You would typically store these embeddings in a vector database (e.g., Pinecone, Weaviate, ChromaDB)
# For this example, we'll just keep them in memory.
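
The loop above sends one request per document. The embeddings endpoint also accepts a list of strings as input, so a small corpus can be embedded in a single call. The following is a minimal sketch of that idea, not the only way to do it; the helper names embed_batch and embed_documents are made up for this example, and the client argument is assumed to be an openai.OpenAI() instance.

```python
def embed_batch(client, texts, model="text-embedding-ada-002"):
    """Embed several texts with a single API request.

    `client` is assumed to be an openai.OpenAI() instance (or any object
    exposing the same embeddings.create interface).
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries an `index` giving its position in `texts`;
    # sorting on it makes the input/output alignment explicit.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


def embed_documents(client, documents):
    """Return {doc_id: embedding}, using the same "title: content" input text."""
    texts = [f"{doc['title']}: {doc['content']}" for doc in documents]
    vectors = embed_batch(client, texts)
    return {doc["id"]: vec for doc, vec in zip(documents, vectors)}


# Usage (requires a valid API key):
#   import openai
#   document_embeddings = embed_documents(openai.OpenAI(), documents)
```

For larger corpora you would split the texts into chunks, since the API limits how much input a single request can carry.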

The core idea of semantic search is that vectors close to each other in the high-dimensional space represent similar meanings. To find the most relevant document for a given query, we first generate an embedding for the query, and then we calculate the similarity between the query embedding and all document embeddings. The most common similarity metric is cosine similarity.

Let’s define a function to calculate cosine similarity.

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)
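
To get a feel for what this function returns, here is a small worked example on hand-picked toy vectors (these are hypothetical values, not real embeddings). It also illustrates a shortcut: OpenAI’s documentation describes its embeddings as normalized to length 1, and if that holds for the model you use, the plain dot product gives the same value as cosine similarity.

```python
import numpy as np


def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


# Hypothetical low-dimensional "embeddings", chosen by hand for illustration.
query = np.array([1.0, 0.0, 0.0])
same_direction = np.array([2.0, 0.0, 0.0])  # same direction, different length
orthogonal = np.array([0.0, 1.0, 0.0])      # unrelated direction
opposite = np.array([-1.0, 0.0, 0.0])       # opposite direction

sim_same = cosine_similarity(query, same_direction)  # 1.0: length is ignored
sim_orth = cosine_similarity(query, orthogonal)      # 0.0
sim_opp = cosine_similarity(query, opposite)         # -1.0

# For unit-length vectors, the dot product alone equals cosine similarity.
unit_a = np.array([0.6, 0.8])  # length 1
unit_b = np.array([1.0, 0.0])  # length 1
dot_only = np.dot(unit_a, unit_b)
cos_full = cosine_similarity(unit_a, unit_b)

print(sim_same, sim_orth, sim_opp)
print(np.isclose(dot_only, cos_full))
```

Keeping the explicit normalization, as the function above does, costs little and stays correct even if the vectors are not unit length.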

Now, let’s perform a search. Our query is "How can computers help doctors?".

def search_documents(query, documents, document_embeddings):
    query_response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query
    )
    query_embedding = query_response.data[0].embedding

    similarities = []
    for doc in documents:
        doc_id = doc['id']
        doc_embedding = document_embeddings[doc_id]
        similarity = cosine_similarity(query_embedding, doc_embedding)
        similarities.append((doc_id, similarity, doc['title']))

    # Sort by similarity in descending order
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities

query = "How can computers help doctors?"
results = search_documents(query, documents, document_embeddings)

print(f"\nSearch results for '{query}':")
for doc_id, similarity, title in results:
    print(f"- {title} (Similarity: {similarity:.4f})")

Running this, we’d likely see "The Future of AI in Healthcare" and "Advancements in Medical Diagnostics" rank highest, even though neither contains the words "computers" or "doctors." This is the power of semantic search.

The system works by mapping text to a vector space where semantic relationships are preserved geometrically. Documents and queries are transformed into points in this space. When you search, you’re essentially finding the points (documents) closest to your query point. The text-embedding-ada-002 model is trained on a massive dataset to create these meaningful vector representations.
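
The "closest points" idea can also be expressed as a single matrix operation instead of a Python loop. Here is a minimal sketch, assuming the document embeddings are stacked into a 2-D numpy array with one row per document; the function name rank_by_cosine and the two-dimensional toy vectors are invented for this illustration.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_matrix):
    """Return (row indices sorted best-first, matching cosine scores)."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Scale the query and every document row to unit length.
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    # One matrix-vector product yields all cosine similarities at once.
    scores = doc_units @ query_unit
    order = np.argsort(-scores)  # descending by similarity
    return order, scores[order]


# Toy 2-D vectors standing in for real 1536-dimensional embeddings.
toy_docs = np.array([
    [0.9, 0.1],  # row 0: points almost the same way as the query
    [0.1, 0.9],  # row 1: points almost at a right angle to it
    [0.7, 0.7],  # row 2: halfway between
])
toy_query = np.array([1.0, 0.0])

order, scores = rank_by_cosine(toy_query, toy_docs)
print(order.tolist())  # [0, 2, 1]
```

With the real data, the matrix could be built as np.array([document_embeddings[doc['id']] for doc in documents]), and each index in order maps back to the document at that position in documents.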

The one thing most people don’t realize is how much the input text you send to the embedding model matters, especially when documents have distinct components like titles and bodies. Embedding only doc['content'] can miss signals carried by doc['title']. By concatenating them as f"{doc['title']}: {doc['content']}", we give the model more context to work with. Experimenting with different input formats, and checking the rankings each one produces for a fixed set of queries, can noticeably change search relevance.
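
One way to run that kind of experiment is to put the input formatting in a single function so the variants are easy to swap. The sketch below is only an illustration: the function name build_embedding_input and the mode names are invented here, and which variant works best depends on your data.

```python
def build_embedding_input(doc, mode="title_and_content"):
    """Build the text sent to the embedding model for one document.

    The modes are hypothetical variants to compare, not an established scheme.
    """
    title, content = doc["title"], doc["content"]
    if mode == "content_only":
        return content
    if mode == "title_only":
        return title
    if mode == "title_and_content":
        # The format used in this article.
        return f"{title}: {content}"
    if mode == "labeled":
        # Explicit field labels, one field per line.
        return f"Title: {title}\nContent: {content}"
    raise ValueError(f"Unknown mode: {mode!r}")


sample = {"id": 2, "title": "Machine Learning for Beginners", "content": "Machine learning is a subset of AI."}
for mode in ("content_only", "title_only", "title_and_content", "labeled"):
    print(f"{mode}: {build_embedding_input(sample, mode)!r}")
```

To compare variants, embed the corpus once per mode, run the same handful of queries against each set of embeddings, and check which ranking matches your expectations.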

The next step in building a robust system is to handle a much larger corpus of documents, which necessitates a vector database for efficient similarity searching.
