A code repository is a latent knowledge base, and retrieval-augmented generation (RAG) lets you query it in plain language without training or fine-tuning a massive, proprietary model.

Let’s get this pipeline humming. We’ll use a Python setup, but the concepts translate. First, you need your code. Imagine it’s a messy but brilliant engineer’s notebook.

# Example Python code snippet
def calculate_discount(price: float, discount_percentage: float) -> float:
    """Calculates the final price after applying a discount."""
    if not 0 <= discount_percentage <= 100:
        raise ValueError("Discount percentage must be between 0 and 100.")
    discount_amount = price * (discount_percentage / 100)
    return price - discount_amount

class Product:
    def __init__(self, name: str, price: float):
        self.name = name
        self.price = price

    def get_discounted_price(self, discount_percentage: float) -> float:
        return calculate_discount(self.price, discount_percentage)

To make this searchable, we need to break it down and embed it.

1. Code Parsing and Chunking

You can’t just embed raw code files. You need to parse them into meaningful units. Libraries like tree-sitter are fantastic for this, as they understand code structure. For simpler cases, regular expressions or even just splitting by function/class definitions can work.

Let’s say we parse the above code. We might get chunks like:

  • Chunk 1:
    def calculate_discount(price: float, discount_percentage: float) -> float:
        """Calculates the final price after applying a discount."""
        if not 0 <= discount_percentage <= 100:
            raise ValueError("Discount percentage must be between 0 and 100.")
        discount_amount = price * (discount_percentage / 100)
        return price - discount_amount
    
  • Chunk 2:
    class Product:
        def __init__(self, name: str, price: float):
            self.name = name
            self.price = price
    
        def get_discounted_price(self, discount_percentage: float) -> float:
            return calculate_discount(self.price, discount_percentage)
    

The key is to keep related code together. A single function or a class definition with its methods is a good starting point.
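For Python sources specifically, the "split by function/class definitions" idea can be sketched with the standard-library ast module instead of tree-sitter. This is only a minimal illustration: chunk_python_source and SAMPLE are hypothetical names, and it only looks at top-level definitions.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        # Keep a class together with its methods by only walking the top level.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                # Exact source text of the definition (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical sample input, a trimmed version of the example code above.
SAMPLE = '''
def calculate_discount(price, discount_percentage):
    return price - price * (discount_percentage / 100)

class Product:
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def get_discounted_price(self, discount_percentage):
        return calculate_discount(self.price, discount_percentage)
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"])
```

Keeping the name and start line next to the text is a design choice: that metadata can later be stored alongside the embedding so an answer can point back to where the code lives.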

2. Embedding

Now, each chunk needs to be converted into a numerical vector (an embedding) that captures its semantic meaning. This is where libraries like sentence-transformers shine.

from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model.
# 'all-MiniLM-L6-v2' is a fast, general-purpose text model with 384-dimensional output.
# A model trained specifically on code may retrieve code more accurately.
model = SentenceTransformer('all-MiniLM-L6-v2')

code_chunks = [
    """def calculate_discount(price: float, discount_percentage: float) -> float:
        \"\"\"Calculates the final price after applying a discount.\"\"\"
        if not 0 <= discount_percentage <= 100:
            raise ValueError("Discount percentage must be between 0 and 100.")
        discount_amount = price * (discount_percentage / 100)
        return price - discount_amount""",
    """class Product:
        def __init__(self, name: str, price: float):
            self.name = name
            self.price = price

        def get_discounted_price(self, discount_percentage: float) -> float:
            return calculate_discount(self.price, discount_percentage)"""
]

# Generate embeddings
embeddings = model.encode(code_chunks)

print(embeddings.shape) # Output will be (2, 384) for all-MiniLM-L6-v2

Each row in embeddings is a vector for a corresponding code chunk.

3. Vector Database

These embeddings need to be stored and searched efficiently. Vector stores are designed for this: FAISS is a similarity-search library, while Pinecone, ChromaDB, and Weaviate are full vector databases. All of them support fast similarity search, that is, finding the stored vectors closest to a query vector.
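To make "finding vectors closest to a query vector" concrete, here is a brute-force sketch in plain Python. It assumes cosine similarity as the metric (real vector stores let you configure this and use approximate indexes rather than a full scan), and the toy three-dimensional vectors and the names top_k, toy_store, and toy_query are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict, k: int = 1) -> list[tuple]:
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
toy_store = {
    "chunk_1": [0.9, 0.1, 0.0],
    "chunk_2": [0.1, 0.8, 0.3],
}
toy_query = [0.2, 0.9, 0.2]

best = top_k(toy_query, toy_store, k=1)
print(best[0][0])  # id of the closest stored vector
```

A full scan like this is fine for a handful of chunks. A vector database earns its keep once the repository produces thousands of them.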

Let’s simulate adding to a ChromaDB:

import chromadb

client = chromadb.Client()
collection = client.create_collection("code_repo")

# Add embeddings and their corresponding code chunks
collection.add(
    embeddings=embeddings.tolist(),  # Convert the NumPy array to plain Python lists of floats
    documents=code_chunks,
    ids=["chunk_1", "chunk_2"]
)

4. The RAG Process: Querying

When a user asks a question, like "How do I apply a discount to a product?", you first embed their question using the same embedding model.

query = "How do I apply a discount to a product?"
query_embedding = model.encode(query)

# Search the vector database for the most similar code chunks
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=1  # Return only the single most relevant chunk
)

# 'documents' holds one list of hits per query: index the first query, then the first hit
print(results['documents'][0][0])

This would likely return the Product class and its get_discounted_price method, as it’s semantically closest to the query.
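The query result is a dictionary of nested lists, one inner list per query, which is easy to mis-index. The sketch below flattens that shape into one record per hit. It runs on a hand-written stand-in (fake_results) rather than a live collection, the distance values are invented, and unpack_hits is a hypothetical helper name.

```python
def unpack_hits(results: dict, query_index: int = 0) -> list[dict]:
    """Flatten the per-query nested lists into one dict per retrieved chunk."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": chunk_id, "distance": distance, "document": document}
        for chunk_id, document, distance in zip(ids, documents, distances)
    ]


# Hand-written stand-in for a query result with n_results=2; distances are made up.
fake_results = {
    "ids": [["chunk_2", "chunk_1"]],
    "documents": [["class Product: ...", "def calculate_discount(...): ..."]],
    "distances": [[0.42, 0.57]],
}

hits = unpack_hits(fake_results)
for hit in hits:
    print(hit["id"], hit["distance"])
```

Smaller distances mean closer matches, so the first hit is the best candidate to pass along as context.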

5. Generation

Finally, you take the retrieved code chunk(s) and use them as context for a large language model (LLM) to generate an answer.

from openai import OpenAI  # Or any other LLM provider

llm_client = OpenAI(api_key="YOUR_API_KEY")  # Placeholder; supply your own key

# results['documents'][0] is the list of chunks retrieved for the first query
context = "\n\n".join(results['documents'][0])

prompt = f"""Given the following Python code:

```python
{context}
```

Answer the question: "{query}" Provide a concise explanation and show the relevant code usage."""

response = llm_client.chat.completions.create(
    model="gpt-4o",  # Or gpt-3.5-turbo, etc.
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant that answers questions based on provided code context."},
        {"role": "user", "content": prompt},
    ],
)

print(response.choices[0].message.content)


The LLM, guided by the context, can now generate a precise answer like: "To apply a discount to a product, you can use the `get_discounted_price` method of the `Product` class. For example: `product_instance.get_discounted_price(15)` will apply a 15% discount."
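Once you retrieve more than one chunk, it helps to assemble the prompt in a small helper. The sketch below is one possible shape, not a prescribed one: build_prompt, the max_chars budget, and the fallback instruction are all assumptions added for illustration, and chunks are assumed to arrive ordered from most to least relevant.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Join retrieved chunks into one prompt, stopping once the character budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:
        # Chunks are assumed to be sorted by relevance, so stop at the first one that does not fit.
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "Given the following Python code:\n\n"
        f"{context}\n\n"
        f'Answer the question: "{question}"\n'
        "If the code does not contain the answer, say so."
    )


example_prompt = build_prompt(
    "How do I apply a discount to a product?",
    ["def a(): pass", "def b(): pass"],
    max_chars=15,  # Deliberately tiny so only the first chunk fits
)
print(example_prompt)
```

A character budget is a crude proxy for the model's token limit, but it keeps the example dependency-free.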

The real value is that the embedding model bridges natural-language queries and structured code, so the LLM reasons over the relevant parts of your codebase instead of guessing. You can ask "What function handles user authentication?" or "Show me how to paginate results" and get answers grounded in your actual code, often without grepping the repository by hand. Answer quality still depends on retrieval: if the right chunk is not returned, the LLM has nothing accurate to work from.
