Text-to-SQL models sometimes produce SQL that is syntactically invalid or that doesn’t match the user’s intent. The usual causes are ambiguity in the natural language prompt and the model’s limited knowledge of the database schema.

Let’s see how this actually looks. Imagine a user wants to find the total sales for each product.

User Prompt: "Show me total sales per product."

A naive Text-to-SQL model might produce something like this:

SELECT
  product_name,
  SUM(quantity * price) AS total_sales
FROM
  orders
GROUP BY
  product_name;

This looks reasonable if the orders table has product_name, quantity, and price directly. But what if the schema is more normalized?

Scenario 1: Normalized Schema

Let’s say our database schema is:

  • products table: product_id (PK), product_name
  • order_items table: order_item_id (PK), order_id (FK), product_id (FK), quantity
  • order_prices table: order_price_id (PK), order_item_id (FK), price (at the time of order)

The previous query would fail: product_name, quantity, and price are not columns of a single orders table but are spread across products, order_items, and order_prices. A better query, given this schema, would be:

SELECT
  p.product_name,
  SUM(oi.quantity * op.price) AS total_sales
FROM
  products AS p
JOIN
  order_items AS oi
ON
  p.product_id = oi.product_id
JOIN
  order_prices AS op
ON
  oi.order_item_id = op.order_item_id
GROUP BY
  p.product_name;

This highlights a key challenge: the model needs to understand the relationships between tables (joins) and how to correctly aggregate data across them.
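To make the difference concrete, here is a minimal sketch that runs both queries against an in-memory SQLite database. The sample rows, the `run` helper, and the choice of SQLite are assumptions made for illustration; the table and column names follow the schema above.

```python
import sqlite3

# Hypothetical sample data for the normalized schema described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE order_items (
    order_item_id INTEGER PRIMARY KEY,
    order_id INTEGER,
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER
);
CREATE TABLE order_prices (
    order_price_id INTEGER PRIMARY KEY,
    order_item_id INTEGER REFERENCES order_items(order_item_id),
    price REAL
);
INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget X');
INSERT INTO order_items VALUES (1, 100, 1, 2), (2, 100, 2, 1), (3, 101, 1, 3);
INSERT INTO order_prices VALUES (1, 1, 10.0), (2, 2, 25.0), (3, 3, 12.0);
""")

# The naive query assumes a single denormalized orders table.
NAIVE_SQL = """
SELECT product_name, SUM(quantity * price) AS total_sales
FROM orders
GROUP BY product_name;
"""

# The schema-aware query joins the three normalized tables.
JOINED_SQL = """
SELECT p.product_name, SUM(oi.quantity * op.price) AS total_sales
FROM products AS p
JOIN order_items AS oi ON p.product_id = oi.product_id
JOIN order_prices AS op ON oi.order_item_id = op.order_item_id
GROUP BY p.product_name
ORDER BY p.product_name;
"""


def run(sql):
    """Return the result rows, or an error string if the query cannot run."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


naive_result = run(NAIVE_SQL)    # fails: this database has no orders table
joined_result = run(JOINED_SQL)  # one row per product
print(naive_result)
print(joined_result)
```

The naive query is rejected before it returns any rows, which is the kind of failure a schema-unaware model tends to produce.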

Scenario 2: Ambiguous Natural Language

What if the user meant something slightly different?

User Prompt: "What are the top 5 products by revenue?"

Here, "revenue" could mean gross revenue (as above) or net revenue (if there are discounts).

User Prompt: "List customers who bought product X."

This is ambiguous. Did they buy it once? Multiple times? Did they order it, or did the order complete?

Prompt Engineering to the Rescue

We can guide the model by providing more context in the prompt. This is where prompt engineering comes in.

  1. Schema Awareness: The most effective Text-to-SQL models are often fine-tuned on a specific database schema or are provided with schema information in the prompt.

    Example Prompt:

    Given the following database schema:
    
    -- Table: products
    -- Columns: product_id (INT, PK), product_name (VARCHAR)
    
    -- Table: order_items
    -- Columns: order_item_id (INT, PK), order_id (INT, FK), product_id (INT, FK), quantity (INT)
    
    -- Table: order_prices
    -- Columns: order_price_id (INT, PK), order_item_id (INT, FK), price (DECIMAL)
    
    Generate a SQL query for the following request: "Show me total sales for each product."
    
    -- SQL Query:
    

    With the schema in the prompt, the model has what it needs to construct the correct joins and select the right columns. Ideally, the output is the second query shown above.
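A schema-aware prompt like this is easy to assemble programmatically. The sketch below is one possible way to do it; the `SCHEMA` dictionary layout and the `build_schema_prompt` function are hypothetical names introduced here, not part of any particular model's API.

```python
# Hypothetical schema description. The dictionary layout is an assumption
# made for this sketch, not a format any particular model requires.
SCHEMA = {
    "products": ["product_id (INT, PK)", "product_name (VARCHAR)"],
    "order_items": [
        "order_item_id (INT, PK)",
        "order_id (INT, FK)",
        "product_id (INT, FK)",
        "quantity (INT)",
    ],
    "order_prices": [
        "order_price_id (INT, PK)",
        "order_item_id (INT, FK)",
        "price (DECIMAL)",
    ],
}


def build_schema_prompt(schema, request):
    """Render the schema as SQL comments, followed by the user's request."""
    lines = ["Given the following database schema:", ""]
    for table, columns in schema.items():
        lines.append(f"-- Table: {table}")
        lines.append(f"-- Columns: {', '.join(columns)}")
        lines.append("")
    lines.append(f'Generate a SQL query for the following request: "{request}"')
    lines.append("")
    lines.append("-- SQL Query:")
    return "\n".join(lines)


schema_prompt = build_schema_prompt(SCHEMA, "Show me total sales for each product.")
print(schema_prompt)
```

Keeping the schema in one data structure means the same description can be reused for every request and updated in a single place when the database changes.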

  2. Clarifying Ambiguity with Examples (Few-Shot Learning): You can provide examples of natural language requests and their corresponding SQL queries.

    Example Prompt:

    Given the following database schema:
    ... (schema as above) ...
    
    Here are some examples:
    Request: "List all product names."
    SQL: "SELECT product_name FROM products;"
    
    Request: "How many orders contain product with ID 123?"
    SQL: "SELECT COUNT(DISTINCT order_id) FROM order_items WHERE product_id = 123;"
    
    Now, generate a SQL query for the following request: "Show me total sales for each product."
    
    -- SQL Query:
    

    This helps the model understand the style and logic of translation for your specific database and common query patterns.
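The few-shot prompt can be built the same way. This is a rough sketch: `build_few_shot_prompt` and `FEW_SHOT_EXAMPLES` are hypothetical names, and the schema text passed in is a shortened stand-in for the full schema block.

```python
# Example pairs taken from the prompt above: (natural language request, SQL).
FEW_SHOT_EXAMPLES = [
    ("List all product names.", "SELECT product_name FROM products;"),
    (
        "How many orders contain product with ID 123?",
        "SELECT COUNT(DISTINCT order_id) FROM order_items WHERE product_id = 123;",
    ),
]


def build_few_shot_prompt(schema_text, examples, request):
    """Combine schema text, worked examples, and the new request into one prompt."""
    parts = [
        "Given the following database schema:",
        schema_text,
        "",
        "Here are some examples:",
    ]
    for question, sql in examples:
        parts.append(f'Request: "{question}"')
        parts.append(f'SQL: "{sql}"')
        parts.append("")
    parts.append(f'Now, generate a SQL query for the following request: "{request}"')
    parts.append("")
    parts.append("-- SQL Query:")
    return "\n".join(parts)


few_shot_prompt = build_few_shot_prompt(
    "-- Table: products\n-- Columns: product_id (INT, PK), product_name (VARCHAR)",
    FEW_SHOT_EXAMPLES,
    "Show me total sales for each product.",
)
print(few_shot_prompt)
```

Storing the examples as data makes it straightforward to swap in pairs that resemble the incoming request, which is usually where few-shot prompting helps most.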

  3. Specifying Aggregations and Filters: If the request is complex, break it down or explicitly state what you need.

    User Prompt: "Show me the total revenue for each product, excluding any orders placed before 2023."

    Engineered Prompt:

    Given the following database schema:
    ... (schema as above) ...
    And a table `orders` with `order_id` (PK) and `order_date` (DATE)
    
    Generate a SQL query for the following request: "Show me the total sales for each product, filtering out orders placed before January 1, 2023."
    
    -- SQL Query:
    

    The model would need to join the orders table as well and apply the WHERE clause.

    SELECT
      p.product_name,
      SUM(oi.quantity * op.price) AS total_sales
    FROM
      products AS p
    JOIN
      order_items AS oi
    ON
      p.product_id = oi.product_id
    JOIN
      order_prices AS op
    ON
      oi.order_item_id = op.order_item_id
    JOIN
      orders AS o
    ON
      oi.order_id = o.order_id
    WHERE
      o.order_date >= '2023-01-01'
    GROUP BY
      p.product_name;
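A quick way to check that the date filter behaves as intended is to run the query on a couple of hand-made rows. The sketch below assumes SQLite, which stores dates as text; ISO-8601 strings such as '2023-01-01' compare correctly as plain text, but other databases have a native DATE type and may behave differently. The sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, order_id INTEGER,
                          product_id INTEGER, quantity INTEGER);
CREATE TABLE order_prices (order_price_id INTEGER PRIMARY KEY,
                           order_item_id INTEGER, price REAL);
-- One order just before the cutoff and one exactly on it.
INSERT INTO products VALUES (1, 'Widget');
INSERT INTO orders VALUES (100, '2022-12-31'), (101, '2023-01-01');
INSERT INTO order_items VALUES (1, 100, 1, 2), (2, 101, 1, 3);
INSERT INTO order_prices VALUES (1, 1, 10.0), (2, 2, 12.0);
""")

FILTERED_SQL = """
SELECT p.product_name, SUM(oi.quantity * op.price) AS total_sales
FROM products AS p
JOIN order_items AS oi ON p.product_id = oi.product_id
JOIN order_prices AS op ON oi.order_item_id = op.order_item_id
JOIN orders AS o ON oi.order_id = o.order_id
WHERE o.order_date >= '2023-01-01'
GROUP BY p.product_name;
"""

# Only the 2023 order should contribute: 3 units at 12.0 each.
filtered_rows = conn.execute(FILTERED_SQL).fetchall()
print(filtered_rows)
```

Boundary rows like these (one day before the cutoff, one on it) are a cheap way to catch an off-by-one in the comparison operator.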
    
  4. Handling Data Types and Functions: Sometimes, you need to guide the model on how to handle specific data types or use certain SQL functions.

    User Prompt: "What is the average quantity sold per order item for product 'Gadget X'?"

    Engineered Prompt:

    Given the following database schema:
    ... (schema as above) ...
    Use the AVG() aggregate function and ensure the result is a decimal.
    
    Generate a SQL query for the following request: "What is the average quantity sold per order item for product 'Gadget X'?"
    
    -- SQL Query:

    A query that follows these instructions might look like this:

    SELECT
      AVG(CAST(oi.quantity AS DECIMAL(10, 2))) AS average_quantity
    FROM
      order_items AS oi
    JOIN
      products AS p
    ON
      oi.product_id = p.product_id
    WHERE
      p.product_name = 'Gadget X';
    

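    The CAST guards against integer truncation: in dialects where AVG over an integer column returns an integer (SQL Server, for example), casting first keeps the fractional part. PostgreSQL and MySQL already return a decimal from AVG over integers, so there the CAST is harmless but not strictly required.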
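The truncation problem is easy to reproduce. The sketch below uses SQLite, whose built-in AVG already returns a floating-point value, so the truncation is shown with a hand-written SUM / COUNT instead. That substitution is an assumption made so the example runs with the Python standard library; the sample quantities are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, quantity INTEGER);
INSERT INTO order_items (quantity) VALUES (1), (2), (4);
""")

# Three ways to average the same integer column.
int_avg, real_avg, builtin_avg = conn.execute("""
SELECT
  SUM(quantity) / COUNT(*),                -- integer division truncates
  CAST(SUM(quantity) AS REAL) / COUNT(*),  -- casting first keeps the fraction
  AVG(quantity)                            -- SQLite's AVG returns a float
FROM order_items;
""").fetchone()

print(int_avg, real_avg, builtin_avg)
```

The first column loses the fractional part while the other two keep it, which is why an explicit instruction about decimal results can matter when the target dialect is unknown.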

  5. Post-Processing and Validation: Even with good prompts, it’s crucial to validate the generated SQL. This might involve:

    • Syntax Check: Running the SQL through a linter or a dry-run execution.
    • Schema Validation: Ensuring all referenced tables and columns exist and are used correctly.
    • Logical Validation: Manually checking if the query actually answers the user’s question, especially for complex requests.
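The first two checks can be automated. One possible approach, sketched below, is to compile the generated query with EXPLAIN against an empty copy of the schema, so that syntax errors and unknown tables or columns surface without touching real data. The `validate_sql` helper is a hypothetical name, and SQLite stands in for whatever database you actually use; logical validation still needs a human or a test suite.

```python
import sqlite3


def validate_sql(conn, sql):
    """Compile the query without running it; return (ok, message)."""
    try:
        # EXPLAIN makes SQLite parse and plan the statement, which surfaces
        # syntax errors and unknown tables or columns, without executing it.
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as exc:
        return False, str(exc)
    return True, "ok"


# An empty database holding only the schema (no data is needed to validate).
schema_conn = sqlite3.connect(":memory:")
schema_conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, order_id INTEGER,
                          product_id INTEGER, quantity INTEGER);
CREATE TABLE order_prices (order_price_id INTEGER PRIMARY KEY,
                           order_item_id INTEGER, price REAL);
""")

good = validate_sql(schema_conn, "SELECT product_name FROM products;")
bad_table = validate_sql(schema_conn, "SELECT product_name FROM orders;")
bad_syntax = validate_sql(schema_conn, "SELEC product_name FROM products;")
print(good, bad_table, bad_syntax)
```

A failed check can also be fed back to the model as part of a retry prompt, since the error message usually names the offending table, column, or token.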

The core idea is that Text-to-SQL models are powerful but often benefit from explicit guidance. By providing schema context, examples, and clear instructions, you can significantly improve the accuracy and relevance of the generated SQL queries.

One subtle but powerful technique is to include a "database dialect" hint if your model supports it, like `sql -- PostgreSQL` or `sql -- MySQL`. This primes the model to use syntax and functions specific to that database system, avoiding common cross-dialect errors.

The next hurdle you’ll likely encounter is handling complex subqueries or window functions, which require even more precise prompt engineering and a model with a firmer grasp of the underlying SQL concepts.
