Reliable JSON extraction is less about asking the LLM to "give me JSON" and more about spelling out how each field should be found, transformed, and formatted.

Let’s say you have this raw, messy text:

User inquiry: I need to book a flight from New York (JFK) to London Heathrow (LHR) for two adults on March 15th, 2024. I'd prefer a non-stop flight if possible. My budget is around $1200.

And you want to extract structured data like this:

{
  "origin": "JFK",
  "destination": "LHR",
  "departure_date": "2024-03-15",
  "passengers": 2,
  "preferences": ["non-stop"],
  "budget": 1200
}

Here’s how you might prompt for it, focusing on the process of extraction rather than just the output format.

You are a highly skilled data extraction agent. Your task is to parse the following user inquiry and extract specific pieces of information into a JSON object.

For each field in the JSON, identify the corresponding information in the text. If a field is not present, use null.

**JSON Schema:**
{
  "origin": "string",
  "destination": "string",
  "departure_date": "YYYY-MM-DD",
  "passengers": "integer",
  "preferences": ["string"],
  "budget": "number"
}

**Extraction Rules:**
- For `origin` and `destination`, extract the 3-letter airport codes if available. If not, extract the city name.
- For `departure_date`, reformat the date into YYYY-MM-DD.
- For `passengers`, infer the number of adults. Assume 1 if not specified.
- For `preferences`, list any stated travel preferences (e.g., "non-stop", "window seat", "vegetarian meal").
- For `budget`, extract the numerical value of the budget. Ignore currency symbols and text like "around" or "up to".

**User Inquiry:**

User inquiry: I need to book a flight from New York (JFK) to London Heathrow (LHR) for two adults on March 15th, 2024. I’d prefer a non-stop flight if possible. My budget is around $1200.


**Output:**
```json

Because the prompt ends on an open JSON code fence, the LLM continues by generating the JSON object itself.
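On the receiving side, the completion still has to be turned into a data structure. The following is a minimal sketch, assuming the model returns the JSON object followed by a closing code fence; `parse_completion` and `sample_completion` are hypothetical names, and the sample text is written by hand rather than taken from a real model.

```python
import json

FENCE = "`" * 3  # the three-backtick code fence marker

def parse_completion(completion: str) -> dict:
    """Parse a completion that continues a prompt ending in an open JSON fence."""
    # Keep only the text before the closing fence, if the model emitted one.
    body = completion.split(FENCE, 1)[0]
    return json.loads(body)

# Hand-written stand-in for a model completion (not real model output).
sample_completion = (
    '{\n'
    '  "origin": "JFK",\n'
    '  "destination": "LHR",\n'
    '  "departure_date": "2024-03-15",\n'
    '  "passengers": 2,\n'
    '  "preferences": ["non-stop"],\n'
    '  "budget": 1200\n'
    '}\n' + FENCE
)

record = parse_completion(sample_completion)
print(record["origin"], record["departure_date"], record["passengers"])
```

If the model adds commentary before the JSON, `json.loads` raises `json.JSONDecodeError`, which is a useful signal to retry or tighten the prompt.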

The surprising part is that the LLM doesn’t actually "understand" JSON. It’s not a database or a parser. It’s a sequence predictor that has seen an enormous amount of text, including countless examples of JSON and the explanations surrounding it. When you provide a JSON schema and extraction rules, you’re giving it a highly specific pattern to match and complete. It isn’t validating against the schema; it’s generating text that conforms to the pattern the schema and rules describe, conditioned on the input text.
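Since the model generates text that merely resembles the schema, any real validation has to happen in your own code. Here is a minimal sketch, assuming the six fields from the schema above and treating null as allowed for every field; `EXPECTED_TYPES` and `find_schema_problems` are hypothetical names chosen for illustration.

```python
# Expected Python type for each field after json.loads (illustrative mapping).
EXPECTED_TYPES = {
    "origin": str,
    "destination": str,
    "departure_date": str,
    "passengers": int,
    "preferences": list,
    "budget": (int, float),
}

def find_schema_problems(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            continue  # the prompt tells the model to use null for absent fields
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems

good = {"origin": "JFK", "destination": "LHR", "departure_date": "2024-03-15",
        "passengers": 2, "preferences": ["non-stop"], "budget": 1200}
bad = dict(good, passengers="2")

print(find_schema_problems(good))
print(find_schema_problems(bad))
```

A dedicated schema library would cover more cases; the point of the sketch is only that the check lives outside the model.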

Consider `departure_date`. The input is "March 15th, 2024". The schema specifies "YYYY-MM-DD". The LLM has learned from its training data that "March 15th, 2024" is a common way to write a date, and that "2024-03-15" is another, more structured way. It predicts the latter sequence of characters because it’s the most probable continuation given the requested YYYY-MM-DD format and the input date.
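Because the reformatted date is predicted rather than computed, it can be worth cross-checking it against a deterministic conversion. The sketch below assumes English month names in the "Month DDth, YYYY" style used in the example; `to_iso_date` is a hypothetical helper, and `%B` depends on the process locale.

```python
import re
from datetime import datetime

def to_iso_date(text: str) -> str:
    """Convert a date like 'March 15th, 2024' to 'YYYY-MM-DD'."""
    # Drop ordinal suffixes so strptime can read the day: "15th" -> "15".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)
    # %B matches the full month name (locale-dependent; English assumed here).
    return datetime.strptime(cleaned, "%B %d, %Y").date().isoformat()

print(to_iso_date("March 15th, 2024"))  # 2024-03-15
```

Comparing this value with the model's `departure_date` catches the occasional off-by-one or swapped month and day.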

Let’s run it with a slightly different input and see how the extraction adapts.

User inquiry: Planning a trip from SFO to DEN for one person next Friday. Need to keep costs under $500. Flexible on dates, but would like to fly out in the morning.

And the prompt:

You are a highly skilled data extraction agent. Your task is to parse the following user inquiry and extract specific pieces of information into a JSON object.

For each field in the JSON, identify the corresponding information in the text. If a field is not present, use null.

**JSON Schema:**
{
  "origin": "string",
  "destination": "string",
  "departure_date": "YYYY-MM-DD",
  "passengers": "integer",
  "preferences": ["string"],
  "budget": "number"
}

**Extraction Rules:**
- For `origin` and `destination`, extract the 3-letter airport codes if available. If not, extract the city name.
- For `departure_date`, reformat the date into YYYY-MM-DD. If a relative date is given (e.g., "next Friday"), infer the actual date assuming today is 2024-03-08.
- For `passengers`, infer the number of adults. Assume 1 if not specified.
- For `preferences`, list any stated travel preferences (e.g., "non-stop", "window seat", "vegetarian meal", "morning flight").
- For `budget`, extract the numerical value of the budget. Ignore currency symbols and text like "around" or "up to".

**User Inquiry:**

User inquiry: Planning a trip from SFO to DEN for one person next Friday. Need to keep costs under $500. Flexible on dates, but would like to fly out in the morning.


**Output:**
```json

The most important lever you control is how explicitly you define the transformation and inference rules. For dates, you may need to tell the model "assume today’s date is YYYY-MM-DD" when the text uses relative terms like "next Friday." For numerical values, say how to treat an upper bound like "under $500" versus an exact "$500". The LLM doesn’t have a built-in calendar or calculator; it’s pattern matching and completing sequences based on your instructions and its training data.
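One way to keep the reference date explicit is to generate that rule line from code rather than typing it by hand. This is a small sketch, assuming the rule wording from the prompt above; `date_rule` is a hypothetical helper name.

```python
from datetime import date

def date_rule(today: date) -> str:
    """Build the departure_date rule with an explicit reference date."""
    return (
        "- For `departure_date`, reformat the date into YYYY-MM-DD. "
        'If a relative date is given (e.g., "next Friday"), infer the actual '
        f"date assuming today is {today.isoformat()}."
    )

# In production you would likely pass date.today(); a fixed date is used here
# so the output matches the example prompt.
print(date_rule(date(2024, 3, 8)))
```

Passing the date in as a parameter also makes the prompt reproducible in tests.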

The trickiest part for many users is that LLMs don’t inherently perform arithmetic or date calculations. When you ask the model to infer "next Friday" from 2024-03-08 (itself a Friday), it isn’t running a calendar algorithm to arrive at March 15th. It’s predicting the string "2024-03-15" because, given the reference date in the prompt and the requested YYYY-MM-DD format, that is the most probable continuation. This is why providing explicit context or examples for date and number transformations is vital.
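For comparison, the actual calendar calculation takes only a few lines, so it can serve as a check on the model's answer. The sketch assumes "next Friday" means the first Friday strictly after the reference date, which is one of several readings of the phrase; `next_weekday` is a hypothetical helper.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0, so Friday is 4

def next_weekday(today: date, weekday: int) -> date:
    """Return the first date strictly after `today` that falls on `weekday`."""
    days_ahead = (weekday - today.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # today is already that weekday, so jump a full week
    return today + timedelta(days=days_ahead)

print(next_weekday(date(2024, 3, 8), FRIDAY).isoformat())  # 2024-03-15
```

If the model's `departure_date` disagrees with this value, either the prompt's reference date or the interpretation of the phrase needs to be made more explicit.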

The next challenge is handling complex nesting and conditional logic within the JSON structure.
