The most surprising truth about prompt engineering for entity extraction is that it’s often less about crafting the perfect "prompt" and more about carefully constructing the output format you expect from the LLM.
Let’s see this in action. Imagine we want to extract company names and their founding years from a piece of text.
The company, FutureCorp, was established in 2015. It quickly grew.
Later, InnovateSolutions, founded in 2018, disrupted the market.
A naive prompt might be: "Extract company names and founding years from this text." The LLM might give you:
FutureCorp, 2015
InnovateSolutions, 2018
This is okay, but what if the text was more complex? Or what if you needed this data for a downstream system that expects JSON? The LLM’s freeform output becomes a problem.
The real power comes from telling the LLM exactly how to structure its answer. The simplest version is a zero-shot prompt that spells out the output format; when the format is unusual or the model drifts, you can reinforce it with few-shot examples.
Here’s a more effective prompt:
Extract structured data about companies and their founding years from the following text.
Return the output as a JSON array of objects, where each object has keys "company_name" and "founding_year".
Text:
The company, FutureCorp, was established in 2015. It quickly grew.
Later, InnovateSolutions, founded in 2018, disrupted the market.
JSON Output:
Now, when you send this to an LLM (like GPT-4, Claude 3, or Gemini), you’ll get something like this:
[
  {
    "company_name": "FutureCorp",
    "founding_year": 2015
  },
  {
    "company_name": "InnovateSolutions",
    "founding_year": 2018
  }
]
This JSON output is machine-readable, consistent, and much more useful. You’ve effectively turned the LLM into a data parser.
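Machine-readable output is only useful if you check it before trusting it. The following is a minimal sketch of that check, assuming the model's reply arrives as a plain string; `parse_companies` and `REQUIRED_KEYS` are hypothetical names chosen for illustration, and no particular LLM client is assumed.

```python
import json

# Keys the prompt asked for; anything else counts as a malformed record.
REQUIRED_KEYS = {"company_name", "founding_year"}


def parse_companies(raw_reply: str) -> list[dict]:
    """Parse the model's reply and verify it matches the requested shape."""
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    records = json.loads(raw_reply)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects")
    for record in records:
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            raise ValueError(f"unexpected record shape: {record!r}")
        if not isinstance(record["company_name"], str):
            raise ValueError(f"company_name is not a string: {record!r}")
        year = record["founding_year"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(year, int) or isinstance(year, bool):
            raise ValueError(f"founding_year is not an integer: {record!r}")
    return records


# The reply shown above, as a single string.
reply = (
    '[{"company_name": "FutureCorp", "founding_year": 2015},'
    ' {"company_name": "InnovateSolutions", "founding_year": 2018}]'
)
companies = parse_companies(reply)
print(companies[0]["company_name"])  # prints: FutureCorp
```

Failing loudly on an unexpected shape lets you retry the request or log the reply instead of passing bad records downstream.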
The mental model here is that the LLM isn’t just "understanding" your text; it’s performing a sophisticated pattern-matching and generation task based on the provided context and your instructions. By specifying the output structure (like JSON, YAML, or even a specific CSV format), you guide its generation process.
The problem this solves is the gap between simple text-to-text generation and reliable, structured data extraction. Closing that gap is crucial for automating workflows, populating databases, and feeding information into other applications. The LLM acts as a flexible "ETL" tool that reads unstructured text and transforms it into structured data.
The levers you control are:
- The input text: The quality and clarity of the source material.
- The explicit instructions: What you want extracted.
- The output format specification: How you want it returned. This is the most powerful lever.
- Few-shot examples (optional but recommended): Providing 1-3 examples of input text and the exact desired output format reinforces the LLM’s understanding of your requirements.
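These levers can be assembled programmatically. The sketch below is one possible way to do it, not a prescribed method: `build_prompt` is a hypothetical helper, and the "Acme Labs" example pair is invented purely for illustration. It reuses the "Text:" / "JSON Output:" layout from the prompt above.

```python
import json


def build_prompt(
    instructions: str,
    examples: list[tuple[str, list[dict]]],
    text: str,
) -> str:
    """Combine instructions, optional few-shot examples, and the input text."""
    parts = [instructions.strip(), ""]
    for example_text, example_output in examples:
        # Each example shows the model an input and the exact output wanted.
        parts += [
            "Text:",
            example_text.strip(),
            "JSON Output:",
            json.dumps(example_output),
            "",
        ]
    # End on the output cue so the model continues with JSON.
    parts += ["Text:", text.strip(), "JSON Output:"]
    return "\n".join(parts)


instructions = (
    "Extract structured data about companies and their founding years "
    "from the following text.\n"
    "Return the output as a JSON array of objects, where each object has "
    'keys "company_name" and "founding_year".'
)

# Invented example pair, used only to demonstrate the format.
examples = [
    (
        "Acme Labs was founded in 1999.",
        [{"company_name": "Acme Labs", "founding_year": 1999}],
    )
]

prompt = build_prompt(
    instructions,
    examples,
    "The company, FutureCorp, was established in 2015. It quickly grew.",
)
print(prompt)
```

Passing an empty `examples` list yields the zero-shot form of the same prompt.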
Consider this prompt for a slightly more complex scenario, extracting product names and their prices:
Extract product information from the following text.
Return the output as a JSON array of objects, where each object has keys "product_name" and "price".
The price should be an integer representing cents (e.g., $19.99 becomes 1999).
Text:
We offer the amazing Widget Pro for just $49.95.
Also available is the Standard Widget at $25.00.
Don't forget the Deluxe Widget, priced at $75.50.
JSON Output:
This would yield:
[
  {
    "product_name": "Widget Pro",
    "price": 4995
  },
  {
    "product_name": "Standard Widget",
    "price": 2500
  },
  {
    "product_name": "Deluxe Widget",
    "price": 7550
  }
]
The key here is that the LLM performs the dollars-to-cents conversion because the prompt states the rule explicitly and includes a worked example ($19.99 becomes 1999). Models can still slip on arithmetic, so validate converted values when they matter downstream.
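Because the conversion is deterministic, it is easy to cross-check in code. The following sketch assumes prices appear in the source text as dollar amounts such as $49.95; `dollars_to_cents` and `unverified_prices` are hypothetical helper names, and the regular expression is a simplification that will not cover every price format.

```python
import re
from decimal import Decimal


def dollars_to_cents(amount: str) -> int:
    """Convert a string like '$19.99' to integer cents without float error."""
    return int(Decimal(amount.lstrip("$").replace(",", "")) * 100)


def unverified_prices(source_text: str, records: list[dict]) -> list[dict]:
    """Return records whose price does not match any amount in the source."""
    amounts = re.findall(r"\$\d[\d,]*(?:\.\d{2})?", source_text)
    source_cents = {dollars_to_cents(amount) for amount in amounts}
    return [record for record in records if record["price"] not in source_cents]


text = (
    "We offer the amazing Widget Pro for just $49.95. "
    "Also available is the Standard Widget at $25.00. "
    "Don't forget the Deluxe Widget, priced at $75.50."
)
records = [
    {"product_name": "Widget Pro", "price": 4995},
    {"product_name": "Standard Widget", "price": 2500},
    {"product_name": "Deluxe Widget", "price": 7550},
]
print(unverified_prices(text, records))  # prints: []
```

An empty list means every extracted price corresponds to an amount that actually appears in the text. It does not prove that each price is attached to the right product.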
One of the most potent, yet often overlooked, aspects of this is the LLM’s ability to infer implicit relationships and perform simple data transformations when guided by the output structure. You often don’t need a separate prompt step for "convert currency" or "parse dates" if you can bake that transformation into the expected output format itself. For instance, you can ask for a normalized date or a Unix timestamp instead of a free-form date string, or for a numerical ID instead of a person’s name, provided the prompt supplies the name-to-ID mapping. The LLM then "figures out" how to map the input text to that specific, transformed output representation. This is most reliable for simple, rule-like transformations; the more computation a conversion needs, the more it pays to verify the result in code.
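For a transformation like the Unix timestamp case, one option is to request a simpler intermediate form and finish the conversion deterministically. The sketch below assumes the model has been asked for an ISO 8601 date (YYYY-MM-DD) and that midnight UTC is the intended instant; both are assumptions made for illustration, and `iso_date_to_unix` is a hypothetical helper.

```python
from datetime import datetime, timezone


def iso_date_to_unix(date_str: str) -> int:
    """Convert 'YYYY-MM-DD' to a Unix timestamp at midnight UTC."""
    # strptime raises ValueError if the model returned a malformed date.
    moment = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


print(iso_date_to_unix("2018-01-01"))  # prints: 1514764800
```

The same function can also serve as a check on a timestamp the model produced directly.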
The next step is often handling disambiguation or context-dependent extraction.