OpenAI Vision can analyze documents and images. Its main strength is bridging the gap between pixel data and structured understanding: it turns raw visual input into text you can act on.
Let’s see it in action. Imagine you have a scanned invoice, a screenshot of a product page, or even a photo of a whiteboard with meeting notes. You can feed this image directly to the Vision API.
import base64
import mimetypes
import os

import requests

# Read the API key from the environment rather than hard-coding it
API_KEY = os.environ.get("OPENAI_API_KEY")
if API_KEY is None:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")


# Function to encode the image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image file
image_file_path = "path/to/your/invoice.png"  # Replace with your image file
base64_image = encode_image(image_file_path)
# Match the data URL's MIME type to the file (image/png for a .png file)
mime_type = mimetypes.guess_type(image_file_path)[0] or "image/png"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}
payload = {
    # Model named in this article; swap in a current vision-capable model if it is unavailable
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the invoice number, total amount, and due date from this invoice."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers=headers,
    json=payload,
    timeout=60,
)
response.raise_for_status()  # Surface HTTP errors (bad key, rate limit) early
print(response.json())
Running this code with an invoice image might produce output like this:
{
  "id": "chatcmpl-xxxxxxxxxxxxxxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": 1704053652,
  "model": "gpt-4-vision-preview-xxxx",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Invoice Number: INV-2023-00123\nTotal Amount: $1,250.75\nDue Date: 2023-12-31"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 25,
    "total_tokens": 175
  }
}
This demonstrates how Vision can ingest an image and, guided by a text prompt, extract specific, structured data.
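The reply arrives as free text inside the response JSON, so a small parsing step is usually needed before the values can be used in code. The sketch below assumes the model answers in the "Key: Value" line format shown in the sample output, which is not guaranteed. The helper name `parse_key_value_reply` is hypothetical, and the sample response is trimmed to the fields the sketch reads.

```python
def parse_key_value_reply(content: str) -> dict:
    """Split 'Key: Value' lines into a dictionary, skipping lines without a colon."""
    fields = {}
    for line in content.splitlines():
        # partition splits on the first colon only, so values may contain colons
        key, separator, value = line.partition(":")
        if separator and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Trimmed copy of the sample response; the reply text lives at
# choices[0].message.content in the response JSON.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": (
                    "Invoice Number: INV-2023-00123\n"
                    "Total Amount: $1,250.75\n"
                    "Due Date: 2023-12-31"
                ),
            },
            "finish_reason": "stop",
        }
    ]
}

reply = sample_response["choices"][0]["message"]["content"]
invoice = parse_key_value_reply(reply)
print(invoice["Invoice Number"])  # INV-2023-00123
print(invoice["Total Amount"])    # $1,250.75
```

Because the model is free to phrase its answer differently from one call to the next, production code should treat missing keys as an expected case rather than an error in the parser.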
The core problem Vision solves is the inherent difficulty in making computers "see" and "understand" visual content. Traditional OCR (Optical Character Recognition) is good at converting text in images to machine-readable text, but it struggles with layout, context, and non-textual elements. Vision, powered by large multimodal models, goes beyond simple OCR. It understands the meaning of the visual elements.
OpenAI has not published the architecture of its vision models in detail, but multimodal models of this kind generally work as follows. A vision encoder converts the image into a series of embeddings. These embeddings are fed into a language model alongside the text prompt. The language model, trained on vast amounts of image-text pairs, can then reason about the visual content and generate a coherent textual response. The gpt-4-vision-preview model is one example, accepting both text and image inputs within a single prompt.
You control Vision primarily through the prompt and the image input itself. The prompt is your instruction manual: do you want the model to describe the image, identify objects, read text, extract specific data points, or compare two images? The content array in the payload lets you mix text parts and image URLs (or base64-encoded images) to build complex queries. The max_tokens parameter caps the length of the generated response, which bounds cost and latency but can cut a long answer short.
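As a minimal sketch of these controls, the helpers below assemble a payload from a prompt and a list of image URLs, and check whether a reply was cut off by the token cap. The function names `build_payload` and `was_truncated` are hypothetical, and the example URL is a placeholder. The check relies on the Chat Completions API reporting a `finish_reason` of `"length"` when the cap is reached.

```python
def build_payload(prompt, image_urls, max_tokens=300, model="gpt-4-vision-preview"):
    """Build a chat completions payload with one text part and any number of images."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        # Each URL may be an https:// link or a base64 data URL
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def was_truncated(response_json):
    """Return True if the first choice stopped because it hit max_tokens."""
    return response_json["choices"][0]["finish_reason"] == "length"


payload = build_payload(
    "Describe this whiteboard photo in three bullet points.",
    ["https://example.com/whiteboard.jpg"],  # placeholder URL
    max_tokens=150,
)
print(len(payload["messages"][0]["content"]))  # 2 parts: one text, one image
```

If `was_truncated` returns True for a response, raising `max_tokens` or asking for a shorter answer in the prompt are the two obvious adjustments.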
The way the model interprets spatial relationships is a key differentiator. It doesn’t just see words on a page; it understands that a number listed next to "Total" is likely the total amount, and that a date appearing near "Due By" is the due date. This contextual understanding is crucial for extracting information accurately from semi-structured documents like invoices, receipts, or forms. It can also discern nuances in images, like identifying the brand of a product from a photo or describing the sentiment of a scene.
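One way to make this kind of extraction easier to consume is to ask for JSON in the prompt and then pull the object out of the reply. The sketch below is an assumption about how you might do that, not a documented feature of the model: the prompt wording, the key names, and the helper `extract_json_object` are all hypothetical. The helper tolerates extra text around the object because a prompt alone does not guarantee a bare JSON reply.

```python
import json

# Hypothetical prompt; the key names are illustrative, not required by the API
JSON_PROMPT = (
    "Extract the invoice number, total amount, and due date from this invoice. "
    'Reply with only a JSON object using the keys "invoice_number", '
    '"total_amount", and "due_date". Use null for any field you cannot read.'
)


def extract_json_object(reply):
    """Parse the span from the first '{' to the last '}' in a model reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the model reply.")
    return json.loads(reply[start:end + 1])


# Example reply with some surrounding text, as a model might produce
example_reply = (
    "Here is the extracted data:\n"
    '{"invoice_number": "INV-2023-00123", '
    '"total_amount": "$1,250.75", "due_date": "2023-12-31"}'
)
fields = extract_json_object(example_reply)
print(fields["due_date"])  # 2023-12-31
```

`json.loads` still raises `json.JSONDecodeError` if the extracted span is not valid JSON, so callers should be prepared to retry or fall back to manual review.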
When you provide multiple images in a single request, the model treats them as a sequence, allowing for comparisons or analyses that span across different visual inputs. It can answer questions like "Which of these two logos is more prominent?" or "Summarize the key differences between these two product descriptions shown in screenshots."
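A multi-image request can be sketched as a content array that interleaves short text labels with the images, so the question can refer to "Image 1" and "Image 2" unambiguously. The labeling is a suggested convention, not something the API requires, and the helper name `labeled_image_content` and the example URLs are hypothetical.

```python
def labeled_image_content(question, image_urls):
    """Build a content array with a text label before each image and the question last."""
    content = []
    for number, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {number}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": question})
    return content


content = labeled_image_content(
    "Summarize the key differences between the product descriptions in Image 1 and Image 2.",
    [
        "https://example.com/product-a.png",  # placeholder URLs
        "https://example.com/product-b.png",
    ],
)
message = {"role": "user", "content": content}
print([part["type"] for part in content])
# ['text', 'image_url', 'text', 'image_url', 'text']
```

The resulting `message` goes into the `messages` list of the payload in place of the single-image message used earlier.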
The next frontier is understanding dynamic visual content and integrating Vision with other modalities for even richer interactions.