Ollama’s REST API is actually a surprisingly powerful tool for integrating local Large Language Models (LLMs) into your Python applications, often bypassing the need for complex client libraries.
Let’s see it in action. Imagine you have Ollama running locally and have pulled a model like llama3. You can interact with it directly via HTTP.
import requests
import json
url = "http://localhost:11434/api/generate"
payload = {
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": False
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(payload), headers=headers)
data = response.json()
print(data['response'])
This simple Python script, using the standard requests library, sends a POST request to Ollama’s /api/generate endpoint. The payload specifies the model to use (llama3), the prompt for the LLM, and crucially, stream: False to get a single, complete response. Ollama processes this request and sends back a JSON object containing the LLM’s generated response.
The core problem Ollama solves is democratizing access to powerful LLMs. Instead of relying on cloud-based APIs with their associated costs and latency, you can run these models on your own hardware. The REST API provides a universal interface, meaning you don’t need specific Python SDKs for each LLM provider; a simple HTTP client is enough. Internally, Ollama manages the model loading, inference, and API serving. When you send a request, Ollama routes it to the appropriate loaded model, executes the inference, and formats the output back into a JSON response.
The exact levers you control are primarily within the payload dictionary. You can adjust model to switch between any LLMs you’ve downloaded with Ollama. The prompt is your direct instruction to the LLM. stream: True will yield a response token by token, allowing for more interactive applications, while False gives you the final, complete answer. Other parameters like temperature, top_p, num_predict, and max_tokens offer fine-grained control over the generation process, influencing creativity, coherence, and length. For example, setting temperature to 0.1 will make the output more deterministic and focused, while 1.0 encourages more randomness and creativity.
When you set stream: True in your API call, Ollama doesn’t wait for the entire response to be generated before sending anything back. Instead, it sends a series of smaller JSON objects over a single HTTP connection, each containing a chunk of the generated text, often delimited by newline characters. This allows your Python application to start processing or displaying the LLM’s output as it’s being produced, creating a more responsive user experience, similar to how you see text appearing word by word in a chat interface.
The next step is exploring the /api/chat endpoint for conversational interactions and managing model lifecycles with the /api/pull and /api/delete endpoints.