Ollama’s OpenAI API compatibility means you can run large language models locally and swap them in for OpenAI’s cloud-based services with minimal code changes.
Let’s see it in action: start a model with ollama run llama3, then query it using curl.
# Download llama3 if needed and open an interactive session (the Ollama server must be running)
ollama run llama3
# In a separate terminal, make a request
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
The output will look something like this:
{
  "id": "chatcmpl-1234567890",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}
This demonstrates how Ollama mimics the OpenAI API structure, allowing existing applications to point to http://localhost:11434/v1 instead of https://api.openai.com/v1 and use local models.
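To make the base-URL swap concrete, here is a minimal Python sketch that uses only the standard library. The helper names build_chat_request and send_chat are hypothetical, and the api_key default is a placeholder: the Authorization header is included only so the same code can target either endpoint, on the assumption that a local Ollama server does not check it.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"
OPENAI_BASE_URL = "https://api.openai.com/v1"


def build_chat_request(base_url, model, user_prompt, api_key="unused"):
    """Build (but do not send) a chat-completions request for the given base URL."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder credential; assumed to be ignored by a local server.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request touches no network, so this part runs without a server.
local_request = build_chat_request(
    OLLAMA_BASE_URL, "llama3", "What is the capital of France?"
)
print(local_request.full_url)
```

With Ollama running and llama3 downloaded, send_chat(local_request) should return the model's answer. Passing OPENAI_BASE_URL, a real API key, and an OpenAI model name targets the cloud service instead; nothing else in the code changes.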
Ollama addresses three concerns that come with relying solely on cloud-based LLM APIs: cost, latency, and data privacy. By running models locally, you keep your data on your own machine, eliminate network round trips to a remote service, and avoid per-token pricing. Ollama handles the complexity of downloading, managing, and serving these models, presenting a unified, OpenAI-compatible interface.
Internally, Ollama uses a Go backend to manage model lifecycles and serve requests. When you run ollama run <model_name>, the command-line client contacts the Ollama server, which downloads the model weights (often in GGUF format) and configuration files if they are not already present. The server then loads the model into memory (or VRAM if a GPU is available) and serves it over an HTTP API, by default on port 11434. Part of this API is designed to mirror OpenAI API endpoints, including /v1/chat/completions and /v1/embeddings. Ollama’s backend translates each incoming request into instructions for the underlying inference engine (such as llama.cpp) to generate a response.
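The /v1/embeddings endpoint follows the same pattern as chat completions. The sketch below only builds the request body and parses a response of the OpenAI embeddings shape; it does not contact a server. The model name nomic-embed-text, the helper names, and the two-dimensional sample vectors are assumptions for illustration, not output from a real model.

```python
import json


def build_embeddings_payload(model, texts):
    """Return the JSON body for an OpenAI-style embeddings request."""
    return {"model": model, "input": list(texts)}


def extract_embeddings(response):
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Made-up response in the OpenAI embeddings shape (deliberately out of order).
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
}

payload = build_embeddings_payload("nomic-embed-text", ["hello", "world"])
print(json.dumps(payload))
print(extract_embeddings(sample_response))
```

To try it against a live server, the payload would be sent as a POST to http://localhost:11434/v1/embeddings after pulling an embedding model.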
The model field in the curl request ("model": "llama3") maps directly to the name of a model that Ollama has downloaded and is ready to serve. Ollama maintains a local store of downloaded models. The ollama run command pulls a missing model automatically, whereas an API request that names a model you haven’t downloaded typically fails with an error stating that the model was not found. You can list your downloaded models with ollama list.
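An application can check for a model before sending a chat request. The sketch below relies on Ollama's native /api/tags endpoint, which returns the downloaded models as JSON. The helper names and the sample data are hypothetical, and the handling of the implicit :latest tag is an assumption about how untagged names are resolved.

```python
import json
import urllib.request


def fetch_tags(host="http://localhost:11434", timeout=10):
    """Fetch the list of downloaded models (needs a running Ollama server)."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as response:
        return json.load(response)


def is_installed(tags_response, model):
    """Check whether a model name appears in an /api/tags style response."""
    # Assumption: a name without an explicit tag refers to the ':latest' tag.
    wanted = model if ":" in model else f"{model}:latest"
    names = {entry["name"] for entry in tags_response.get("models", [])}
    return wanted in names


# Illustrative data in the shape of an /api/tags response, trimmed to one field.
sample_tags = {"models": [{"name": "llama3:latest"}, {"name": "mistral:latest"}]}

print(is_installed(sample_tags, "llama3"))
print(is_installed(sample_tags, "phi3"))
```

With the server running, is_installed(fetch_tags(), "llama3") performs the same check against the real model list.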
The usage field in the response, showing prompt_tokens, completion_tokens, and total_tokens, is also a direct mimicry of the OpenAI API. While Ollama itself doesn’t charge per token, this field is useful for applications that are designed to track token consumption for budgeting or comparison purposes, even when running locally.
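As a small illustration of tracking consumption, the following sketch accumulates the usage counts across responses. The UsageTotals class is hypothetical, and the sample response reuses the numbers from the example output earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Running totals of token counts across chat-completion responses."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens

    def add(self, response):
        """Add the 'usage' counts of one response; missing fields count as 0."""
        usage = response.get("usage", {})
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)


# Usage block copied from the example response shown earlier.
sample_response = {
    "usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
}

totals = UsageTotals()
totals.add(sample_response)
totals.add(sample_response)
print(totals.prompt_tokens, totals.completion_tokens, totals.total_tokens)
```

Because the field names match OpenAI's, the same accounting code can compare a local run against a cloud run of the same prompts.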
An often overlooked strength of Ollama is that a single instance can serve multiple models and switch between them per request. You don’t need to restart Ollama or reconfigure your application to try a different model. Simply ensure the desired model is downloaded (ollama pull mistral) and update the model field in your API request. Ollama manages the loading and unloading of these models, allocating memory as needed, so how many can stay loaded at once depends on your hardware. This means a single Ollama instance can act as a central LLM gateway for various applications on your machine, each potentially using a different model tailored for a specific task.
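One way to picture the gateway idea is a small routing table that picks a model per task and changes nothing else in the request. The task names, the MODEL_FOR_TASK mapping, and the build_payload helper are hypothetical; llama3 and mistral are the models already mentioned in this article.

```python
# Hypothetical mapping from an application's task names to local model names.
MODEL_FOR_TASK = {
    "chat": "llama3",
    "summarize": "mistral",
}
DEFAULT_MODEL = "llama3"


def build_payload(task, prompt):
    """Build a chat-completions body, choosing the model by task."""
    model = MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


# Only the 'model' field differs between the two request bodies.
chat_payload = build_payload("chat", "What is the capital of France?")
summary_payload = build_payload("summarize", "Summarize: Paris is the capital of France.")
print(chat_payload["model"], summary_payload["model"])
```

Both payloads would be posted to the same /v1/chat/completions URL, which is what lets one Ollama instance back several applications at once.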
The next challenge you’ll likely encounter is managing model performance and resource utilization, particularly when running larger models on hardware with limited VRAM or CPU power.