Ollama doesn’t just run LLMs; it makes them feel like any other local application you’d install, just with billions of parameters under the hood.

Let’s see it in action. First, you’ll need to install Ollama. On macOS, it’s a simple brew install ollama. For Linux, you’ll curl and pipe:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, pulling a model is like docker pull. Let’s grab Llama 3:

ollama pull llama3

This downloads the model weights. You’ll see a progress bar, and it can take a while depending on your connection and the model size. The default llama3 tag is the 8B model, so expect a download of roughly 4.7GB.

Now, to chat with it, you run:

ollama run llama3

This drops you into an interactive session. You can type prompts, and the model will respond.

>>> What is the capital of France?
Paris.

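You can also interact with it programmatically. Ollama serves a REST API on localhost:11434 (an OpenAI-compatible endpoint is available as well, under /v1). Here is a curl call to its native generate endpoint: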

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

This will give you a JSON response:

{
  "model": "llama3",
  "created_at": "2023-05-09T10:30:00.123Z",
  "response": "The sky appears blue due to a phenomenon called Rayleigh scattering. Sunlight, which is composed of all the colors of the rainbow, enters Earth's atmosphere. As it travels through the air, it collides with tiny molecules of gases like nitrogen and oxygen. These molecules scatter the sunlight in all directions. Blue light has shorter wavelengths than red light, so it is scattered more effectively by these small atmospheric particles. This scattered blue light reaches our eyes from all directions, making the sky appear blue.",
  "done": true,
  "context": [ ... ],
  "total_duration": 15000000000,
  "load_duration": 500000000,
  "prompt_eval_count": 10,
  "eval_count": 75,
  "eval_duration": 14500000000
}
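The same request can be made from Python without any third-party packages. This is a minimal sketch using only the standard library; the helper names build_generate_request and generate are made up for illustration, and it assumes the server is listening on the default address used in the curl call.

```python
import json
import urllib.request

# Default local address, taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("llama3", "Why is the sky blue?") should return the text from the response field.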

Ollama’s core value proposition is abstracting away the complexities of running LLMs locally. It handles model downloading, management, and serving them via a standard API. This means you don’t need to wrestle with Python environments, CUDA installations, or specific model inference libraries for every new model you want to try. You just ollama pull and ollama run.

The ollama run command, when used interactively, is essentially a client to the Ollama server that’s running in the background. This server manages the model files on disk and loads them into GPU (or CPU) memory when requested. When you send a prompt, the server preprocesses it, feeds it to the loaded model, and streams the generated tokens back to your terminal. The stream: false in the API call tells it to wait until the entire response is generated before returning. Streaming is the default, so leaving the field out or setting stream: true yields tokens as they are generated, one JSON object per line, which is useful for more interactive applications.
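To make the streaming case concrete, here is a rough sketch of how a client might consume a streamed reply. It assumes the stream arrives as newline-delimited JSON objects, each with a response fragment and a done flag; iter_tokens is a hypothetical helper, and the sample chunks are hand-written for illustration, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON objects."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final object has done: true and carries the timing fields


# Hand-written sample chunks (an assumption about the shape, for illustration only).
sample_stream = [
    b'{"model": "llama3", "response": "The", "done": false}\n',
    b'{"model": "llama3", "response": " sky", "done": false}\n',
    b'{"model": "llama3", "response": "", "done": true}\n',
]
text = "".join(iter_tokens(sample_stream))
```

Against a live server, the response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed to iter_tokens directly.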

The context field in the API response is how the generate endpoint carries conversational memory. It is an encoding of the conversation so far, and the server does not remember it between requests: to continue a conversation, you send the context from the previous response back along with your next prompt. (In an interactive ollama run session, the client does this bookkeeping for you.) How much history fits is bounded by the context window of the specific model you’re using (e.g., Llama 3 8B has a context window of 8192 tokens), and Ollama may be configured to use a smaller window than the model’s maximum.
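A multi-turn exchange over the generate endpoint could be sketched like this. The build_turn helper is hypothetical, and the context values are placeholder numbers rather than real token IDs. Newer versions of the Ollama API documentation appear to steer multi-turn use toward a separate chat endpoint, so treat this as an illustration of the mechanism rather than the recommended approach.

```python
from typing import Any, Dict, Optional


def build_turn(model: str, prompt: str, previous: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a generate request body, carrying over context from the previous reply if any."""
    body: Dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    if previous and previous.get("context"):
        # Send the previous context back so the model sees the earlier turns.
        body["context"] = previous["context"]
    return body


# Placeholder reply; a real context array is much longer.
first_reply = {"response": "Paris.", "context": [1, 2, 3]}
follow_up = build_turn("llama3", "And what is its population?", first_reply)
```

The first request of a conversation simply omits previous, which leaves the context key out of the body.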

A detail often overlooked is how Ollama manages hardware. By default, it uses your GPU if a supported one is available and falls back to the CPU otherwise. If you have multiple GPUs, it can spread the model layers across them when the model does not fit on one. You can influence this with environment variables: OLLAMA_DEBUG=1 turns on verbose logging, and on NVIDIA hardware CUDA_VISIBLE_DEVICES limits which GPUs the server can see. OLLAMA_HOST, by contrast, has nothing to do with hardware; it sets the address the server listens on (127.0.0.1:11434 by default). To force Ollama to use only your CPU (which will be significantly slower), make sure no GPU is visible to the Ollama process, for example by setting CUDA_VISIBLE_DEVICES to an invalid ID such as -1, or, if you’re running via Docker, by not passing any GPU devices to the container.
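If you start the server from a script, the CPU-only setup might look like the sketch below. It assumes an NVIDIA system, where hiding the GPUs is done through CUDA_VISIBLE_DEVICES (AMD setups use different variables), and cpu_only_env is a hypothetical helper name.

```python
import os
from typing import Dict, Mapping, Optional


def cpu_only_env(base: Optional[Mapping[str, str]] = None, debug: bool = False) -> Dict[str, str]:
    """Return a copy of the environment that hides NVIDIA GPUs from the server."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = "-1"  # invalid GPU ID, so no CUDA device is visible
    if debug:
        env["OLLAMA_DEBUG"] = "1"  # verbose server logging
    return env
```

The result can be passed as the env argument of subprocess.Popen(["ollama", "serve"], env=cpu_only_env(debug=True)); the variables must be set for the server process, not for the client that sends prompts.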

The prompt evaluation count (prompt_eval_count) is the number of tokens in your prompt, and the evaluation count (eval_count) is the number of tokens in the generated response. The eval_duration is the time spent actually generating the response, load_duration is the time spent loading the model, and total_duration covers the whole request; all durations are reported in nanoseconds. These metrics are invaluable for understanding the performance characteristics of different models and hardware configurations.
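As a worked example, generation speed can be derived from these fields. The sketch below uses the numbers from the sample response earlier in this article; tokens_per_second is a hypothetical helper name.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(response: dict) -> float:
    """Generation speed: response tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / NANOSECONDS_PER_SECOND)


# Timing fields copied from the sample JSON response.
sample_metrics = {
    "total_duration": 15000000000,
    "load_duration": 500000000,
    "prompt_eval_count": 10,
    "eval_count": 75,
    "eval_duration": 14500000000,
}
rate = tokens_per_second(sample_metrics)
load_seconds = sample_metrics["load_duration"] / NANOSECONDS_PER_SECOND
```

For the sample response this works out to 75 tokens in 14.5 seconds, or roughly 5.2 tokens per second, with half a second spent loading the model.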

Once you’re comfortable with running models locally, the next logical step is to explore fine-tuning existing models or even training your own, which Ollama can then serve.
