You can run sophisticated large language models like Llama, Mistral, and Gemma directly on your own hardware, bypassing the need for cloud APIs and their associated costs or privacy concerns.

Let’s see this in action. Imagine you’ve just installed Ollama. You can pull down a model and start chatting with it in seconds.

ollama pull llama3
ollama run llama3

After ollama pull llama3, you’ll see download progress for the model weights. Once downloaded, ollama run llama3 starts an interactive session. You’ll see a prompt >>> where you can type your questions.

For example:

>>> What is the capital of France?
Paris is the capital of France.

This works because ollama pull downloads the model weights (the learned parameters of the neural network) along with the model’s configuration, and Ollama’s bundled runtime executes those weights on your CPU or GPU. The run command loads the model into memory and presents a simple text interface for interaction.

The core problem Ollama solves is the complexity of setting up and running these large, often multi-gigabyte, models locally. Traditionally, this involved managing dependencies, compiling code, and configuring inference engines. Ollama abstracts all of that away.

Internally, Ollama uses a combination of technologies. The model weights themselves are often in a format like GGUF (GPT-Generated Unified Format), which is optimized for efficient loading and inference. The runtime environment is built using libraries like llama.cpp (for CPU and GPU acceleration) and other components that handle model loading, prompt processing, and generation. Ollama also provides a REST API, allowing other applications to interact with the models programmatically.
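
As a rough illustration of that REST API, here is a minimal Python sketch that uses only the standard library. It assumes Ollama is running on its default local address (http://localhost:11434) and that the llama3 model has already been pulled. The helper names build_generate_request and generate are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address and port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, options=None):
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of streamed chunks
    }
    if options:
        payload["options"] = options  # e.g. {"temperature": 0.8}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, options=None, timeout=120):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt, options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, generate("llama3", "What is the capital of France?") should return the model’s answer as a string. Building the request separately from sending it keeps the payload easy to inspect without a live server.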

When you run ollama pull llama3, you download the llama3 model’s weights and configuration, which Ollama stores locally as a set of layers, much like a container image. When you run ollama run llama3, Ollama loads the weights into RAM (or VRAM if a supported GPU is available) and sets up an inference loop. The prompt you type is tokenized (converted into numbers the model understands) and fed through the model’s layers, which predict output tokens one at a time. Those tokens are then detokenized back into human-readable text.
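
The tokenize, predict, detokenize cycle can be sketched with a toy example. Everything here is a stand-in chosen for illustration: the seven-word vocabulary, the lookup table that plays the role of the neural network, and the function names. A real model uses a subword tokenizer and computes a probability distribution over the whole context.

```python
# Toy vocabulary: real tokenizers have tens of thousands of subword entries.
VOCAB = ["<eos>", "paris", "is", "the", "capital", "of", "france"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}
EOS = TOKEN_ID["<eos>"]

# Stand-in for the model: maps the last token ID to the next token ID.
NEXT_TOKEN = {
    TOKEN_ID["france"]: TOKEN_ID["paris"],
    TOKEN_ID["paris"]: EOS,
}


def tokenize(text):
    """Convert text to token IDs (whitespace split; unknown words are dropped)."""
    return [TOKEN_ID[w] for w in text.lower().split() if w in TOKEN_ID]


def detokenize(ids):
    """Convert token IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)


def toy_generate(prompt, max_new_tokens=8):
    """Greedy loop: predict one token, append it, stop at <eos> or the limit."""
    context = tokenize(prompt)
    output = []
    for _ in range(max_new_tokens):
        last = context[-1] if context else None
        next_id = NEXT_TOKEN.get(last, EOS)
        if next_id == EOS:
            break
        output.append(next_id)
        context.append(next_id)  # the new token becomes part of the context
    return detokenize(output)


print(toy_generate("the capital of france"))  # prints "paris"
```

The important part is the loop shape: each predicted token is appended to the context before the next prediction, which is why generation cost grows with the length of the output.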

The key levers you control are the models you choose to pull and run, and the parameters you set in the interactive session, a Modelfile, or the API. Ollama uses a supported GPU automatically when it detects one, so no flag is needed to enable it. If you have several NVIDIA GPUs and want to restrict Ollama to one of them, set CUDA_VISIBLE_DEVICES when starting the server:

CUDA_VISIBLE_DEVICES=0 ollama serve

You can also adjust generation parameters like temperature, top-p, and the number of tokens to generate, which significantly impact the output’s creativity and coherence. The run command has no flags for these; inside an interactive session, use /set parameter instead:

>>> /set parameter temperature 0.8
>>> /set parameter top_p 0.9
>>> /set parameter num_predict 256
>>> Explain quantum entanglement in simple terms
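
To make the effect of temperature and top-p more concrete, here is a minimal sketch of how the two are commonly combined when picking the next token. It is a generic illustration, not Ollama’s or llama.cpp’s actual sampling code, and the function name sample_token is hypothetical.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Pick a token index from raw scores using temperature and top-p sampling.

    Assumes temperature > 0 and a non-empty list of logits.
    """
    # Temperature rescales the scores: below 1 sharpens the distribution,
    # above 1 flattens it.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p keeps the smallest set of most likely tokens whose
    # probabilities add up to at least top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

A low temperature with a small top-p makes the most likely token win almost every time, while higher values let less likely tokens through, which is where the extra variety in the output comes from.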

The llama.cpp backend, which is central to Ollama’s performance, supports a wide range of quantization levels for models. Quantization reduces the precision of the model’s weights (e.g., from 16-bit floating point to 4-bit integers), drastically shrinking the model size and memory footprint and making it feasible to run larger models on less powerful hardware. The specific quantization method (e.g., Q4_K_M, Q5_K_S) affects both inference speed and the quality of the generated text: higher-bit quantizations generally preserve more fidelity at the cost of larger files and somewhat slower inference.
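
A back-of-the-envelope calculation shows why this matters. The sketch below assumes a model with 8 billion parameters and uses nominal bit widths. Real GGUF quantizations store extra scaling data, so actual files are somewhat larger than these estimates, and the function name is made up for the example.

```python
def approx_weight_size_gb(num_parameters, bits_per_weight):
    """Rough size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Assumption: 8 billion parameters, nominal bit widths, no format overhead.
for label, bits in [("16-bit float", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{approx_weight_size_gb(8e9, bits):.0f} GB")
```

Under these assumptions the weights shrink from roughly 16 GB to roughly 4 GB, which is the difference between needing a high-end GPU and fitting in the memory of an ordinary laptop.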

Beyond just running models, Ollama also manages them for you, allowing you to list, inspect, and remove downloaded models with simple commands.

ollama list
ollama show llama3
ollama rm llama3

The next step is often exploring how to integrate these locally running models into your own applications using Ollama’s API.
