Ollama is a tool that lets you run large language models (LLMs) locally on your own machine, and Docker is a way to package and run applications in isolated containers. Together, they make it surprisingly easy to get powerful AI models running without needing a cloud account or complex setup.
Here’s a glimpse of Ollama running a model, specifically the llama3 model, and then interacting with it:
# Pull the llama3 model
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Wait a moment for the container to start, then:
docker exec -it ollama ollama run llama3
>>> What is the capital of France?
Paris.
>>> Tell me a short story about a space-faring cat.
This setup uses Docker to run the Ollama service. The -d flag runs it in detached mode (in the background). --gpus=all is crucial if you have a compatible GPU and want to use it for much faster inference; without it, Ollama will default to your CPU, which is significantly slower for LLMs. The -v ollama:/root/.ollama part creates a Docker volume named ollama and mounts it inside the container at /root/.ollama. This is where Ollama stores its models, so they persist even if you stop and remove the Docker container. -p 11434:11434 maps port 11434 on your host machine to the same port inside the container, which is the default API port for Ollama. --name ollama gives the container a recognizable name.
Once Ollama is running, docker exec -it ollama ollama run llama3 allows you to interact with the llama3 model directly from your terminal. The ollama run command will download the model if it’s not already present.
The real magic of Ollama is how it abstracts away the complexities of LLM deployment. Typically, you’d need to download massive model files (often tens or hundreds of gigabytes), manage dependencies like specific Python libraries and CUDA versions, and write boilerplate code to load and query the model. Ollama handles all of this. When you run ollama run llama3, it downloads the llama3 model, sets up the necessary environment, and makes it available via a simple API (and the command-line interface you just used).
The core problem Ollama solves is making LLMs accessible. It democratizes AI by lowering the barrier to entry. You don’t need a Ph.D. in machine learning or a massive budget for cloud GPUs to experiment with and integrate state-of-the-art models.
Internally, Ollama is a Go application that acts as a server. It manages the lifecycle of various LLM runtimes (like llama.cpp for CPU and GPU inference) and provides a REST API. When you send a request to Ollama, it determines which model you’re using, loads it into memory (either on your CPU or GPU), processes your prompt, and returns the generated text. The Docker image simply packages this Ollama server and its dependencies.
The configuration is minimal. Most of it is handled by the Docker command. However, you can customize things like the port mapping or volume mounts as shown. For more advanced configurations, like setting up Ollama to use a specific GPU device or configuring GPU memory allocation, you might need to delve into Ollama’s configuration files, which are typically located within the mounted volume (/root/.ollama inside the container).
When you request a model with ollama run <model_name>, Ollama checks its local storage (the volume). If the model isn’t there, it downloads it from its registry. The models are stored in a specific format optimized for fast loading and inference. Ollama also manages different versions of models, allowing you to switch between them.
The Ollama API is a key component. It exposes endpoints for managing models (pulling, listing, removing) and for generating text. This makes it easy to integrate Ollama into other applications. For example, you could build a chatbot that uses Ollama as its backend LLM.
The most surprising thing is how many different model architectures Ollama supports out-of-the-box with minimal fuss. You can go from llama3 to mistral to phi3 with just a docker exec -it ollama ollama pull <new_model> and then docker exec -it ollama ollama run <new_model>, and the underlying inference engine handles the differences seamlessly. It’s not just about running one type of model; it’s a general-purpose LLM runner.
The next step is exploring how to connect other applications to the Ollama API.