Ollama CLI is more than just a way to download and run LLMs; it’s a surprisingly powerful tool for managing your local AI experiments.
Here’s Ollama in action, serving a model locally:
ollama serve &
Now, let’s interact with it using curl to see a model in action, just like the API would:
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
This sends a request to the llama2 model running on your local Ollama server and gets a complete, non-streaming response.
Core Concepts & Commands
Ollama CLI is built around a few key entities: models, tags (versions of models), and the server itself.
1. Listing Available Models:
To see what you’ve got downloaded and ready to go:
ollama list
Output might look like:
NAME ID SIZE MODIFIED
llama2:latest f7601789a23b 3.8 GB 2 hours ago
mistral:latest b67a6741452a 4.1 GB 1 day ago
This shows the model name, its unique ID (useful for some advanced operations), its disk usage, and when it was last updated.
2. Downloading Models:
The primary way to get models is ollama pull:
ollama pull llama3
This downloads the llama3 model with its latest tag. You can specify versions:
ollama pull mistral:7b
This gets the specific 7b parameter version of Mistral. The download progress will be displayed, showing chunks being pulled and the total size.
3. Running Models (Interactive Chat):
The most common use is interactive chat:
ollama run llama3
This drops you into an interactive prompt where you can chat with the llama3 model. Type your message, hit enter, and the model responds. Type /bye to exit.
4. Running Models (Single Prompt):
For quick, non-interactive responses:
ollama run llama3 "What is the capital of France?"
This will print the answer and immediately exit the prompt.
5. Creating and Managing Your Own Models (Modelfiles):
This is where Ollama’s power for customization shines. You define models using a Modelfile.
Example Modelfile for a simple RAG (Retrieval Augmented Generation) setup:
FROM ./my_embeddings_model
SYSTEM "You are a helpful assistant. Answer questions based on the provided context."
PARAMETER temperature 0.7
PARAMETER top_k 50
EMBED "This is a document about AI. It discusses the history and future of artificial intelligence."
EMBED "Another document about machine learning, a subset of AI."
To create a model from this:
ollama create my-rag-model -f ./Modelfile
This builds a new model named my-rag-model using the specified embedding model and injecting the provided text as context. The SYSTEM instruction primes the model’s behavior, and PARAMETER allows tuning generation settings.
6. Pushing Models to Ollama Hub (Self-Hosted):
If you’ve created custom models and want to share them within your organization or with the public, you can push them:
ollama push my-custom-model
This requires you to be logged in (ollama login) and assumes you have a model tagged appropriately.
7. Removing Models:
Free up disk space:
ollama rm llama2
This removes the llama2 model and all its associated tags.
8. Serving Models Locally:
To make your models available via API for other applications:
ollama serve
This starts the Ollama server in the foreground. For background operation, use ollama serve &. The API is typically served on http://localhost:11434.
9. Ollama Serve Configuration:
You can configure the server, for example, to change the host and port:
OLLAMA_HOST=0.0.0.0 OLLAMA_PORT=8080 ollama serve
This makes the API accessible from any IP address on your network on port 8080.
The most unintuitive aspect of Ollama’s serve command is that it doesn’t just serve the latest models by default. If you have multiple versions of a model, e.g., llama2:7b and llama2:13b, and you simply run ollama run llama2, Ollama will pick the one it deems "latest" based on its internal tagging and download order, not necessarily the largest or most recent in terms of file modification. To guarantee a specific version, always use the tag: ollama run llama2:7b.
With these commands, you can effectively manage your local LLM environment, from downloading and running models to creating and serving your own custom AI agents.
The next step is often integrating these served models into your applications using their API.