The Ollama CLI does more than download and run LLMs; it also manages the models, custom variants, and local API server behind your AI experiments.

Here’s Ollama in action, serving a model locally:

ollama serve &

Now let’s query it with curl, the same way an application would call the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

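This sends a request to the llama2 model on your local Ollama server, so llama2 must already be downloaded. Because "stream" is false, the server returns one complete JSON object instead of a stream of partial chunks; the generated text is in its response field.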
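The same request can be made from application code. The following is a minimal sketch using only the Python standard library, assuming the server is at the default address from the curl example and that llama2 has already been pulled. The helper names (build_generate_request, extract_text, generate) are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

# Default address used by the curl example above.
BASE_URL = "http://localhost:11434"


def build_generate_request(model, prompt, base_url=BASE_URL):
    """Build the same POST request that the curl command sends."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(body):
    """Return the generated text from a non-streaming JSON reply."""
    return json.loads(body)["response"]


def generate(model, prompt, base_url=BASE_URL):
    """Send the request to a running server (requires `ollama serve`)."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request) as reply:
        return extract_text(reply.read())
```

With the server running, print(generate("llama2", "Why is the sky blue?")) should print the model’s answer.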

Core Concepts & Commands

The Ollama CLI is built around a few key entities: models, tags (named variants of a model, such as a parameter size), and the server itself.

1. Listing Available Models:

To see what you’ve got downloaded and ready to go:

ollama list

Output might look like:

NAME            ID              SIZE      MODIFIED
llama2:latest   f7601789a23b    3.8 GB    2 hours ago
mistral:latest  b67a6741452a    4.1 GB    1 day ago

This shows the model name, its unique ID (useful for some advanced operations), its disk usage, and when it was last updated.
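If a script needs this information, the table can be parsed. The sketch below assumes the four-column layout shown above, with columns separated by a tab or by at least two whitespace characters. That layout is meant for human readers and is not a guaranteed interface, so treat parse_model_list and installed_models as hypothetical helpers.

```python
import re
import subprocess

# Sample output copied from the listing above.
SAMPLE = """\
NAME            ID              SIZE    MODIFIED
llama2:latest   f7601789a23b    3.8 GB  2 hours ago
mistral:latest  b67a6741452a    4.1 GB  1 day ago
"""

# Columns are separated by two or more whitespace characters or by a tab,
# so values such as "3.8 GB" and "2 hours ago" stay in one piece.
COLUMN_GAP = re.compile(r"\s{2,}|\t")


def parse_model_list(text):
    """Turn `ollama list` output into a list of dicts keyed by column name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = [name.lower() for name in COLUMN_GAP.split(lines[0])]
    return [dict(zip(header, COLUMN_GAP.split(line))) for line in lines[1:]]


def installed_models():
    """Run `ollama list` and parse its output (requires Ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return parse_model_list(result.stdout)
```

For example, [row["name"] for row in parse_model_list(SAMPLE)] yields the two model names from the sample.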

2. Downloading Models:

The primary way to get models is ollama pull:

ollama pull llama3

Without a tag, this downloads llama3:latest. To get a specific variant, add a tag:

ollama pull mistral:7b

This gets the 7-billion-parameter variant of Mistral. Progress is displayed as each layer downloads, along with the total size.

3. Running Models (Interactive Chat):

The most common use is interactive chat:

ollama run llama3

This drops you into an interactive prompt where you can chat with the llama3 model. Type your message, hit enter, and the model responds. Type /bye to exit.

4. Running Models (Single Prompt):

For quick, non-interactive responses:

ollama run llama3 "What is the capital of France?"

This prints the answer and exits without opening the interactive prompt.
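Single-prompt mode is convenient for scripting. Here is a small sketch that calls the CLI from Python, assuming ollama is on your PATH and llama3 is already pulled. The function names are made up for illustration. Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or spaces.

```python
import subprocess


def build_run_command(model, prompt):
    """Return the argument list for a one-shot `ollama run` call."""
    return ["ollama", "run", model, prompt]


def ask(model, prompt):
    """Run one prompt through the CLI and return the printed answer."""
    result = subprocess.run(
        build_run_command(model, prompt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI reports a failure
    )
    return result.stdout.strip()
```

A call such as ask("llama3", "What is the capital of France?") returns the model’s answer as a string.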

5. Creating and Managing Your Own Models (Modelfiles):

This is where Ollama’s power for customization shines. You define models using a Modelfile.

Example Modelfile for an assistant that answers from context supplied in the prompt:

FROM llama3
SYSTEM "You are a helpful assistant. Answer questions based on the provided context."
PARAMETER temperature 0.7
PARAMETER top_k 50

To create a model from this:

ollama create my-assistant -f ./Modelfile

This builds a new model named my-assistant on top of llama3. The SYSTEM instruction primes the model’s behavior, and PARAMETER tunes generation settings. A Modelfile does not store documents or perform retrieval: the EMBED instruction is not part of the currently documented Modelfile format. For RAG (Retrieval Augmented Generation), retrieve the relevant text in your application and include it in the prompt.
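When you maintain several variants, it can help to generate Modelfiles instead of editing them by hand. The sketch below renders one from the FROM, SYSTEM, and PARAMETER instructions. The render_modelfile helper and the choice of llama3 as the base model are assumptions for illustration.

```python
def render_modelfile(base, system, parameters):
    """Render Modelfile text from a base model, a system prompt, and parameters.

    The system prompt is wrapped in triple quotes so it may span several
    lines; this sketch assumes the prompt itself contains no triple quotes.
    """
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


# Hypothetical example values.
MODELFILE_TEXT = render_modelfile(
    "llama3",
    "You are a helpful assistant. Answer questions based on the provided context.",
    {"temperature": 0.7, "top_k": 50},
)
```

Write MODELFILE_TEXT to a file named Modelfile, then pass that file to ollama create with -f.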

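6. Pushing Models to the Ollama Registry:

If you’ve created custom models and want to share them, you can push them to ollama.com. The model name must include your account namespace, so copy the model under that name first (yourname is a placeholder):

ollama cp my-custom-model yourname/my-custom-model
ollama push yourname/my-custom-model

This requires an ollama.com account that your local Ollama installation is authorized to push to, which is done by adding the installation’s public key to your account settings.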

7. Removing Models:

Free up disk space:

ollama rm llama2

This removes llama2:latest only, because an untagged name refers to the latest tag. Other tags, such as llama2:13b, must be removed by name.

8. Serving Models Locally:

To make your models available via API for other applications:

ollama serve

This starts the Ollama server in the foreground. For background operation, use ollama serve &. By default, the API listens on http://localhost:11434.
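Before sending requests, an application may want to confirm that the server is reachable. A minimal sketch, assuming the default address: a plain GET on the root URL of a running server returns HTTP 200. The is_server_up name is illustrative.

```python
import http.client
import urllib.request


def is_server_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as reply:
            return reply.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, HTTP errors, and malformed replies
        # are all treated as "not running".
        return False
```

This only checks that something answers on that port; it does not verify which models are available.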

9. Ollama Serve Configuration:

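You can configure the server through environment variables. For example, OLLAMA_HOST sets both the bind address and the port (there is no separate OLLAMA_PORT variable):

OLLAMA_HOST=0.0.0.0:8080 ollama serve

This makes the API accessible on port 8080 from every network interface on the machine. The API has no built-in authentication, so only do this on a trusted network.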
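Client code then needs the matching base URL. The sketch below derives one from an OLLAMA_HOST-style value such as host:port. It is a simplified, hypothetical helper: it always falls back to port 11434 and ignores IPv6 literals, so Ollama’s own parsing may differ in edge cases.

```python
import os
from urllib.parse import urlsplit

DEFAULT_HOST = "127.0.0.1:11434"


def client_base_url(setting=None):
    """Build a base URL from an OLLAMA_HOST-style value.

    With no argument, the value is read from the OLLAMA_HOST environment
    variable. A missing scheme defaults to http, a missing port to 11434.
    """
    if setting is None:
        setting = os.environ.get("OLLAMA_HOST", "")
    value = setting.strip() or DEFAULT_HOST
    if "://" not in value:
        value = "http://" + value
    parts = urlsplit(value)
    port = parts.port or 11434
    return f"{parts.scheme}://{parts.hostname}:{port}"
```

Note that 0.0.0.0 is a bind address for the server. A client on another machine should use the server’s real hostname or IP address instead.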

The most unintuitive aspect of Ollama’s tag handling is that an untagged name does not mean "whichever version you have." ollama run llama2 is shorthand for ollama run llama2:latest. If you have only llama2:7b and llama2:13b downloaded, Ollama does not choose between them; it looks for llama2:latest and pulls it if it is missing. The same applies to the model field in API requests, and ollama serve itself loads models on demand when a request names them. To guarantee a specific version, always use the tag: ollama run llama2:7b.
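The rule that a bare name means the latest tag can be sketched in a few lines. This is an illustration of the naming rule only, assuming simple name or name:tag references without a registry host or namespace. qualify and is_installed are hypothetical helpers.

```python
def qualify(ref):
    """Return ref with an explicit tag, defaulting to ':latest'."""
    name, sep, tag = ref.partition(":")
    if sep and tag:
        return f"{name}:{tag}"
    return f"{name}:latest"


def is_installed(ref, installed_names):
    """Check whether the fully qualified ref is among the local model names."""
    return qualify(ref) in installed_names


# Only the 7b and 13b tags are present locally in this made-up example.
LOCAL = ["llama2:7b", "llama2:13b"]
```

Here is_installed("llama2", LOCAL) is False, because the bare name refers to llama2:latest, while is_installed("llama2:7b", LOCAL) is True.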

With these commands, you can manage your local LLM environment end to end, from downloading and running models to creating and serving your own custom models.

The next step is often integrating these served models into your applications using their API.
