You can serve multiple Ollama models on the same machine simultaneously, and it’s much less of a resource hog than you’d think, because Ollama loads a model into memory only when a request needs it and unloads it again after it has been idle for a while.

Let’s watch it in action.

First, make sure you have a couple of models downloaded. We’ll use llama3 and mistral for this example.

ollama pull llama3
ollama pull mistral
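
Before going further, it can help to confirm that both pulls worked. Here is a minimal sketch, assuming that ollama list prints a header row followed by one row per model with a name:tag value in the first column. The has_model function is a hypothetical helper written for this article, not part of Ollama.

```shell
# has_model is a hypothetical helper (not part of Ollama).
# It reads `ollama list` output on stdin and succeeds if a model with the
# given base name (the part before ":") appears in the first column.
# Assumption: the first line is a header and names look like "llama3:latest".
has_model() {
  awk -v want="$1" '
    NR > 1 { split($1, parts, ":"); if (parts[1] == want) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
#   ollama list | has_model llama3  && echo "llama3 is available"
#   ollama list | has_model mistral && echo "mistral is available"
```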

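Now, the magic happens when you start serving them. A single ollama serve process can answer requests for any model you’ve pulled, and it picks the model from the "model" field of each request. For this walkthrough we’ll run two server instances on different ports so that each model gets its own endpoint. The ollama serve command doesn’t take host or port flags; it reads the bind address and port from the OLLAMA_HOST environment variable, so set that to a different port for each instance.

OLLAMA_HOST=0.0.0.0:11434 ollama serve &
OLLAMA_HOST=0.0.0.0:11435 ollama serve &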

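The & at the end of each command sends the process to the background, so you can continue using your terminal. Each server logs the address it is listening on as it starts. If the first command fails because the address is already in use, Ollama is probably already running as a background service on the default port 11434, and you can use that instance instead. Binding to 0.0.0.0 exposes the API on every network interface; use 127.0.0.1 if you only need local access.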
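
To check that both servers are actually accepting connections, you can poll each port. The sketch below assumes the servers are reachable on localhost and uses the /api/tags endpoint, which lists the locally available models. The base_url and server_is_up functions are hypothetical helpers, not Ollama commands.

```shell
# base_url is a hypothetical helper: base URL of a local server on a port.
base_url() {
  printf 'http://localhost:%s' "$1"
}

# server_is_up succeeds if the server on the given port answers
# GET /api/tags within five seconds.
server_is_up() {
  curl --silent --fail --max-time 5 "$(base_url "$1")/api/tags" > /dev/null
}

# Usage: report the status of both servers from this walkthrough.
#   for port in 11434 11435; do
#     if server_is_up "$port"; then echo "port $port: up"; else echo "port $port: down"; fi
#   done
```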

Now, you can interact with these models independently using their respective ports. Let’s send a request to llama3 on port 11434 and to mistral on port 11435 using curl.

Request to Llama 3 (Port 11434):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Request to Mistral (Port 11435):

curl http://localhost:11435/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a short poem about a cat.",
  "stream": false
}'

You’ll get separate JSON responses, with the generated text in the "response" field of each. The port decides which server instance receives the request, and the "model" field decides which model that instance loads and runs.
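
The two curl calls above run one after the other. To exercise both servers at the same time, you can send the requests in the background and wait for both to finish. This is only a sketch: generate_body and ask are hypothetical helpers, and the output file names are placeholders.

```shell
# generate_body builds a non-streaming /api/generate request body.
# Simplification: MODEL and PROMPT must not contain double quotes or
# backslashes, because this sketch does no JSON escaping.
generate_body() {  # generate_body MODEL PROMPT
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# ask sends one prompt to the server on PORT and prints the raw JSON reply.
ask() {  # ask PORT MODEL PROMPT
  curl --silent "http://localhost:$1/api/generate" -d "$(generate_body "$2" "$3")"
}

# Usage: start both requests, then block until both have completed.
#   ask 11434 llama3 "Why is the sky blue?" > llama3.json &
#   ask 11435 mistral "Write a short poem about a cat." > mistral.json &
#   wait
```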

The core problem Ollama solves here is simplifying the deployment and management of multiple LLMs. Traditionally, running multiple models meant complex configurations, separate environments, and often duplicated resource allocation. Ollama’s design abstracts this away. Each ollama serve instance is an independent API endpoint that can serve any model in your local model store. When you send a request to localhost:11434 with "model": "llama3", that instance loads llama3 if it isn’t already in memory and runs the prompt. Similarly, a request to localhost:11435 with "model": "mistral" is handled by the second instance. Nothing ties a model to a port except the convention you follow in your requests.

Internally, each Ollama server manages GPU memory (VRAM) for the models it loads. A model is loaded into VRAM the first time a request names it and stays there until it has been idle for a keep-alive period, after which the server unloads it and frees the memory. Both instances in this setup draw on the same GPU, so the loaded models have to fit in VRAM together; when they don’t, Ollama runs part of a model on the CPU, which is noticeably slower. This isn’t true "model parallelism" where a single model is split across GPUs, but rather "model serving" where multiple independent models are loaded and ready to respond. The key levers you control are the port each server listens on, the model name you specify in your API requests, and how long an idle model stays loaded.
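
To see which models a server currently holds in memory, you can ask it directly. The sketch below assumes an Ollama version that provides the /api/ps endpoint and the matching ollama ps command. The ps_url and loaded_models functions are hypothetical helpers.

```shell
# ps_url is a hypothetical helper: URL of the "running models" endpoint
# for the server listening on the given local port.
ps_url() {
  printf 'http://localhost:%s/api/ps' "$1"
}

# loaded_models prints the JSON list of models currently loaded by that server.
loaded_models() {
  curl --silent --fail "$(ps_url "$1")"
}

# Usage:
#   loaded_models 11434
#   loaded_models 11435
# The CLI equivalent reads the target server from OLLAMA_HOST:
#   OLLAMA_HOST=127.0.0.1:11435 ollama ps
```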

What’s often overlooked is how Ollama handles model loading and unloading. The ollama serve command starts a persistent server process, but it doesn’t pin any model in memory. By default, a model is unloaded five minutes after its last request. You can change that per request with the keep_alive field or for a whole server with the OLLAMA_KEEP_ALIVE environment variable. If you stop a serve process, the VRAM held by its models is released right away. This automatic resource management is crucial for keeping VRAM usage under control when you have many models you might want to access but don’t need all of them loaded at all times.
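
If you’d rather control unloading yourself, a request can carry a keep_alive value. The sketch below assumes the documented behavior of that field: 0 unloads the model immediately, a negative number keeps it loaded indefinitely, and a positive number is a duration in seconds. The keep_alive_body function is a hypothetical helper.

```shell
# keep_alive_body is a hypothetical helper that builds a /api/generate body
# with no prompt. Such a request produces no text; it only loads the model
# or changes how long it stays loaded.
keep_alive_body() {  # keep_alive_body MODEL SECONDS
  printf '{"model": "%s", "keep_alive": %s}' "$1" "$2"
}

# Usage:
#   # Unload llama3 from the first server right now.
#   curl http://localhost:11434/api/generate -d "$(keep_alive_body llama3 0)"
#   # Keep mistral loaded on the second server until it is explicitly unloaded.
#   curl http://localhost:11435/api/generate -d "$(keep_alive_body mistral -1)"
```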

The next step is to explore how to manage these concurrent servers more robustly, perhaps with a tool that can automatically restart them if they crash or manage their lifecycle based on demand.
