The most surprising thing about serving LLMs with Ollama on Kubernetes is how poorly its single-machine assumptions fit the very infrastructure designed to manage distributed workloads.

Let’s see it in action. Imagine you’ve got a Kubernetes cluster and you want to serve a model like Llama 3. You’d typically deploy Ollama as a Deployment with a Service to expose it.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1 # One GPU per pod; GPUs are not shared between pods
          requests:
            nvidia.com/gpu: 1 # For extended resources, the request must equal the limit
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP

You’d also need a PersistentVolumeClaim to store your models persistently.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
spec:
  accessModes:
    - ReadWriteOnce # Or ReadWriteMany if your storage supports it and you need it across nodes
  resources:
    requests:
      storage: 100Gi # Adjust based on your model sizes

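Once deployed, you can kubectl exec into a pod and run ollama pull llama3 to download the model. To test it from inside the pod, send a POST request to http://localhost:11434/api/generate with a JSON body that names the model and a prompt, for example with curl if it is available in the image; a bare GET to that endpoint will not generate anything. Exposing this externally would involve an Ingress or a LoadBalancer service.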
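As a rough sketch of the LoadBalancer option, the manifest below adds a second Service in front of the same pods. The name ollama-external and the external port 80 are assumptions made for illustration, and whether an external address is actually provisioned depends on your cluster.

```yaml
# Sketch only: "ollama-external" and port 80 are hypothetical choices.
apiVersion: v1
kind: Service
metadata:
  name: ollama-external
spec:
  type: LoadBalancer
  selector:
    app: ollama          # same label the Deployment's pods carry
  ports:
    - protocol: TCP
      port: 80           # port exposed by the load balancer (assumption)
      targetPort: 11434  # Ollama's container port
```

Ollama’s API does not authenticate requests on its own, so anything exposed this way should sit behind network restrictions or an authenticating proxy.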

The core problem Ollama presents in Kubernetes is its assumption of a single, persistent, and locally accessible model directory (/root/.ollama). When you scale Ollama pods or when pods reschedule to different nodes, each instance needs its own copy of the models, or a shared, performant, and accessible storage solution. A Kubernetes Deployment treats its pods as interchangeable and disposable, which clashes with this stateful model management.

Here’s how Ollama actually works internally and why it’s tricky:

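  1. Model Loading: Ollama keeps its model files under /root/.ollama. The server does not load weights when it starts. It loads a model into memory (GPU memory when a GPU is available) on the first request that names it, and unloads it again after an idle period (five minutes by default). For large models that first load is slow and dominated by reading the weights from storage.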
  2. Inference Engine: It uses a custom inference engine (often based on llama.cpp or similar) to process requests. Each running Ollama instance manages its own inference threads and GPU resources.
  3. API Server: A web server (typically Go’s net/http) handles incoming API requests, queues them, and passes them to the inference engine.
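Ollama reads a handful of environment variables at startup, and two of them bear directly on the model directory and on how long a loaded model stays in memory. The fragment below is a minimal sketch for the container in the Deployment above; the 30m value is illustrative, not a recommendation.

```yaml
# Fragment of spec.template.spec in the Deployment; values are illustrative.
containers:
- name: ollama
  image: ollama/ollama:latest
  env:
  - name: OLLAMA_MODELS
    value: /root/.ollama/models  # where model files are stored (the default when running as root)
  - name: OLLAMA_KEEP_ALIVE
    value: "30m"                 # how long an idle model stays loaded (assumed value)
```

A longer keep-alive trades GPU memory held by idle models for fewer slow reloads.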

The real levers you control are:

  • GPU Allocation: Ensuring your Kubernetes nodes have GPUs and that your Ollama pods request them correctly using nvidia.com/gpu: 1 (or the appropriate vendor/count).
  • Storage Strategy: This is paramount. You need a way for all your Ollama pods to access the same set of models, or for each pod to have its own, efficiently managed copy.
  • Resource Limits/Requests: Setting appropriate CPU, memory, and GPU requests and limits to prevent noisy neighbor issues and ensure stable performance.
  • Scaling: Deciding how many Ollama replicas you need based on expected load, and configuring Horizontal Pod Autoscalers if necessary, though autoscaling based on GPU utilization is complex.
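The resource lever from the list above might look like the following fragment. The CPU and memory figures are placeholders chosen for illustration; the right numbers depend on the model and should come from measurement.

```yaml
# Fragment of the ollama container spec; cpu and memory values are assumptions.
resources:
  requests:
    cpu: "4"            # placeholder; used by the scheduler for placement
    memory: 16Gi        # placeholder; size to the model plus overhead
    nvidia.com/gpu: 1
  limits:
    memory: 16Gi        # equal to the request so the pod is not overcommitted on memory
    nvidia.com/gpu: 1   # must equal the GPU request
```

Setting a CPU request also matters if you later add a Horizontal Pod Autoscaler with a CPU utilization target, since utilization is computed relative to the request.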

The most common pitfall is a PersistentVolumeClaim that isn’t shared across nodes or isn’t fast enough. ReadWriteOnce allows a volume to be mounted read-write by a single node at a time, so with a typical cloud provider StorageClass that provisions a block disk, the two replicas above can only both start if they land on the same node. A replica scheduled on a different node does not come up with an empty model directory; it typically cannot attach the volume at all and stays stuck in ContainerCreating. The fix is to use a StorageClass that supports ReadWriteMany (like NFS, CephFS, or certain cloud provider file shares) and ensure your PersistentVolumeClaim requests that access mode. Alternatively, you can use an init container to copy models from a shared read-only volume to each pod’s local ephemeral storage, but this adds a full copy of the model files to every pod startup and requires significant local disk.
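A ReadWriteMany version of the claim could look like this. The storageClassName shared-files is a hypothetical name; substitute whichever class in your cluster is backed by NFS, CephFS, or a cloud file share.

```yaml
# Sketch: "shared-files" is a hypothetical StorageClass that supports ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
spec:
  accessModes:
    - ReadWriteMany              # lets pods on different nodes mount the same volume
  storageClassName: shared-files
  resources:
    requests:
      storage: 100Gi             # Adjust based on your model sizes
```

Access modes cannot be edited on an existing claim, so moving from ReadWriteOnce generally means creating a new claim and copying or re-pulling the models.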

Another critical aspect is model management. Ollama’s CLI commands (ollama pull, ollama list) operate inside a specific pod. If you pull a model on one replica, the other replicas won’t automatically have it unless you’ve solved the shared storage problem. A common pattern is to have a single "model management" pod that pulls models and then makes them available via shared storage, or to use a custom entrypoint script that checks for model existence and pulls them if missing, but this leads to race conditions and duplicate downloads if not carefully orchestrated.
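One way to sketch the "model management" pattern is a one-off Job that mounts the shared volume and pulls the model once. This assumes the claim can be mounted alongside the serving pods (for example, ReadWriteMany storage); the Job name and the fixed five-second wait are illustrative shortcuts, not a hardened implementation.

```yaml
# Sketch only: the Job name and the crude "sleep 5" startup wait are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-llama3
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pull
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            # ollama pull talks to a running server, so start one in the background
            ollama serve &
            sleep 5
            ollama pull llama3
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama   # same path the serving pods use
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
```

Because only this Job writes to the volume, the serving replicas never race each other to download the same model.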

The true challenge is achieving high availability and efficient scaling. If one Ollama pod dies, its models are gone unless they are on persistent storage. If you scale up, each new pod needs to load the models, which can take minutes and saturate your storage or network. This is why many production setups bypass Ollama’s internal model management and instead use a system where models are pre-loaded onto nodes or served via a dedicated model repository accessible by inference servers that don’t manage model downloads themselves.

If you’ve got your Ollama pods running and serving models, the next hurdle you’ll likely face is managing the stateful nature of model updates across your distributed fleet.
