Ollama Prometheus Metrics: Monitor LLM Serving (2026)

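Ollama’s Prometheus metrics are surprisingly stateless: the endpoint reports only the current counter, gauge, and histogram values and keeps no history of its own. Historical trends come from Prometheus, which stores each scrape as a time series.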

Let’s see what Ollama is spitting out. First, make sure you have Ollama running and have enabled Prometheus metrics. This is usually done by setting OLLAMA_HOST=0.0.0.0:11434 (or your desired IP/port) and OLLAMA_ENABLE_METRICS=true in your environment or Ollama’s config file. Then, you can curl the metrics endpoint:

curl http://localhost:11434/metrics

You’ll see output like this:

# HELP ollama_build_info Ollama build information
# TYPE ollama_build_info gauge
ollama_build_info{version="0.1.32"} 1
# HELP ollama_http_requests_total Total number of HTTP requests received by Ollama
# TYPE ollama_http_requests_total counter
ollama_http_requests_total{method="POST",path="/api/generate",status="200"} 15
ollama_http_requests_total{method="POST",path="/api/generate",status="500"} 2
# HELP ollama_http_request_duration_seconds Duration of HTTP requests
# TYPE ollama_http_request_duration_seconds histogram
ollama_http_request_duration_seconds_bucket{method="POST",path="/api/generate",status="200",le="0.1"} 5
ollama_http_request_duration_seconds_bucket{method="POST",path="/api/generate",status="200",le="0.2"} 10
...
# HELP ollama_model_inference_duration_seconds Duration of model inference
# TYPE ollama_model_inference_duration_seconds histogram
ollama_model_inference_duration_seconds_bucket{model="llama2",le="1"} 3
ollama_model_inference_duration_seconds_bucket{model="llama2",le="2"} 7
...
# HELP ollama_model_load_duration_seconds Duration of model loading
# TYPE ollama_model_load_duration_seconds histogram
ollama_model_load_duration_seconds_sum{model="llama2"} 5.234
# HELP ollama_model_load_status Status of model loading
# TYPE ollama_model_load_status gauge
ollama_model_load_status{model="llama2",status="loaded"} 1
ollama_model_load_status{model="llama2",status="loading"} 0
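If you want to inspect this output in a script rather than by eye, the text format is simple enough to parse by hand. The following is a minimal sketch, not a full implementation of the exposition format: parse_metrics is a hypothetical helper name, and it deliberately skips comment lines, the elided "..." lines, and any sample that carries a trailing timestamp.

```python
import re

# Matches lines such as: name{label="value",...} 12.5  (the label set is optional).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            # Not a plain "name{labels} value" sample; ignored in this sketch.
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


example = '''
# TYPE ollama_http_requests_total counter
ollama_http_requests_total{method="POST",path="/api/generate",status="200"} 15
ollama_http_requests_total{method="POST",path="/api/generate",status="500"} 2
...
'''

for name, labels, value in parse_metrics(example):
    print(name, labels["status"], value)
```

For anything beyond a quick check, an established Prometheus client library is a safer choice than a hand-written parser.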

The core problem Ollama’s metrics solve is understanding the health and performance of your LLM serving layer. Without them, you’re flying blind. You don’t know if requests are timing out, if models are taking too long to load, or which models are actually being used. These metrics give you visibility.

Internally, Ollama exposes these metrics via an HTTP endpoint, typically /metrics. Prometheus, a popular monitoring system, scrapes this endpoint at regular intervals and stores each value as a time series. For example, ollama_http_requests_total is a counter that increments with every request. Because Prometheus keeps the successive values, you can calculate rates (requests per second) and understand traffic volume over time. ollama_http_request_duration_seconds is a histogram, which describes the distribution of request latencies. Its buckets are cumulative: le="0.1" counts every request that finished in 0.1 seconds or less, and le="0.2" counts every request that finished in 0.2 seconds or less, including those already counted under le="0.1". In the sample above, 10 - 5 = 5 requests took between 0.1 and 0.2 seconds. This tells you not just the average time but how latencies are spread, and it lets you estimate percentiles.
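To make the counter and histogram arithmetic concrete, here is a small sketch of what happens to the raw numbers. The function names are hypothetical, the second scrape value (45) is invented for illustration, and the reset handling is a simplification of what Prometheus does when a counter drops back to zero after a restart.

```python
def per_second_rate(earlier, later, interval_seconds):
    """Average per-second increase of a counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if later < earlier:
        # The counter went backwards, so the process restarted and began
        # counting from zero again. Simplification: count only the new value.
        return later / interval_seconds
    return (later - earlier) / interval_seconds


def per_bucket_counts(cumulative):
    """Turn cumulative (le, count) pairs into counts for each latency interval."""
    result = []
    previous_bound, previous_count = 0.0, 0
    for bound, count in sorted(cumulative):
        result.append(((previous_bound, bound), count - previous_count))
        previous_bound, previous_count = bound, count
    return result


# The two buckets shown in the sample output for /api/generate, status 200.
buckets = [(0.1, 5), (0.2, 10)]
print(per_bucket_counts(buckets))   # [((0.0, 0.1), 5), ((0.1, 0.2), 5)]

# Counter read 15 at one scrape and (hypothetically) 45 thirty seconds later.
print(per_second_rate(15, 45, 30))  # 1.0
```

In Prometheus itself you would normally let the query language do this work; the sketch only shows the arithmetic behind it.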

The ollama_model_inference_duration_seconds metric is crucial for understanding LLM performance. It measures how long it takes for the model to process a prompt and generate a response. High inference times directly impact user experience and can indicate a need for hardware upgrades or model optimization. The ollama_model_load_duration_seconds and ollama_model_load_status metrics are vital for managing your model catalog. They show how long it takes to bring a model into memory and whether models are successfully loaded, which is essential for a responsive service.

A subtle but important gap is throughput. There isn’t a direct metric for "tokens processed per second", but the ollama_model_inference_duration_seconds metric, combined with knowledge of the prompt and response lengths for specific requests, lets you derive one. If you log request/response sizes alongside Prometheus scrape data (or use a more advanced tracing system), you can correlate token counts with inference durations to get a per-token processing speed. For instance, if a request for a 1000-token response took 10 seconds, and you know the prompt was 50 tokens, that is (50 + 1000) / 10 = 105 tokens processed per second for that specific model and request. Treat this as a blended figure: prompt processing and token generation typically run at different speeds, so counting only the response tokens (1000 / 10 = 100 tokens per second) gives a different, generation-only number.
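The calculation above is small enough to script. This is a sketch under the same assumptions as the example: the token counts come from your own request logs, not from the metrics endpoint, and tokens_per_second is a hypothetical helper name.

```python
def tokens_per_second(prompt_tokens, response_tokens, duration_seconds):
    """Blended throughput: all tokens handled divided by inference time."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return (prompt_tokens + response_tokens) / duration_seconds


# The example from the text: 50-token prompt, 1000-token response, 10 seconds.
print(tokens_per_second(50, 1000, 10))  # 105.0

# Generation-only view: ignore the prompt tokens.
print(tokens_per_second(0, 1000, 10))   # 100.0
```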

The ollama_build_info metric is a simple gauge whose value is always 1; the useful part is its version label, which tells you the version of Ollama currently running. This is incredibly useful for debugging, as you can quickly correlate performance issues or unexpected behavior with a specific Ollama version.

Always ensure your Prometheus configuration scrapes the /metrics endpoint at an appropriate interval, typically between 15 and 60 seconds. Shorter intervals give finer resolution but add load on both Prometheus and Ollama; longer ones can hide short spikes.

Once you have these metrics in Prometheus, you’ll want to visualize them in Grafana. You’ll create panels for request rates, latency distributions, inference durations per model, and model load times. With the data in place you can also define alert rules for anomalies, such as sudden spikes in error rates or unusually long inference times. When a rule fires, a tool like Alertmanager routes the resulting notification to the right channel.
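As an illustration of the kind of condition an error-rate alert checks, here is a sketch using the request counts from the sample output. The 5% threshold and the function name are assumptions made for the example, and a real alert rule would look at the increase over a recent time window instead of lifetime totals.

```python
def error_ratio(requests_by_status):
    """Share of requests whose HTTP status code is in the 5xx range."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if status.startswith("5")
    )
    return errors / total


# Counts for POST /api/generate taken from the sample output.
counts = {"200": 15, "500": 2}

# Hypothetical threshold: alert when more than 5% of requests fail.
THRESHOLD = 0.05

ratio = error_ratio(counts)
print(round(ratio, 3), ratio > THRESHOLD)  # 0.118 True
```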

The next step after monitoring is usually optimizing. You’ll likely start looking into metrics like ollama_model_inference_duration_seconds and correlating them with ollama_http_request_duration_seconds to pinpoint if the bottleneck is the model itself or the surrounding HTTP serving infrastructure.
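One rough way to frame that comparison is to ask what share of the total request time is not spent in inference. The sketch below assumes you already have a typical (for example, mean) duration for each metric over the same period; the numbers and the function name are hypothetical.

```python
def overhead_share(http_seconds, inference_seconds):
    """Fraction of HTTP request time that is not model inference."""
    if http_seconds <= 0:
        raise ValueError("http_seconds must be positive")
    # Clamp at zero in case measurement noise makes inference look longer.
    overhead = max(http_seconds - inference_seconds, 0.0)
    return overhead / http_seconds


# Hypothetical averages: 2.0 s per HTTP request, 1.5 s of it in inference.
print(overhead_share(2.0, 1.5))  # 0.25
```

A small share points at the model itself as the bottleneck; a large share suggests looking at queuing, model loading, or the serving layer around it.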