The up metric in Prometheus is a binary indicator: 1 if the scrape was successful, 0 if it failed.
Let’s watch it in action. Imagine we have a target we’re trying to scrape, http://localhost:9090/metrics.
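# In a separate terminal. http.server only serves files from the current directory,
# so create a file named "metrics" first; otherwise /metrics returns 404.
echo 'demo_metric 1' > metrics
python3 -m http.server 9090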
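If you would rather have the test target answer /metrics with an explicit text Content-Type (newer Prometheus versions can be stricter about that header), a small handler does the job. This is a minimal sketch, not an official exporter: the metric name demo_requests_total, its value, and the helper make_server are made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample metric in the Prometheus text exposition format.
METRICS_BODY = b"# TYPE demo_requests_total counter\ndemo_requests_total 42\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with a fixed body and 404 for everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(METRICS_BODY)))
        self.end_headers()
        self.wfile.write(METRICS_BODY)

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def make_server(port=9090):
    """Build (but do not start) the test server on the given port."""
    return HTTPServer(("localhost", port), MetricsHandler)
```

To run it, call make_server().serve_forever() at the end of the script and stop it with Ctrl+C, just like the one-liner above.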
Now, configure Prometheus to scrape it. In your prometheus.yml:
scrape_configs:
  - job_name: 'local_test'
    static_configs:
      - targets: ['localhost:9090']
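Start Prometheus. Port 9090 is also Prometheus’s default listen port and the test server already occupies it, so start Prometheus on another port, for example with --web.listen-address=:9091. If everything is configured correctly and the target serves valid metrics at /metrics, querying up in the expression browser returns up{job="local_test", instance="localhost:9090"} 1.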
If you stop the Python server:
# Press Ctrl+C in the terminal running the Python server
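At the next scrape, Prometheus will try to connect and fail. That happens within one scrape_interval, which defaults to 1 minute when the config does not set it. The up metric for that target will then drop to 0.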
up{job="local_test", instance="localhost:9090"} 0
This simple metric is the most fundamental indicator of Prometheus’s ability to reach and retrieve data from your targets. It’s the first line of defense in understanding your monitoring system’s health.
The up metric is generated by Prometheus itself, not by the target being scraped. When Prometheus attempts to scrape a target, it records the outcome. A scrape counts as successful only if the final response is HTTP 200 and the body parses as valid metrics; redirects are followed by default, so a 3xx on its own is not a result. Any failure (network error, timeout, a non-200 status code, or an unparseable response) sets up to 0.
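To make the rule concrete, here is a rough Python approximation of that decision. It is a sketch under stated assumptions, not Prometheus’s actual scraper: probe_up is a hypothetical helper, and it only checks that the request ends in HTTP 200, without parsing the body the way a real scrape does.

```python
import http.client
import urllib.error
import urllib.request


def probe_up(url, timeout=10.0):
    """Return 1 if a GET to url ends in HTTP 200, otherwise 0.

    Simplification: a real scrape also has to parse the response body.
    urlopen follows redirects, so only the final status is inspected.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1 if resp.status == 200 else 0
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, DNS failure, timeout, 4xx/5xx, malformed URL...
        return 0
```

Pointing it at the test target, probe_up("http://localhost:9090/metrics") should give 1 while the server is serving /metrics and 0 once it is stopped.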
The real power comes from querying this metric. To see all targets that are currently down:
up == 0
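The same query can be run from a script through Prometheus’s HTTP API (the /api/v1/query endpoint). The sketch below assumes you pass in the base URL where your Prometheus server listens; the function names down_targets and query_down_targets are illustrative, not part of any client library.

```python
import json
import urllib.parse
import urllib.request


def down_targets(api_response):
    """Extract (job, instance) pairs from an instant-query response for `up == 0`."""
    if api_response.get("status") != "success":
        raise ValueError(f"query failed: {api_response.get('error', 'unknown error')}")
    return [
        (series["metric"].get("job", ""), series["metric"].get("instance", ""))
        for series in api_response["data"]["result"]
    ]


def query_down_targets(base_url):
    """Run `up == 0` against the Prometheus server at base_url (caller-supplied)."""
    params = urllib.parse.urlencode({"query": "up == 0"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return down_targets(json.load(resp))
```

Keeping the parsing in down_targets separate from the network call makes it easy to check against a saved response without a running server.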
To alert when any target has been down for more than 5 minutes (add a label matcher such as up{job="local_test"} == 0 to limit it to one service):
groups:
  - name: availability  # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
This alert uses the for: 5m clause, meaning the condition up == 0 must hold at every rule evaluation for 5 minutes before the alert fires; until then the alert is only pending. This keeps transient network glitches from triggering notifications.
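The effect of the for clause is easier to see with a toy model. The function below is a simplified sketch of the pending-then-firing behaviour, not Prometheus’s rule engine: first_firing_time is a hypothetical name, and the input is just a list of evaluation timestamps paired with whether up == 0 held at that evaluation.

```python
def first_firing_time(evaluations, for_seconds):
    """Return the timestamp at which the alert first fires, or None.

    evaluations: list of (timestamp_seconds, condition_true) in time order.
    The alert becomes pending at the first true evaluation, resets on any
    false evaluation, and fires once it has been pending for for_seconds.
    """
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None  # condition cleared: pending state is dropped
            continue
        if active_since is None:
            active_since = timestamp  # alert becomes pending
        if timestamp - active_since >= for_seconds:
            return timestamp  # pending long enough: alert fires
    return None


# Down continuously, evaluated once a minute: fires 5 minutes after going pending.
steady = [(t, True) for t in range(0, 361, 60)]
print(first_firing_time(steady, 300))  # 300

# A single good evaluation at t=120 resets the clock, so nothing fires here.
flapping = [(0, True), (60, True), (120, False), (180, True), (240, True)]
print(first_firing_time(flapping, 300))  # None
```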
Understanding up is crucial for building reliable monitoring. It tells you if Prometheus can even talk to your application’s metrics endpoint. A sea of 0s here means your Prometheus setup isn’t collecting any data, regardless of how well your applications are instrumented.
The up metric also works with dynamic service discovery (Kubernetes, Consul, EC2, and so on). Prometheus records up for every target that discovery returns, so you can monitor dynamically managed infrastructure without listing targets by hand. If a Kubernetes pod is replaced, discovery picks up the new pod’s address and Prometheus starts scraping it under a new instance label, with up reflecting the outcome of its first scrape. The old target’s up series does not drop to 0; once the target leaves discovery, the series simply stops receiving samples.
A common pitfall is mistaking up == 0 for a problem within the target application. While a target being down is often a symptom of an application issue, the up metric itself is a statement about Prometheus’s connectivity to that target. The target might be perfectly healthy but unreachable due to network policy, firewall rules, or Prometheus itself being misconfigured to point to the wrong address.
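Consider the scrape configuration itself. If you have scrape_interval: 15s and evaluation_interval: 10s, a failure is recorded at the next scrape, up to 15 seconds after the target goes down (longer if the scrape has to wait for scrape_timeout rather than being refused outright). An alerting rule sees up == 0 at its next evaluation, up to 10 seconds after that. In the worst case the alert becomes pending roughly 25 seconds after the target went down, and any for clause adds its duration on top before the alert fires.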
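The timing can be worked through with a few lines of arithmetic. This is a back-of-the-envelope sketch with a made-up helper name, worst_case_alert_delay; it assumes the failed scrape returns immediately (for example, connection refused) and ignores scrape_timeout and notification delays.

```python
import math


def worst_case_alert_delay(scrape_interval, evaluation_interval, for_duration=0.0):
    """Upper bound, in seconds, from a target going down to the alert firing.

    Worst case: the target dies right after a scrape (wait one scrape_interval),
    the failed sample lands right after a rule evaluation (wait one
    evaluation_interval), then the alert stays pending until the first
    evaluation at or past for_duration.
    """
    pending_wait = math.ceil(for_duration / evaluation_interval) * evaluation_interval
    return scrape_interval + evaluation_interval + pending_wait


print(worst_case_alert_delay(15, 10))       # 25
print(worst_case_alert_delay(15, 10, 300))  # 325
```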
When you query up, you’ll notice labels like job and instance are automatically attached. These are crucial for distinguishing which target is failing. If you have multiple instances of a service, up will show 0 for each individual failing instance.
The up metric is a simple boolean, but a lot rests on it. Every other metric from a target depends on the scrape that up reports on. Without it, you cannot tell a quiet service from one whose metrics never arrived.
The next logical step after ensuring all your targets are up is to investigate the quality of the data being scraped, which involves looking at metrics from your targets themselves.