Pushgateway is a terrible idea for most use cases, and you should probably avoid it.
Let’s see how it actually works, and why that makes it so problematic. Imagine you have a batch job, something that runs once an hour, calculates a metric, and then exits.
#!/bin/bash
# Simulate some work
sleep 10
# Calculate a metric
METRIC_VALUE=$(shuf -i 1-100 -n 1)
# Push the metric to Prometheus Pushgateway (the request body must end with a newline)
printf 'metric_name{label1="value1"} %s\n' "$METRIC_VALUE" | curl --data-binary @- "http://pushgateway.example.com:9091/metrics/job/my_batch_job/instance/unique_id_$(date +%s)"
This script does its job, pushes a metric, and then dies. Prometheus, by default, pulls metrics from targets: it scrapes configured endpoints periodically. But what if your job can’t be scraped because it’s ephemeral? That’s where Pushgateway comes in. It provides an endpoint where clients can push metrics; Pushgateway holds them and exposes them for Prometheus to scrape.
The core problem: Pushgateway is a metrics cache, not a store with a retention policy. When your job pushes a metric, Pushgateway holds it and serves it to Prometheus on every scrape. A pushed group stays there until a later push to the same grouping key (the job plus any labels in the URL path) replaces it, or until someone deletes it explicitly through the API. Most importantly, Pushgateway has no built-in mechanism to expire old metrics from jobs that have stopped running.
This means Pushgateway becomes a graveyard of stale metrics. If your my_batch_job runs, pushes a metric, and then fails to run again for a while, or if you change the labels on its metrics, those old metrics will stick around in Pushgateway indefinitely. Prometheus will keep scraping them, and you’ll see metrics from jobs that are long dead, or metrics with outdated label combinations.
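One way to see how much of the graveyard has already formed is to look at the `push_time_seconds` series that Pushgateway exposes for each group, which records when that group was last pushed. The following is a minimal sketch, not an official tool: the function name `stale_groups` and the age threshold are illustrative assumptions, and it simply filters the text that Pushgateway serves on its `/metrics` endpoint.

```bash
# stale_groups MAX_AGE_SECONDS [NOW_EPOCH]
# Hypothetical helper: reads Pushgateway /metrics text on stdin and prints the
# label set of every group whose last push is older than MAX_AGE_SECONDS.
stale_groups() {
  local max_age="$1"
  local now="${2:-$(date +%s)}"   # allow overriding "now" for testing
  awk -v max_age="$max_age" -v now="$now" '
    # Only look at sample lines of the push_time_seconds metric.
    index($0, "push_time_seconds{") == 1 {
      age = now - $NF               # $NF is the push time, possibly as 1.7e+09
      if (age > max_age) {
        labels = substr($0, length("push_time_seconds") + 1)
        sub(/ [^ ]+$/, "", labels)  # drop the trailing sample value
        printf "%s age=%ds\n", labels, age
      }
    }
  '
}
```

A possible invocation, assuming the same host as in the script above, would be `curl -s http://pushgateway.example.com:9091/metrics | stale_groups 7200`, which lists every group that has not been pushed in the last two hours.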
The common advice is to use Pushgateway for metrics from short-lived jobs (like batch processes or tests) that cannot be scraped directly. The intent is that the job pushes its final state, and then the metric is no longer relevant.
But here’s the trap: if the job stops running after its last push, whether because it keeps failing before it reaches the push step or because it was removed from your infrastructure without a corresponding cleanup of its Pushgateway metrics, those metrics linger. You end up with a growing collection of useless, potentially misleading data.
The intended workflow looks like this:
- A short-lived job starts.
- It performs some work and calculates metrics.
- It pushes these metrics to Pushgateway.
- The job exits.
- Prometheus scrapes Pushgateway and observes the metrics.
- Eventually, the job runs again and pushes new metrics, which overwrite the old ones as long as the push uses the same grouping key (the job and instance in the URL).
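The overwrite in the last step only happens when every run pushes to the same URL. Below is a minimal sketch of that workflow, assuming a stable instance name per host or job; the function name `push_final_state`, the `PUSHGATEWAY_URL` variable, and the example values are illustrative, not part of Pushgateway. It uses PUT, which replaces all metrics in the group, whereas POST only replaces metrics that share a name with the newly pushed ones. It also assumes the job and instance values contain no slashes or other characters that need escaping in a URL.

```bash
# Base URL of the Pushgateway (assumed; same host as in the script above).
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.example.com:9091}"

# push_final_state JOB INSTANCE METRIC_NAME VALUE
# Hypothetical helper: pushes one sample to a stable grouping key so that the
# next run replaces it instead of creating a new group.
push_final_state() {
  local job="$1" instance="$2" metric="$3" value="$4"
  # The text format requires every line, including the last, to end in "\n".
  printf '%s %s\n' "$metric" "$value" |
    curl --silent --show-error --fail -X PUT --data-binary @- \
      "${PUSHGATEWAY_URL}/metrics/job/${job}/instance/${instance}"
}
```

At the end of the batch job this could be called as `push_final_state my_batch_job batch-host-1 metric_name "$METRIC_VALUE"`. Because the grouping key never changes, Pushgateway holds exactly one group for the job, although that group still goes stale if the job stops running.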
The reality is often:
- Job runs, pushes metrics.
- Job exits.
- Job fails to run again.
- Pushgateway still holds the old metrics.
- Prometheus shows metrics from a dead job.
If you absolutely must use Pushgateway, you need a strategy for metric expiration, because Pushgateway will not provide one. One approach is to use a unique instance label for each execution of your job, often derived from a timestamp or a unique job run ID, so that every run lands in its own group that can be deleted individually.
# Example with a unique instance label per run
INSTANCE_ID="run_$(date +%Y%m%d_%H%M%S)_$$"
# The job and instance labels are taken from the URL path, so the body does not repeat them
printf 'my_ephemeral_metric 1\n' | curl --data-binary @- "http://pushgateway.example.com:9091/metrics/job/my_batch_job/instance/$INSTANCE_ID"
By making each job run push to a unique instance within its job label, you give every run its own group, which can be removed with an HTTP DELETE request to the same URL that was used for the push. The cost is that no run ever overwrites a previous one, so groups pile up even faster, and Pushgateway itself doesn’t automatically delete these old instances when the job is truly gone. You’d still need a separate cleanup mechanism, perhaps another script that periodically queries Pushgateway for metrics associated with jobs that are no longer active and deletes them. This is complex and error-prone.
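The deletion half of such a cleanup script can be sketched as follows. This is an illustration under assumptions rather than a complete solution: the function name `delete_group` and the `PUSHGATEWAY_URL` variable are made up for the example, the job and instance values are assumed to be URL-safe, and deciding which groups are actually dead is left to the caller.

```bash
# Base URL of the Pushgateway (assumed; same host as in the examples above).
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.example.com:9091}"

# delete_group JOB INSTANCE
# Hypothetical helper: deletes the single group identified by exactly this
# job and instance. The grouping key must match the one used for the push.
delete_group() {
  local job="$1" instance="$2"
  curl --silent --show-error --fail -X DELETE \
    "${PUSHGATEWAY_URL}/metrics/job/${job}/instance/${instance}"
}
```

For example, `delete_group my_batch_job "$INSTANCE_ID"` removes the group created by the push above. Note that a DELETE on `/metrics/job/my_batch_job` alone would not remove it, because that URL addresses only the group whose grouping key is the job label with no instance.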
The fundamental issue is that Pushgateway forces you to think about metric lifecycle management on the client side, which is where it belongs least. When a job dies, its metrics should ideally just stop appearing, not linger in a special-purpose sink.
The most common way Pushgateway causes pain is by silently accumulating stale data from failed or retired services, leading to incorrect alerts and dashboards that reflect historical states rather than current reality. This accumulation is insidious because it doesn’t break anything immediately; it just degrades the reliability of your monitoring over time.
If you find yourself needing Pushgateway, pause and consider whether there’s any way to expose a scraping endpoint, even for a short duration, or whether the metric can be aggregated and pushed by a more stable, long-lived service.