Prometheus high availability (HA) is typically achieved by running multiple identical Prometheus instances, often behind a load balancer for queries, each of which forwards its data to a long-term storage solution like Thanos or Cortex.
Let’s dive into how this works and what makes it tick.
The Problem: A Single Prometheus is a Single Point of Failure
Imagine your entire monitoring system is a single Prometheus server. If that server goes down for maintenance, a bug, or even a hardware failure, your monitoring stops. Alerts won’t fire, dashboards go blank, and you’re effectively flying blind. This is unacceptable for any production environment.
The Solution: Redundancy and Centralized Storage
The standard HA approach involves three key components:
- Multiple Prometheus Instances: You run at least two (preferably more) identical Prometheus servers. Each server scrapes the same targets.
- Load Balancer: A load balancer sits in front of these Prometheus instances and distributes incoming query and UI requests across the healthy ones. Because Prometheus pulls metrics from its targets, the load balancer is not in the scrape path; its job is to ensure that if one Prometheus server is down, queries are simply sent to the others.
- Long-Term Storage (Thanos/Cortex): Instead of Prometheus storing data locally indefinitely (which becomes a storage and management nightmare), it’s configured to send its metrics to a separate, highly available, long-term storage system. Thanos and Cortex are popular choices here. They aggregate data from multiple Prometheus instances, provide querying capabilities across all historical data, and handle long-term retention.
How it Works in Practice: The Scrape and Send Dance
Let’s look at a common configuration.
Prometheus Configuration (prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['192.168.1.10:8080', '192.168.1.11:8080']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']
remote_write:
  - url: "http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive" # Or your Cortex endpoint
Key Points:
- scrape_configs: These are identical across all your Prometheus HA instances. Each Prometheus server will scrape 192.168.1.10:8080, 192.168.1.11:8080, etc. This is intentional; it means each Prometheus has a full copy of the recent data.
- remote_write: This is crucial. It tells Prometheus to send all scraped metrics to your Thanos Receive or Cortex endpoint. This is how data gets to your long-term storage.
- remote_read (for querying): While not shown above, Prometheus can also be configured to read data back from remote storage for querying. This is done with a remote_read section in prometheus.yml that lists one or more url entries, in the same form as remote_write; no command-line flag is required. It only works if the backend exposes the Prometheus remote read API, so in many Thanos/Cortex setups dashboards query the Thanos Query or Cortex query endpoint directly instead.
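Since every replica sends the same series, the storage layer usually needs a label that tells the replicas apart. A minimal sketch, assuming two replicas and the label names cluster and replica, which are chosen here for illustration (Thanos lets you pick the replica label; the Cortex HA tracker looks for cluster and __replica__ by default):

```yaml
# prometheus.yml on replica A. Replica B is identical except for the replica value.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod-eu-1      # hypothetical cluster name, identical on both replicas
    replica: prometheus-a   # differs per replica, e.g. prometheus-b on the other
```

External labels are attached to every sample sent through remote_write, so the receiving side can recognise two series as copies of each other.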
Thanos/Cortex Side: Receive and Query
- Thanos Receive (or Cortex Ingest): This component of Thanos (or Cortex’s distributors and ingesters) listens for remote_write requests from your Prometheus instances and writes the data to object storage (like S3, GCS) or its own distributed storage. Because multiple Prometheus servers send the same metrics, the duplicates have to be removed: Cortex can do this on ingestion for HA pairs, while Thanos does it at query time.
- Thanos Query (or Cortex Query Frontend/Querier): This is what your Grafana or other dashboards connect to. It queries data from the long-term storage (through Thanos Store Gateways reading object storage, or Cortex’s queriers). It can also query Prometheus instances directly, through their Thanos Sidecars, for recent data not yet uploaded.
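On the Thanos side, deduplication across replicas is typically switched on in the query component. The fragment below is a sketch of the container section of a Kubernetes pod spec, assuming each Prometheus replica sets an external label named replica and that Thanos Receive exposes gRPC on port 10901; the image tag and hostnames are placeholders. Older Thanos releases use --store where newer ones use --endpoint.

```yaml
# Fragment of a pod spec for Thanos Query (illustrative, not a full Deployment).
containers:
  - name: thanos-query
    image: "quay.io/thanos/thanos:<version>"   # placeholder, pin a real release tag
    args:
      - query
      - --http-address=0.0.0.0:10902
      # Series that differ only in this label are merged at query time.
      - --query.replica-label=replica
      # gRPC address of a component to query (assumed Receive service name).
      - --endpoint=thanos-receive.monitoring.svc.cluster.local:10901
```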
The Load Balancer:
A standard HTTP load balancer (like HAProxy, Nginx, or a cloud provider’s LB) sits in front of your Prometheus instances.
- For Scrapes: Prometheus is pull-based, so your target applications (e.g., my-app) do not send anything through the load balancer. Each Prometheus instance scrapes the targets’ metrics endpoints directly, and the primary HA for collection is achieved by having multiple Prometheus servers scrape the same targets independently. remote_write ensures data still gets to long-term storage even if one Prometheus fails.
- For Queries: The load balancer is used for querying Prometheus for its recent data, or for managing access to the Prometheus UI itself. It directs each incoming request to one of the healthy Prometheus instances, so a failed instance does not interrupt access.
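In Kubernetes, the load balancer role in front of the Prometheus replicas can be filled by a plain Service. This is a sketch only: the app: prometheus label and the monitoring namespace are assumptions about how the replicas are deployed.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus        # hypothetical label carried by every Prometheus replica pod
  ports:
    - name: web
      port: 9090
      targetPort: 9090
# A Service only routes to pods that pass their readiness probe. Pointing the
# probe at Prometheus's /-/ready endpoint takes an unhealthy replica out of rotation.
```

Note that two replicas scrape at slightly different moments, so consecutive queries routed to different replicas can return slightly different values for recent data.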
Common Causes of HA Failure (and How to Fix Them)
- Thanos/Cortex Receive Down:
  - Diagnosis: Check the logs of your Thanos Receive or Cortex ingester pods/processes. Look for errors related to writing to object storage or network connectivity.
  - Fix: Ensure your Thanos Receive/Cortex deployment has sufficient replicas and resource limits. If using object storage, verify credentials and network access. For example, if using S3, ensure the IAM role or access keys are correctly configured and have s3:PutObject permissions on the target bucket.
  - Why it works: remote_write reads samples from Prometheus’s write-ahead log (WAL) and keeps retrying while the remote endpoint is unavailable. However, the WAL only retains a limited window of data (roughly two hours by default). If the receive endpoint stays down too long, Prometheus will eventually drop data that was never sent. Keeping the receive healthy ensures data flows to long-term storage.
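How remote_write queues and retries samples can be tuned per endpoint with queue_config. The values below are illustrative rather than recommendations; the field names are Prometheus configuration options, and the URL is the one used earlier in this article.

```yaml
remote_write:
  - url: "http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive"
    queue_config:
      capacity: 10000             # samples buffered per shard before reads from the WAL block
      max_shards: 50              # upper bound on parallel senders
      max_samples_per_send: 2000  # batch size per request
      batch_send_deadline: 5s     # send a partial batch after this long
      min_backoff: 30ms           # first retry delay after a failed send
      max_backoff: 5s             # retry delay ceiling
```

Larger queues smooth over short outages at the cost of memory; they do not extend how long Prometheus can survive a receiver that stays down.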
- Object Storage Issues:
  - Diagnosis: Check the logs of the component that uploads to object storage (Thanos Receive or Sidecar, or the Cortex ingesters) for errors like failed to upload block, access denied, or network timeouts when writing to S3/GCS/etc.
  - Fix: Verify your object storage credentials, bucket permissions, and network connectivity. Ensure the region is correctly configured if applicable. If using S3, a common fix for access denied is to ensure the IAM user/role has s3:PutObject, s3:ListBucket, and s3:DeleteObject permissions on the relevant bucket and prefix (plus s3:GetObject for components that read blocks back).
  - Why it works: Thanos and Cortex rely on object storage for durable, long-term metric storage. If the uploading component can’t write to it, blocks pile up on local disk and data is eventually lost.
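For Thanos, bucket settings live in a small YAML file passed to each component with --objstore.config-file. A sketch for S3, where the bucket name and region are hypothetical and credentials are assumed to come from an IAM role rather than from keys in the file:

```yaml
type: S3
config:
  bucket: "my-metrics-bucket"             # hypothetical bucket name
  endpoint: "s3.eu-west-1.amazonaws.com"  # must match the bucket's region
  region: "eu-west-1"
  # access_key and secret_key are omitted here on the assumption that the
  # pod or instance gets credentials from its IAM role.
```

A mismatch between endpoint and bucket region is a frequent source of access or redirect errors, so it is worth checking alongside the IAM permissions.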
- Prometheus remote_write Configuration Error:
  - Diagnosis: Check Prometheus logs for errors like unsupported protocol scheme, connection refused, or 404 Not Found when connecting to the remote_write URL.
  - Fix: Double-check the url in your remote_write configuration. Ensure the protocol (http or https), port, and path are correct for your Thanos Receive or Cortex endpoint. For instance, Thanos Receive accepts writes at /api/v1/receive (as in the configuration above), while Cortex accepts them at /api/v1/push on its distributor rather than on the ingesters; the host and port depend on your deployment.
  - Why it works: This is the pipe through which Prometheus sends its data to long-term storage. A misconfiguration here means data never leaves Prometheus and exists only in local storage until local retention deletes it.
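Rather than waiting for someone to read the logs, failed sends can be alerted on. This rule is a sketch: the metric name below is the one used by recent Prometheus releases (older releases expose prometheus_remote_storage_failed_samples_total instead), and the threshold and durations are arbitrary.

```yaml
groups:
  - name: remote-write
    rules:
      - alert: PrometheusRemoteWriteFailing
        # Fires when any samples have failed to send over the last 5 minutes.
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "remote_write from {{ $labels.instance }} is failing to send samples"
```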
- Prometheus Local Storage Filling Up:
  - Diagnosis: Prometheus logs might show write error: ... No space left on device, or out of memory if it’s struggling to compact blocks. You can also check the prometheus_tsdb_head_chunks metric.
  - Fix: Since long-term data lives in Thanos/Cortex, lower --storage.tsdb.retention.time (e.g., to 24h or 48h) or cap disk usage with --storage.tsdb.retention.size, and make sure remote_write is reliably sending data. If the issue persists, increase the disk size allocated to Prometheus.
  - Why it works: Prometheus keeps data locally for its whole retention period (15 days by default), so disk usage grows with retention and series count. A shorter retention frees disk, but remote_write must keep up: it only has the WAL (roughly the last two hours) to send from, so if it is failing or slow for longer than that, samples are dropped before they reach long-term storage.
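Disk exhaustion is easier to handle when it is predicted rather than discovered. The rule below is a sketch that relies on node_exporter, which the example configuration already scrapes; the mountpoint /prometheus is a hypothetical path for the Prometheus data volume.

```yaml
groups:
  - name: prometheus-disk
    rules:
      - alert: PrometheusDiskFillingUp
        # Extrapolate the last 6h of free space 24h ahead; alert if it would hit zero.
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus data volume on {{ $labels.instance }} is predicted to fill within 24h"
```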
- Thanos/Cortex Query Performance Degradation:
  - Diagnosis: Dashboards in Grafana might load extremely slowly or time out. Check the query logs for Thanos Query or Cortex Queriers for errors or long-running queries.
  - Fix: This often points to undersized query components, inefficient queries from Grafana, or issues with the underlying object storage performance. Scale up your query components, optimize Grafana dashboards (e.g., narrow the time range, reduce panel resolution, or use recording rules), or investigate object storage latency.
  - Why it works: While not a direct data loss scenario, a non-performant query layer makes the HA setup useless as users can’t access the data.
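Recording rules precompute an expensive expression once per interval so dashboards read a single cheap series instead. A minimal sketch, assuming my-app exposes a counter called http_requests_total (a hypothetical metric name):

```yaml
groups:
  - name: my-app-recording
    interval: 1m
    rules:
      # Store the per-job request rate under a new, cheap-to-query series name.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total{job="my-app"}[5m]))
```

A Grafana panel can then query job:http_requests:rate5m directly rather than re-evaluating the rate over raw samples on every refresh.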
- Network Connectivity Between Prometheus and Thanos/Cortex:
  - Diagnosis: Prometheus logs showing connection refused, i/o timeout, or DNS resolution errors when trying to reach the remote_write endpoint.
  - Fix: Verify firewall rules, Kubernetes NetworkPolicies, or cloud security groups. Ensure the Prometheus pods can resolve and reach the Thanos Receive/Cortex ingester service FQDN and port. For example, in Kubernetes, check kubectl get networkpolicy to ensure traffic is allowed.
  - Why it works: The remote_write protocol requires a stable network connection. Any interruption prevents data from being sent to long-term storage.
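If NetworkPolicies are in use, an explicit allow rule for the remote_write path avoids silent drops. This sketch assumes both workloads run in the monitoring namespace and carry the hypothetical labels app: prometheus and app: thanos-receive; the port matches the remote_write URL used earlier.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-to-thanos-receive
  namespace: monitoring
spec:
  # Applies to the receiving pods.
  podSelector:
    matchLabels:
      app: thanos-receive
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 19291   # remote_write port from the example configuration
```

If egress from the Prometheus pods is also restricted, a matching Egress rule (including DNS) is needed on that side.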
The next error you’ll likely encounter after fixing these issues is a "no data points found" error in Grafana if your query configuration is pointing to the wrong Thanos/Cortex query endpoint, or if your Thanos Store Gateways/Cortex Queriers are not properly configured to read from your object storage.