Prometheus high availability (HA) is typically achieved by running two or more identical Prometheus instances that scrape the same targets independently and forward their data to a long-term storage system such as Thanos or Cortex. A load balancer or query layer in front of the replicas serves queries.

Let’s dive into how this works and what makes it tick.

The Problem: A Single Prometheus is a Single Point of Failure

Imagine your entire monitoring system is a single Prometheus server. If that server goes down for maintenance, a bug, or even a hardware failure, your monitoring stops. Alerts won’t fire, dashboards go blank, and you’re effectively flying blind. This is unacceptable for any production environment.

The Solution: Redundancy and Centralized Storage

The standard HA approach involves three key components:

  1. Multiple Prometheus Instances: You run at least two (preferably more) identical Prometheus servers. Each server scrapes the same targets.
  2. Load Balancer: A load balancer sits in front of these Prometheus instances and distributes incoming query and UI traffic across them. It plays no part in scraping, because Prometheus pulls metrics from targets itself. If one Prometheus server is down, the load balancer sends queries to the healthy ones.
  3. Long-Term Storage (Thanos/Cortex): Instead of Prometheus storing data locally indefinitely (which becomes a storage and management nightmare), it’s configured to send its metrics to a separate, highly available, long-term storage system. Thanos and Cortex are popular choices here. They aggregate data from multiple Prometheus instances, provide querying capabilities across all historical data, and handle long-term retention.

How it Works in Practice: The Scrape and Send Dance

Let’s look at a common configuration.

Prometheus Configuration (prometheus.yml)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['192.168.1.10:8080', '192.168.1.11:8080']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.1.20:9100', '192.168.1.21:9100']

remote_write:
  - url: "http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive" # Or your Cortex endpoint

Key Points:

  • scrape_configs: These are identical across all your Prometheus HA instances. Each Prometheus server will scrape 192.168.1.10:8080, 192.168.1.11:8080, etc. This is intentional; it means each Prometheus has a full copy of the recent data.

  • remote_write: This is crucial. It tells Prometheus to send all scraped metrics to your Thanos Receive or Cortex endpoint. This is how data gets to your long-term storage.

  • remote_read (optional, for querying): Prometheus can also read older data back from a remote store through a top-level remote_read section in prometheus.yml. No command-line flag is needed to enable it. The endpoint must implement the Prometheus remote-read API. Thanos Query does not, so in a Thanos setup dashboards normally query Thanos Query directly instead of going through remote_read. Where it applies, the section looks like this (the URL is a placeholder):

    remote_read:
      - url: "http://<remote-read-endpoint>/api/v1/read"
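
Because every replica scrapes the same targets and writes the same series, the storage layer needs a way to tell the replicas apart. A minimal sketch of how that is commonly done with external_labels follows; the label values (prod-eu-1, prometheus-a) are hypothetical, and the label names are an assumption that has to match whatever your deduplication layer expects.

```yaml
# prometheus.yml on replica A.
# Replica B uses the same file with only the `replica` value changed.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod-eu-1      # hypothetical; identical on every replica in the HA pair
    replica: prometheus-a   # hypothetical; must be unique per replica

remote_write:
  - url: "http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive"
```

With Thanos, the replica label name is passed to Thanos Query so it can merge the duplicate series at query time. With Cortex, the HA tracker by default looks for labels named cluster and __replica__, so the label names above would need to be adjusted or the tracker reconfigured.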
    

Thanos/Cortex Side: Receive and Query

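  • Thanos Receive (or the Cortex distributor and ingesters): Thanos Receive listens for remote_write requests from your Prometheus instances, keeps recent data in its own TSDB, and uploads blocks to object storage (such as S3 or GCS). In Cortex, the distributor accepts the writes and forwards them to ingesters. Because every replica sends the same series, the duplicates have to be removed somewhere: Thanos does this at query time in Thanos Query, based on a replica label, while Cortex can drop samples from all but one replica at the distributor through its HA tracker.
  • Thanos Query (or the Cortex query frontend and queriers): This is what Grafana and other dashboards connect to. Thanos Query fans out to Thanos Receive for recent data and to the Thanos Store Gateway for historical blocks in object storage. In a sidecar-based deployment, it reaches recent data through the Thanos Sidecar running next to each Prometheus instead. Cortex queriers similarly combine recent data from ingesters with older data from long-term storage.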
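
As a rough illustration of the query side, here is a sketch of the container arguments for Thanos Query in a Kubernetes pod spec. It assumes the replicas carry an external label named replica, that Thanos Receive and a Store Gateway expose gRPC on port 10901, and that the service names shown exist; all of those are assumptions, not part of the setup described above. Older Thanos releases used --store where newer ones use --endpoint.

```yaml
# Fragment of a Deployment pod spec (not a complete manifest).
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:<version>   # pin to the release you actually run
    args:
      - query
      - --http-address=0.0.0.0:10902          # Grafana points at this port
      - --grpc-address=0.0.0.0:10901
      # Series that differ only in this label are merged at query time.
      - --query.replica-label=replica
      # Hypothetical service names for recent and historical data.
      - --endpoint=thanos-receive.monitoring.svc.cluster.local:10901
      - --endpoint=thanos-store-gateway.monitoring.svc.cluster.local:10901
```

Grafana would then use the Thanos Query HTTP address as an ordinary Prometheus data source.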

The Load Balancer:

A standard HTTP load balancer (like HAProxy, Nginx, or a cloud provider’s LB) sits in front of your Prometheus instances.

  • Not for scrapes: Prometheus uses a pull model. Each replica initiates its own HTTP requests to the targets' /metrics endpoints, so the load balancer is not in the collection path and targets are never pointed at it.
  • For queries and the UI: The load balancer spreads query traffic and Prometheus UI access across the healthy replicas, which is useful for recent data. HA for collection comes from having multiple Prometheis scrape the same targets independently, and remote_write from each replica ensures data still reaches long-term storage even if one Prometheus fails.
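
In Kubernetes, the query-side load balancer can be as simple as a Service that selects all Prometheus replicas. The sketch below assumes the pods are labeled app: prometheus and listen on port 9090; both are assumptions. The readiness probe uses Prometheus's /-/ready endpoint so that a replica that is still starting up is taken out of rotation.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus        # hypothetical label shared by all replicas
  ports:
    - name: web
      port: 9090
      targetPort: 9090
---
# Fragment of the Prometheus container spec (not a complete manifest).
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 10
```

Since the replicas scrape at slightly different moments, two queries routed to different replicas can return slightly different values, which is one reason dashboards are usually pointed at the deduplicating query layer instead.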

Common Causes of HA Failure (and How to Fix Them)

  1. Thanos/Cortex Receive Down:

    • Diagnosis: Check the logs of your Thanos Receive or Cortex ingester pods/processes. Look for errors related to writing to object storage or network connectivity.
    • Fix: Ensure your Thanos Receive/Cortex deployment has sufficient replicas and resource limits. If using object storage, verify credentials and network access. For example, if using S3, ensure the IAM role or access keys are correctly configured and have s3:PutObject permissions on the target bucket.
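    • Why it works: Prometheus feeds remote_write from its write-ahead log (WAL) and retries failed sends, so short outages are absorbed. The backlog is bounded, though. Once the outage outlasts the data still held in the WAL (on the order of a couple of hours with default settings), the oldest samples are no longer sent to the remote endpoint, although they remain in the local TSDB of each Prometheus. Keeping the receive path healthy ensures a complete copy reaches long-term storage.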
  2. Object Storage Issues:

    • Diagnosis: Check the logs of the component that uploads to object storage (Thanos Receive, a Thanos Sidecar, or the Cortex ingesters; with remote_write, Prometheus itself never talks to the bucket) for errors like failed to upload block, access denied, or network timeouts when writing to S3/GCS/etc.
    • Fix: Verify your object storage credentials, bucket permissions, and network connectivity. Ensure the region is correctly configured if applicable. If using S3, a common fix for access denied is to ensure the IAM user/role has s3:PutObject, s3:GetObject, s3:ListBucket, and s3:DeleteObject permissions on the relevant bucket and prefix.
    • Why it works: Thanos and Cortex rely on object storage for durable, long-term metric storage. If the uploading component can't write to it, blocks pile up on that component's local disk and are at risk once the disk fills or the node is lost.
  3. Prometheus remote_write Configuration Error:

    • Diagnosis: Check Prometheus logs for errors like unsupported protocol scheme, connection refused, or 404 Not Found when connecting to the remote_write URL.
    • Fix: Double-check the url in your remote_write configuration. Ensure the scheme (http or https), host, port, and path are correct for your Thanos Receive or Cortex endpoint. For Cortex, Prometheus writes to the distributor, not the ingesters, at the /api/v1/push path. For instance, if your Cortex distributor is listening on port 19009 over http, the URL would be http://cortex-distributor.monitoring.svc.cluster.local:19009/api/v1/push.
    • Why it works: This is the pipe through which Prometheus sends its data to long-term storage. A misconfiguration here means data never leaves Prometheus and is lost once local retention expires or the local disk is lost.
  4. Prometheus Local Storage Filling Up:

    • Diagnosis: Prometheus logs might show write error: ... No space left on device, or the process may run out of memory while struggling to compact blocks. Check free space on the data volume, and watch prometheus_tsdb_storage_blocks_bytes (on-disk block size) alongside prometheus_tsdb_head_chunks.
    • Fix: Lower --storage.tsdb.retention.time (e.g., to 24h or 48h, which is usually enough when long-term data lives in Thanos or Cortex), cap disk usage with --storage.tsdb.retention.size, or increase the disk size allocated to Prometheus. Raising retention makes the problem worse, not better.
    • Why it works: Prometheus keeps data locally for the configured retention period, which is 15 days by default. With Thanos or Cortex holding the history, the local TSDB only needs to cover recent data and any remote_write backlog, so a shorter retention frees disk space without losing anything that has already been shipped.
  5. Thanos/Cortex Query Performance Degradation:

    • Diagnosis: Dashboards in Grafana might load extremely slowly or time out. Check the query logs for Thanos Query or Cortex Queriers for errors or long-running queries.
    • Fix: This often points to undersized query components, inefficient queries from Grafana, or issues with the underlying object storage performance. Scale up your query components, optimize Grafana dashboards (e.g., shorten the default time range, raise the minimum step, or move expensive expressions into recording rules), or investigate object storage latency.
    • Why it works: While not a direct data loss scenario, a non-performant query layer makes the HA setup useless as users can’t access the data.
  6. Network Connectivity Between Prometheus and Thanos/Cortex:

    • Diagnosis: Prometheus logs showing connection refused, i/o timeout, or DNS resolution errors when trying to reach the remote_write endpoint.
    • Fix: Verify firewall rules, Kubernetes NetworkPolicies, or cloud security groups. Ensure the Prometheus pods can resolve and reach the Thanos Receive/Cortex ingester service FQDN and port. For example, in Kubernetes, check kubectl get networkpolicy to ensure traffic is allowed.
    • Why it works: remote_write needs a working network path to the endpoint. Prometheus retries through brief interruptions, but a sustained one stops data from reaching long-term storage.
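
Several of the failure modes above show up first as remote_write falling behind or failing, so they can be caught with alerting rules before data is lost. The following rules file is a sketch, not a tuned production setup: the thresholds, durations, and severity labels are assumptions, and the metric names match recent Prometheus releases (older versions expose some of them under different names). It would be loaded through rule_files in prometheus.yml and routed through an Alertmanager.

```yaml
groups:
  - name: remote-write-health
    rules:
      # Fires when the newest sample sent to the remote endpoint lags
      # more than two minutes behind the newest sample ingested locally.
      - alert: PrometheusRemoteWriteBehind
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
            max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 120
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Remote write on {{ $labels.instance }} is more than 2 minutes behind."

      # Fires when samples are failing to send and are not recoverable by retry.
      - alert: PrometheusRemoteWriteFailures
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Remote write on {{ $labels.instance }} is failing to send samples."
```

Because each replica evaluates these rules about itself, it helps to run them on every Prometheus in the HA group so that a single replica losing its path to storage is still noticed.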

The next problem you'll likely encounter after fixing these issues is empty panels ("no data points found") in Grafana, which happens if your data source points to the wrong Thanos/Cortex query endpoint, or if your Thanos Store Gateway or Cortex queriers are not properly configured to read from your object storage.
