A Prometheus remote storage read fails when the query router, prom-gateway, cannot retrieve data from the configured remote read endpoint, querier-service. This is critical because Prometheus can no longer serve the historical metrics held in remote storage, so any query that reaches beyond local data fails or returns incomplete results.

The most common culprit is a network misconfiguration. Prometheus itself is likely fine, but it’s being blocked from talking to querier-service.

Cause 1: Network Policy Blocking

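  • Diagnosis: Check the network policies that apply to Prometheus. In Kubernetes, these are NetworkPolicy objects. NetworkPolicies are allow-lists and have no explicit deny rules: once any policy with policyTypes: [Egress] selects the Prometheus pod, all egress that no rule allows is dropped. Look for such a policy that does not allow traffic to the querier-service’s namespace or port.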
    kubectl get networkpolicy -n <prometheus-namespace>
    kubectl describe networkpolicy <networkpolicy-name> -n <prometheus-namespace>
    
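    You’re looking for a policy with policyTypes: [Egress] that selects the Prometheus pod but has no egress rule allowing traffic to the querier-service’s pods or IP range and port (usually 8080 or 9090). Also check the querier-service’s namespace for an ingress policy that does not admit traffic from Prometheus.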
  • Fix: Add or modify a NetworkPolicy to allow egress from Prometheus to the querier.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-prom-to-querier
      namespace: <prometheus-namespace>
    spec:
      podSelector: {} # Selects every pod in the namespace; prefer matchLabels with the Prometheus pod label
      policyTypes:
      - Egress
      egress:
      - to:
        - ipBlock:
            cidr: <querier-service-cidr> # e.g., 10.0.1.0/24 for the querier pod range; whether a Service ClusterIP matches depends on the CNI plugin
        ports:
        - protocol: TCP
          port: 8080 # Or the port your querier-service listens on
    
  • Why it works: Network Policies in Kubernetes act like firewalls. By explicitly allowing egress traffic on the correct port and destination, you’re removing the block that was preventing Prometheus from reaching its remote read backend.
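
An ipBlock rule depends on pod or service IPs staying stable. As an alternative sketch, the policy below selects the destination by labels and also allows DNS, since an egress policy that omits DNS would stop Prometheus from resolving the querier-service hostname. The policy name, the labels app.kubernetes.io/name: prometheus and app: querier-service, and the assumption that cluster DNS runs in kube-system are all illustrative; substitute whatever your cluster actually uses.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prom-to-querier-by-label   # hypothetical name
  namespace: <prometheus-namespace>
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # assumed Prometheus pod label
  policyTypes:
  - Egress
  egress:
  # Allow traffic to querier pods in the querier namespace.
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: <querier-service-namespace>
      podSelector:
        matchLabels:
          app: querier-service   # assumed querier pod label
    ports:
    - protocol: TCP
      port: 8080   # Or the port your querier pods listen on
  # Allow DNS lookups so the service hostname can be resolved.
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system   # assumes cluster DNS runs here
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

NetworkPolicies are additive, so this only adds allowed destinations to those already permitted by other policies selecting the same pod. If no other egress policy selects Prometheus, applying this one on its own would limit Prometheus to querier and DNS traffic and block its scrapes.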

Cause 2: Incorrect remote_read Configuration in Prometheus

  • Diagnosis: Examine Prometheus’s configuration file (usually prometheus.yml), or view the configuration Prometheus actually loaded on its Status > Configuration page or at /api/v1/status/config. The remote_read section must point to the correct URL of your remote read endpoint.
    # Example prometheus.yml snippet
    remote_read:
      - url: "http://querier-service.monitoring.svc.cluster.local:8080/read"
        remote_timeout: 30s
        read_recent: false # Default: skip remote reads for ranges that local storage fully covers
    
    Verify the hostname, port, and path (/read) precisely match the querier-service’s exposed endpoint.
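  • Fix: Correct the url parameter in your Prometheus configuration. For example, if querier-service is in the monitoring namespace and exposed on port 8080, the URL should be http://querier-service.monitoring.svc.cluster.local:8080/read. Ensure remote_timeout is also reasonable (e.g., 30s; it defaults to 1m when unset). Reload Prometheus afterwards, by sending SIGHUP or by calling /-/reload if --web.enable-lifecycle is set, so the change takes effect.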
  • Why it works: Prometheus needs the exact address to know where to send its read requests. A typo or incorrect service discovery name means it’s trying to connect to a non-existent or wrong address.

Cause 3: querier-service is Unhealthy or Not Ready

  • Diagnosis: Check the health and readiness of the querier-service pods.
    kubectl get pods -n <querier-service-namespace> -l app=<querier-service-label>
    kubectl describe pod <querier-pod-name> -n <querier-service-namespace>
    kubectl logs <querier-pod-name> -n <querier-service-namespace>
    
    Look for pods in CrashLoopBackOff or Error, or pods that are not both Running and Ready. Check the logs for errors connecting to the querier’s own data source (e.g., object storage, database).
  • Fix: Troubleshoot the querier-service itself. This might involve restarting the pods, scaling them up if they’re overloaded, or fixing errors in their logs (e.g., authentication issues with object storage, database connection problems).
  • Why it works: If the querier-service isn’t running or is failing to start, it cannot respond to Prometheus’s read requests, leading to the error.

Cause 4: Resource Exhaustion on querier-service

  • Diagnosis: Monitor the CPU and memory usage of the querier-service pods. If they are consistently hitting their resource limits, they may become unresponsive or start failing requests.
    kubectl top pods -n <querier-service-namespace> -l app=<querier-service-label>
    
    Also, check the resource requests and limits defined in the querier-service’s deployment.
  • Fix: Increase the CPU and memory limits for the querier-service pods in its deployment configuration.
    # Example snippet from querier-service deployment
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi" # Increase this if it's hitting the limit
    
  • Why it works: When a service is starved of CPU or memory, it cannot process incoming requests efficiently, leading to timeouts and read failures. Providing more resources allows it to handle the load.

Cause 5: TLS/SSL Certificate Issues

  • Diagnosis: If querier-service is configured to use HTTPS, Prometheus might be failing to validate its certificate. Check Prometheus logs for errors like x509: certificate signed by unknown authority or remote error: tls: handshake failure.
  • Fix: Ensure Prometheus trusts the CA that signed the querier-service’s certificate. This might involve:
    • Configuring Prometheus to load a custom CA: Add ca_file under tls_config in the remote_read entry.
    • Ensuring the querier-service uses a certificate signed by a publicly trusted CA.
    • If using self-signed certificates, ensure the CA certificate is correctly distributed and referenced in Prometheus’s configuration.
  • Why it works: TLS ensures secure communication. If Prometheus cannot verify the identity of querier-service via its certificate, it will refuse to connect, similar to a network block.
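
As a rough sketch of the custom CA option, a remote_read entry can carry its own tls_config. The https URL and the certificate path below are assumptions for illustration; the CA file has to be mounted into the Prometheus container at whatever path you reference.

```yaml
remote_read:
  - url: "https://querier-service.monitoring.svc.cluster.local:8080/read"
    tls_config:
      # CA bundle that signed the querier-service certificate (assumed path).
      ca_file: /etc/prometheus/certs/querier-ca.crt
      # Only needed when the name in the certificate differs from the URL host.
      server_name: querier-service.monitoring.svc.cluster.local
```

Setting insecure_skip_verify: true in tls_config would also make the error disappear, but it turns off certificate verification entirely, so it is best kept to short-lived debugging.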

Cause 6: High Latency or Network Congestion Between Prometheus and querier-service

  • Diagnosis: Use ping or traceroute from the Prometheus pod to a querier-service pod IP (kubectl get pods -o wide lists pod IPs). Test against a pod IP because a Service ClusterIP is virtual and often does not answer ICMP. Both tools must be present in the Prometheus image. High latency or packet loss can cause requests to exceed Prometheus’s configured remote_timeout.
    # Exec into prometheus pod
    kubectl exec -it <prometheus-pod-name> -n <prometheus-namespace> -- ping -c 5 <querier-pod-ip>
    kubectl exec -it <prometheus-pod-name> -n <prometheus-namespace> -- traceroute <querier-pod-ip>
    
  • Fix: Increase the remote_timeout in Prometheus’s remote_read configuration. The default is 1m, so a value such as 90s or 2m might be necessary if network latency is consistently high.
    remote_read:
      - url: "http://querier-service.monitoring.svc.cluster.local:8080/read"
        remote_timeout: 2m # Raised from the 1m default
    
  • Why it works: The remote_timeout dictates how long Prometheus will wait for a response. If network conditions are poor, requests can take longer than the configured timeout (1m by default), causing Prometheus to abandon the request and report an error.
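
To see slow reads before they turn into timeouts, one option is an alerting rule on Prometheus’s own remote read duration histogram. The sketch below assumes the metric prometheus_remote_storage_read_request_duration_seconds is exposed by your Prometheus version (check its /metrics page to confirm); the group name, alert name, and the 30s threshold are illustrative choices, not recommendations from the steps above.

```yaml
groups:
  - name: remote-read-latency   # hypothetical group name
    rules:
      - alert: RemoteReadSlow   # hypothetical alert name
        # p99 remote read duration over the last 5 minutes.
        # Metric name is an assumption; verify it on /metrics first.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (
              rate(prometheus_remote_storage_read_request_duration_seconds_bucket[5m])
            )
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote read p99 latency has been above 30s for 10 minutes"
```

Keeping the threshold well below remote_timeout gives some warning before requests start to be abandoned.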

Once these causes are resolved, the next error you are likely to see is "context deadline exceeded", which appears when the underlying storage for querier-service is slow or unavailable.
