Prometheus is failing to scrape metrics from a target, causing your "Target Down" alert to fire. This usually means Prometheus can’t establish a connection to the target service or the target service is rejecting Prometheus’s requests.
Here’s how to break down the problem and fix it:
1. Network Connectivity
Diagnosis: The most common culprit is a network issue preventing Prometheus from reaching the target.
Check: From the Prometheus server (or wherever Prometheus is running), try to curl the target’s metrics endpoint. For example, if your target is 192.168.1.100:9090 and its metrics path is /metrics, run:
curl http://192.168.1.100:9090/metrics
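If this fails with "No route to host" or simply times out, it’s a network problem. "Connection refused" means the host was reached but nothing accepted the connection on that port: either a firewall is rejecting it or the service isn’t listening (see section 2).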
Fix:
- Firewall Rules: Ensure your firewall (on the Prometheus server, the target server, or any network devices in between) allows traffic on the target’s port (e.g., 9090). On Linux, you might use one of:
ufw allow 9090/tcp
firewall-cmd --zone=public --add-port=9090/tcp --permanent && firewall-cmd --reload
- Network Routing: Verify that the Prometheus server can route traffic to the target’s IP address. ping <target_ip> can help here, though ICMP might be blocked. If ping fails, check your network configuration (ip route show on Linux).
- DNS Resolution: If you’re using hostnames in your Prometheus configuration, ensure DNS resolution is working correctly from the Prometheus server. Use dig <target_hostname> or nslookup <target_hostname>. If it fails, check your /etc/resolv.conf or your DNS server configuration.
Why it works: This step confirms that Prometheus can physically reach the target on the network. If it can’t, Prometheus has no hope of scraping metrics.
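The curl check can be scripted so that the exit code points at the layer that most likely failed. This is a minimal sketch, assuming a POSIX shell and curl on the Prometheus server. The function names classify_curl_exit and probe_target are hypothetical, and the mapping covers only a handful of curl’s documented exit codes.

```shell
# Map a curl exit code to the layer that most likely failed.
# Only a few of curl's documented exit codes are covered here.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns" ;;      # could not resolve the host name
    7)  echo "connect" ;;  # failed to connect (refused, no route to host)
    28) echo "timeout" ;;  # operation timed out
    *)  echo "other" ;;
  esac
}

# Request a metrics URL the way a scrape would, then classify the result.
# Usage: probe_target http://192.168.1.100:9090/metrics
probe_target() {
  curl --silent --output /dev/null --max-time 10 "$1"
  classify_curl_exit "$?"
}
```

A result of dns points at name resolution, while connect or timeout points at firewalls, routing, or a service that is not listening.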
2. Target Service Not Running or Listening
Diagnosis: The application on the target machine that’s supposed to expose metrics might not be running, or it’s not listening on the expected port.
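Check: On the target machine itself, check if the process is running and listening on the correct port. Use ss -tulnp | grep 9090 (replace 9090 with your target’s port). You should see a line indicating a process is listening on 0.0.0.0:9090 or <specific_ip>:9090. If the only listener is on 127.0.0.1:9090, the service accepts local connections only, and Prometheus on another host will get "Connection refused."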
Fix:
- Start the Service: If the process isn’t running, start it using its service manager (e.g., systemctl start my-app.service or docker start my-container).
- Configure Listening Address: If the application is running but not listening on the correct IP address or port, you’ll need to adjust its configuration. This varies greatly by application. For example, an exporter might be started with the wrong listen address (many exporters take a --web.listen-address flag), or a custom application might need its port setting changed in an environment variable or config file.
Why it works: Prometheus can only scrape metrics from a service that is actively running and bound to a network port.
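A plain grep for the port also matches longer port numbers (9090 matches 19090) and does not tell you which address the listener is bound to. The sketch below is one way to make that check explicit. It assumes the output format of ss -tln from iproute2, where the local address is the fourth column, and the function name listen_scope is hypothetical.

```shell
# Report how a TCP port is bound, given `ss -tln` output on stdin.
# Prints "none", "loopback" (127.0.0.1 or [::1] only), or "non-loopback".
# $1 = port number
listen_scope() {
  awk -v port="$1" '
    {
      addr = $4                       # local address:port column
      pos = match(addr, /:[0-9]+$/)   # position of the colon before the port
      if (pos == 0) next              # header line or unexpected format
      if (substr(addr, pos + 1) != port) next
      host = substr(addr, 1, pos - 1)
      if (host == "127.0.0.1" || host == "[::1]") {
        if (scope == "") scope = "loopback"
      } else {
        scope = "non-loopback"
      }
    }
    END { print (scope == "" ? "none" : scope) }
  '
}

# Usage on the target machine:
#   ss -tln | listen_scope 9090
```

A result of loopback means the service is running but will refuse scrapes from any other host. A result of none means nothing is listening on that port.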
3. Prometheus Configuration Errors
Diagnosis: The Prometheus configuration (prometheus.yml) might have typos, incorrect IP addresses, ports, or scrape paths.
Check: Carefully review the scrape_configs section in your prometheus.yml. Pay close attention to the static_configs (if used) or the service discovery configuration. Ensure the targets list contains the correct IP addresses/hostnames and ports for your services.
Example prometheus.yml snippet:
scrape_configs:
  - job_name: 'my-application'
    static_configs:
      - targets: ['192.168.1.100:9090']
        labels:
          env: 'production'
Fix: Correct any typos, incorrect IP addresses, ports, or hostnames in the prometheus.yml file. After modifying it, reload the Prometheus configuration by sending a SIGHUP signal to the Prometheus process or by making an HTTP POST request to the /-/reload endpoint: curl -X POST http://localhost:9090/-/reload. The reload endpoint only works if Prometheus was started with the --web.enable-lifecycle flag.
Why it works: Prometheus uses this configuration file to know where and how to scrape targets. An error here means it’s looking in the wrong place or with the wrong parameters.
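Reloading a broken file leaves Prometheus running on the old configuration, so it helps to validate before reloading. The sketch below assumes promtool (shipped with Prometheus) is on the PATH and that the lifecycle endpoint is enabled. The function name reload_if_valid and the default URL are illustrative.

```shell
# Validate a Prometheus config file and only then ask the server to reload it.
# $1 = path to prometheus.yml
# $2 = Prometheus base URL (optional, defaults to http://localhost:9090)
reload_if_valid() {
  config="$1"
  base_url="${2:-http://localhost:9090}"
  if ! promtool check config "$config"; then
    echo "config invalid, not reloading" >&2
    return 1
  fi
  # --fail makes curl exit non-zero when the server answers with an HTTP error.
  curl --silent --fail -X POST "$base_url/-/reload"
}

# Usage:
#   reload_if_valid /etc/prometheus/prometheus.yml
```

promtool check config catches syntax errors and invalid fields, but it cannot tell you that an address in targets is simply the wrong host, so the manual review is still needed.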
4. Target Metrics Path Incorrect
Diagnosis: Prometheus is connecting to the target, but the /metrics path (or whatever path is configured) is wrong or doesn’t exist on the target.
Check: Again, use curl from the Prometheus server, this time against the exact path Prometheus is configured to scrape, for example curl http://<target_ip>:<target_port>/metrics. If you get a 404 Not Found, or anything other than metrics output, the path is likely incorrect.
Fix:
- Update prometheus.yml: If the metrics path is different (e.g., /probe_metrics or /metrics/v1), update the metrics_path parameter in your prometheus.yml for that job_name.
scrape_configs:
  - job_name: 'my-application'
    metrics_path: /my_custom_metrics
    static_configs:
      - targets: ['192.168.1.100:9090']
- Configure Target Application: If the target application is not exposing metrics at all, you’ll need to configure it to do so. This is application-specific.
Why it works: Prometheus needs to know the exact URL endpoint on the target where metrics are served.
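A wrong path does not always return a 404. Some applications answer every URL with an HTML page and a 200 status. The sketch below is a rough heuristic for telling the two apart. It assumes the target emits # HELP or # TYPE comment lines, which most client libraries do but which the text exposition format does not require. Both function names are hypothetical.

```shell
# Succeed if stdin contains at least one "# HELP" or "# TYPE" line,
# which is typical of Prometheus text exposition output.
looks_like_metrics() {
  grep -Eq '^# (HELP|TYPE) '
}

# Fetch a candidate metrics URL and report whether it serves metrics.
# $1 = full URL including the path, e.g. http://192.168.1.100:9090/metrics
check_metrics_path() {
  if curl --silent --fail --max-time 10 "$1" | looks_like_metrics; then
    echo "metrics served at $1"
  else
    echo "no metrics at $1"
    return 1
  fi
}
```

Running check_metrics_path against each candidate path is a quick way to find the one that belongs in metrics_path.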
5. Target Service Overwhelmed or Crashing
Diagnosis: The target application might be running but is so overloaded that it cannot respond to Prometheus’s scrape requests in time, or it’s crashing repeatedly.
Check:
- Target Logs: Examine the logs of the target application. Look for errors, out-of-memory (OOM) conditions, or repeated restarts.
- Target Resource Usage: Check the CPU, memory, and network utilization on the target machine. High resource usage can indicate it’s struggling to keep up. Use top, htop, or cloud provider monitoring tools.
- Prometheus Scrape Duration: In Prometheus’s own UI (usually http://<prometheus_ip>:9090/targets), look at the scrape duration shown for the failing target. If it’s consistently high or "unknown," the target is slow or unresponsive.
Fix:
- Scale Up/Out: Increase the resources (CPU, RAM) available to the target application or scale out the number of instances if it’s a distributed system.
- Optimize Application: Profile and optimize the target application to reduce its resource consumption or improve its ability to handle load.
- Adjust Scrape Interval/Timeout: If the target is legitimately slow but functional, you can increase Prometheus’s scrape interval (e.g., from 15s to 30s) in prometheus.yml or increase the scrape timeout (scrape_timeout in prometheus.yml, default is 10s) to give it more time. The timeout cannot be longer than the scrape interval, so raise both together if needed. Be cautious with these, as they can mask underlying issues.
Why it works: If the target can’t even respond to a simple HTTP request within a reasonable time, Prometheus will mark it as down. Addressing the target’s performance issues allows it to respond.
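To see whether the target is actually slower than the scrape timeout, you can time a request from the Prometheus server and compare it with the configured value. This is a minimal sketch, assuming curl and awk are available. The function names time_scrape and exceeds_timeout are hypothetical, and the 10 in the usage comment is the default scrape_timeout mentioned above.

```shell
# Print how long one request to the metrics URL takes, in seconds.
# $1 = full metrics URL
time_scrape() {
  curl --silent --output /dev/null --max-time 60 \
    --write-out '%{time_total}' "$1"
}

# Succeed if the elapsed time ($1) is greater than the timeout ($2).
# Both are in seconds; awk is used because POSIX sh has no float arithmetic.
exceeds_timeout() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Usage:
#   elapsed=$(time_scrape http://192.168.1.100:9090/metrics)
#   exceeds_timeout "$elapsed" 10 && echo "slower than scrape_timeout"
```

A single measurement can be misleading on a loaded target, so repeat it a few times before changing the timeout.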
6. TLS/SSL Configuration Issues
Diagnosis: The target serves metrics over HTTPS, but Prometheus is not configured to trust its certificate or is using the wrong TLS settings.
Check:
- Prometheus UI: Navigate to http://<prometheus_ip>:9090/targets. For the failing target, check the "Error" column. It might contain a specific TLS-related error like "x509: certificate signed by unknown authority" or another handshake failure.
- Curl with TLS: From the Prometheus server, try curl -v https://<target_ip>:<target_port>/metrics. The -v flag will show TLS handshake details.
Fix:
- tls_config in prometheus.yml: Ensure the tls_config section for the job is correctly set up.
  - If using self-signed certificates or internal CAs: Specify ca_file so Prometheus can verify the target’s certificate. Add cert_file and key_file only if the target requires a client certificate (mutual TLS). The insecure_skip_verify: true option can be used for testing but is not recommended for production.
  - If the target’s certificate is valid but not trusted by the system Prometheus is running on, you may need to add the CA certificate to the system’s trust store or use ca_file in tls_config.
scrape_configs:
  - job_name: 'my-secure-app'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/prometheus.crt
      key_file: /etc/prometheus/certs/prometheus.key
      # insecure_skip_verify: true # Use with caution!
    static_configs:
      - targets: ['secure-target.example.com:8443']
- Target Certificate Validity: Ensure the target’s certificate is not expired and is valid for the hostname Prometheus is using to connect.
Why it works: TLS requires a successful handshake in which Prometheus verifies the target’s identity (unless verification is skipped) and, with mutual TLS, the target verifies Prometheus’s client certificate. Incorrect configuration prevents this handshake, leading to a connection failure.
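Certificate expiry can be checked from the Prometheus server without touching the Prometheus configuration. The sketch below assumes the openssl command-line tool is installed. The function names fetch_cert and cert_not_expired are hypothetical, and the host and port in the usage comment are the ones from the example configuration.

```shell
# Fetch the certificate a TLS endpoint presents and print it in PEM form.
# $1 = host name, $2 = port
fetch_cert() {
  openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -outform PEM
}

# Succeed if the PEM certificate on stdin has not yet expired.
# -checkend 0 makes openssl exit non-zero for an already expired certificate.
cert_not_expired() {
  openssl x509 -noout -checkend 0 >/dev/null
}

# Usage:
#   fetch_cert secure-target.example.com 8443 | cert_not_expired \
#     || echo "certificate expired or could not be fetched"
```

This only covers expiry. Whether the certificate matches the hostname and chains to a trusted CA is still best judged from the curl -v output or the error shown on the targets page.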
After resolving these, your next alert will likely be about a missing metric or an alert rule evaluating incorrectly, as the system is now collecting data but not necessarily acting on it as you expect.