Prometheus Dead Man’s Snitch isn’t a feature. It’s a clever workaround for a fundamental problem: how do you know if your monitoring system itself has stopped monitoring?
Let’s watch it in action. Imagine Prometheus is happily scraping metrics from your services. You’ve got alerts set up for your apps, but what if Prometheus itself chokes? This is where Dead Man’s Snitch comes in. It’s a simple HTTP endpoint that expects to be pinged regularly. If it doesn’t get that ping within a configurable timeout, it fires off an alert, usually to Slack or email, telling you Prometheus is MIA.
Here’s the core idea: Prometheus, the system designed to alert you when things go wrong, needs its own watchdog.
Here’s how you set it up.
First, you need a Dead Man’s Snitch service. You can use a hosted service like the one at deadmanssnitch.com, or run your own. For this example, we’ll assume you’re using the hosted service. You’ll get a unique URL for each "snitch" you create.
https://nosnch.com/D34DM4N5N1TCH-XYZ123
Next, you need Prometheus to ping this URL. A tempting first attempt is an alerting rule built on the up metric, which Prometheus automatically generates for every target it scrapes. up == 1 means the most recent scrape of that target succeeded; up == 0 means it failed.
In your Prometheus configuration, specifically in a .rules.yml file that Prometheus loads, you’ll add a rule like this:
groups:
  - name: deadmanssnitch
    rules:
      - alert: PrometheusDown
        # absent() returns 1 when no prometheus target reports up == 1
        expr: |
          absent(up{job="prometheus"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus has not been scraped successfully in the last 5 minutes."
          description: "Prometheus itself is down or unable to scrape targets."
Wait, that rule fires when up is not 1, and it is evaluated by the very Prometheus it is meant to watch. If Prometheus is down, nothing evaluates the rule and nothing fires. That’s the wrong direction for a Dead Man’s Snitch. The point of Dead Man’s Snitch is that it expects a regular signal. If it doesn’t get it, it alerts.
What we need instead is an outgoing signal that exists only while Prometheus is running and healthy. Scraping fits: Prometheus scrapes its targets on a fixed interval for as long as it is alive, and stops the moment it isn’t.
Here’s the actual pattern. Alerting rules cannot make HTTP requests on their own, so the signal is sent by an exporter that Prometheus scrapes: each scrape makes the exporter ping the Dead Man’s Snitch.
Let’s use the blackbox_exporter for this. The blackbox_exporter can probe any HTTP endpoint. We’ll configure it to probe our Dead Man’s Snitch URL.
First, configure blackbox_exporter’s blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # Defaults to 2xx
      method: GET
Now, tell Prometheus to scrape the blackbox_exporter and have the blackbox_exporter probe your snitch URL. Add this to your prometheus.yml’s scrape_configs:
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://nosnch.com/D34DM4N5N1TCH-XYZ123'  # Your Dead Man's Snitch URL
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.your-namespace.svc:9115  # Address of your blackbox exporter
This tells Prometheus to send its scrape requests to the blackbox_exporter at blackbox-exporter.your-namespace.svc:9115 rather than to the snitch directly. Each time it is scraped, the blackbox_exporter probes the URL passed in __param_target, which we’ve set to our Dead Man’s Snitch URL, so every scrape doubles as a ping. Two metrics come back. up for the blackbox job is 1 when Prometheus managed to scrape the exporter, and probe_success is 1 when the exporter’s request to the snitch URL succeeded.
Now, we create an alert in Prometheus that fires if either of those metrics is 0.
groups:
  - name: deadmanssnitch
    rules:
      - alert: PrometheusBlackboxDown
        expr: |
          up{job="blackbox"} == 0 or probe_success{job="blackbox"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus is unable to ping its Dead Man's Snitch."
          description: "The blackbox exporter, tasked with pinging the Dead Man's Snitch, is not reporting a successful probe. Either Prometheus cannot reach the blackbox exporter, or the blackbox exporter cannot reach the snitch."
The for: 10m is crucial. It means the alert will only fire if the condition persists for 10 minutes, preventing flapping alerts due to transient network issues.
The reason this works is that the ping only happens as a side effect of Prometheus scraping. If Prometheus itself dies, nothing scrapes the blackbox_exporter, no probe is sent, and the snitch goes quiet. Prometheus cannot raise PrometheusBlackboxDown in that state, so the notification comes from the Dead Man’s Snitch service once its interval lapses. If the blackbox_exporter dies, the scrape fails and up for the blackbox job drops to 0. If the blackbox_exporter is running but can’t reach the Dead Man’s Snitch URL, probe_success drops to 0. In those last two scenarios Prometheus is still alive, so PrometheusBlackboxDown fires, and the snitch stops receiving pings as well.
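The alert does nothing until Prometheus loads the file that contains it. A minimal sketch of the relevant prometheus.yml stanza, assuming the rules live in a file named deadmanssnitch.rules.yml next to prometheus.yml (the filename is a placeholder, not something this setup prescribes):

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical filename. Relative paths are resolved against the directory
  # that holds prometheus.yml, and glob patterns such as "*.rules.yml" also work.
  - "deadmanssnitch.rules.yml"
```

Running promtool check rules against the file before reloading Prometheus should catch indentation and PromQL mistakes early.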
The most surprising thing is that the blackbox_exporter, a tool built for probing other services, ends up carrying Prometheus’s own heartbeat. You’re using an outward-facing probe to monitor your internal monitoring system, which feels a bit like using a mirror to check if your eyes are open.
The actual mechanism by which the blackbox_exporter pings the snitch is a standard HTTP GET request. The nosnch.com service simply expects this GET request to arrive within the interval configured for that snitch. If no request hits the endpoint in time, nosnch.com assumes the client (in this case, the blackbox_exporter acting on Prometheus’s behalf) has gone silent and sends its own notification. The Prometheus alert and the snitch therefore cover different failures: the Prometheus alert can only fire while Prometheus is still running, whereas the snitch’s notification is the only one you will receive when Prometheus is down entirely.
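Because each scrape of the blackbox job produces one ping, the job’s scrape interval has to be comfortably shorter than the snitch’s check-in interval. A sketch of pinning it on the job, assuming a 1-minute interval and a 10-second timeout (both illustrative values, not taken from the setup above):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'blackbox'
    scrape_interval: 1m   # assumed value: one ping to the snitch per minute
    scrape_timeout: 10s   # assumed value: longer than the module's 5s probe timeout
    metrics_path: /probe
    # params, static_configs, and relabel_configs stay as in the blackbox job shown earlier
```

Setting the interval per job keeps the heartbeat cadence independent of whatever global scrape_interval the rest of the configuration uses.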
The next thing you’ll want to monitor is the health of the blackbox_exporter itself, independently of its ability to ping the snitch.
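One way to do that is to scrape the exporter’s own /metrics endpoint as an ordinary target and alert on its up series. The block below shows two separate fragments, one for prometheus.yml and one for a rules file. It is a sketch that assumes a job named blackbox_exporter and an alert named BlackboxExporterDown (both names are illustrative) and reuses the exporter address from the probe job:

```yaml
# Fragment for prometheus.yml: scrape the exporter itself, not a probe target.
scrape_configs:
  - job_name: 'blackbox_exporter'   # illustrative job name
    static_configs:
      - targets:
          - blackbox-exporter.your-namespace.svc:9115
---
# Fragment for a rules file loaded by Prometheus.
groups:
  - name: blackbox_exporter
    rules:
      - alert: BlackboxExporterDown   # illustrative alert name
        expr: up{job="blackbox_exporter"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "The blackbox exporter is not responding to scrapes."
```

With this in place, a failing probe and a dead exporter show up as different alerts, which makes it easier to tell which link in the chain broke.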