Prometheus Alerting Rules are a surprisingly robust mechanism for detecting and notifying about issues, but their effectiveness hinges entirely on how well you understand their lifecycle and potential pitfalls.
Let’s watch an alert fire. Imagine we’ve got a simple Prometheus setup monitoring a few services. Here’s a rule that fires if a specific service, my_app_instance_1, has been down for more than 5 minutes:

    groups:
      - name: my_app_alerts
        rules:
          - alert: MyAppDown
            expr: up{job="my_app", instance="my_app_instance_1"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "My application instance {{ $labels.instance }} is down."
              description: "The Prometheus agent has not seen 'up' for {{ $labels.instance }} for over 5 minutes."

When my_app_instance_1 stops responding to scrapes (e.g., its exporter crashes or the instance itself goes offline), Prometheus records up as 0 for that target. Prometheus evaluates the rule once per evaluation interval (evaluation_interval, 1 minute by default, though many example configurations set it to 15 seconds). The first evaluation that sees up == 0 moves the alert to Pending. It stays Pending, without notifying anyone, for as long as the for: 5m clause has not been satisfied. Once the condition has held continuously for 5 minutes, Prometheus transitions the alert from Pending to Firing and starts sending it to Alertmanager.
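The evaluation interval and the Alertmanager connection both live in the main Prometheus configuration file. Here is a minimal sketch of the relevant parts; the rule file path and the Alertmanager address are placeholders chosen for illustration, not values from this setup:

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alerting rules are evaluated (1m if unset)

rule_files:
  - "rules/my_app_alerts.yml"   # hypothetical path to the rule group above

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager.example.internal:9093"   # hypothetical address
```

A shorter evaluation interval means the alert enters Pending sooner after a failure, at the cost of more frequent rule evaluation.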
The core problem Prometheus Alerting Rules solve is transforming raw time-series data into actionable notifications. Instead of constantly querying Prometheus for specific conditions, you define these conditions declaratively. Prometheus then handles the ongoing evaluation. Alertmanager, a separate component, receives these alerts and routes them based on labels (like severity) to different receivers (email, Slack, PagerDuty).
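On the Alertmanager side, label-based routing might look roughly like the following. This is a sketch under assumptions: the receiver names, Slack channel, webhook URL, and PagerDuty key are all made-up placeholders, and a real configuration would likely need more tuning (grouping intervals, repeat intervals, and so on).

```yaml
# alertmanager.yml (fragment)
route:
  receiver: ops-slack                    # default receiver for anything not matched below
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"            # matches the label set in the rule definition
      receiver: oncall-pagerduty

receivers:
  - name: ops-slack                      # hypothetical receiver name
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder
        channel: "#ops-alerts"                                   # placeholder
  - name: oncall-pagerduty               # hypothetical receiver name
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_INTEGRATION_KEY"              # placeholder
```

With this layout, a critical alert such as MyAppDown would page the on-call engineer, while everything else falls through to the default Slack receiver.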
Internally, Prometheus tracks state per alert instance, that is, for each label set the rule's expression returns, rather than once per rule. An alert with no matching result is Inactive. When the expression starts returning a result for a label set, that alert enters the Pending state. If the expression keeps returning a result at every evaluation for the duration specified by for, the alert transitions to Firing. If the expression stops returning a result at any point, the alert goes back to Inactive. The Pending state is crucial: it prevents flapping alerts caused by transient network blips or very short-lived service interruptions.
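One way to check these transitions without waiting for a real outage is a rule unit test run with promtool test rules. The sketch below assumes the MyAppDown rule is saved as my_app_alerts.yml next to the test file (a hypothetical file name) and feeds it a synthetic up series sampled once per minute:

```yaml
# my_app_alerts_test.yml, run with: promtool test rules my_app_alerts_test.yml
rule_files:
  - my_app_alerts.yml          # hypothetical file containing the MyAppDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: healthy at 0m and 1m, down from 2m onward.
      - series: 'up{job="my_app", instance="my_app_instance_1"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 5m the target has been down for only 3 minutes: Pending, so nothing is firing.
      - eval_time: 5m
        alertname: MyAppDown
        exp_alerts: []
      # At 8m the condition has held for more than 5 minutes: Firing.
      - eval_time: 8m
        alertname: MyAppDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my_app
              instance: my_app_instance_1
            exp_annotations:
              summary: "My application instance my_app_instance_1 is down."
              description: "The Prometheus agent has not seen 'up' for my_app_instance_1 for over 5 minutes."
```

Note that exp_alerts lists only firing alerts, so an empty list is how the test expresses "Pending or Inactive" at that point in time.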
The levers you control are primarily within the rule definitions:
- expr: The PromQL query that defines the condition. This is where you specify what constitutes a problem.
- for: The duration an alert must be in a Pending state before it becomes Firing. This is your threshold for "serious enough to bother someone."
- labels: Key-value pairs attached to the alert. These are critical for routing in Alertmanager. Common labels include severity, team, service.
- annotations: Additional, often human-readable, information about the alert. This is where you provide context for what the alert means and how to potentially fix it.
Consider this rule for high CPU usage on any host running node_exporter, such as your web servers:

    groups:
      - name: node_alerts
        rules:
          - alert: HighCPUUsage
            expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 85
            for: 10m
            labels:
              severity: warning
              team: ops
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "Instance {{ $labels.instance }} has had CPU usage above 85% for 10 minutes."

This rule uses rate(node_cpu_seconds_total{mode="idle"}[5m]) to compute, for each CPU core, the fraction of time spent idle over the last 5 minutes. avg by (instance) averages that fraction across all cores of an instance, and multiplying by 100 turns it into an idle percentage. Subtracting the result from 100 gives the CPU usage percentage, which is then compared against 85. The for: 10m means the alert only fires if this high usage persists for a full 10 minutes, preventing alerts on temporary spikes. The severity: warning and team: ops labels would then be used by Alertmanager to route this to the operations team’s Slack channel, perhaps with a lower urgency than a critical alert.
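A common misconception is that for works like a forgiving debounce that tolerates brief gaps or adds up time across them. It does neither: the condition must hold at every evaluation for the entire duration. If an alert condition is true for 4 minutes, then false for 1 minute, then true again for 5 minutes, the alert drops back to Inactive during the gap and the for: 5m timer restarts from zero when the condition becomes true the second time. It doesn’t remember the previous 4 minutes, so the alert fires only at the end of the second 5-minute stretch. This is a key distinction for understanding alert persistence and recovery.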
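The reset behaviour can be reproduced with a promtool rule unit test. As a sketch, again assuming the MyAppDown rule lives in a hypothetical my_app_alerts.yml, the series below is down for 4 minutes, up for 1 minute, then down again:

```yaml
# for_reset_test.yml, run with: promtool test rules for_reset_test.yml
rule_files:
  - my_app_alerts.yml          # hypothetical file containing the MyAppDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Samples at 0m..11m: down for 0m-3m, up at 4m, down again from 5m onward.
      - series: 'up{job="my_app", instance="my_app_instance_1"}'
        values: '0 0 0 0 1 0 0 0 0 0 0 0'
    alert_rule_test:
      # 9 minutes after the first failure, but only 4 minutes into the
      # second outage: the timer restarted at 5m, so nothing is firing yet.
      - eval_time: 9m
        alertname: MyAppDown
        exp_alerts: []
      # More than 5 minutes into the second outage: the alert is firing.
      - eval_time: 11m
        alertname: MyAppDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my_app
              instance: my_app_instance_1
            exp_annotations:
              summary: "My application instance my_app_instance_1 is down."
              description: "The Prometheus agent has not seen 'up' for my_app_instance_1 for over 5 minutes."
```

If for accumulated time across the gap, the first check would fail, because the target would already have been down for more than 5 minutes in total by 9m.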
The next logical step after mastering basic alerting rules is understanding how to manage alert states and implement sophisticated routing and inhibition logic within Alertmanager itself.