Alertmanager’s routing is more than a simple switchboard: it is a stateful engine that matches alerts on their labels and decides not only where notifications go but also how alerts are grouped and when notifications are sent.
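Let’s watch Alertmanager in action with a sample alert. Imagine Prometheus is scraping a service, and the rate of 5xx responses recorded in http_requests_total climbs past a threshold. Prometheus fires an alert named HighRequestLatency. (Despite the name, the rule as written watches the 5xx error rate rather than a latency metric.)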
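# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:9090']
rule_files:
  - 'alert_rules.yml'
# Prometheus needs to know where Alertmanager lives before it can deliver alerts.
# Assumption: Alertmanager runs locally on its default port, 9093.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']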
# alert_rules.yml
groups:
  - name: http_alerts
    rules:
      - alert: HighRequestLatency
        expr: rate(http_requests_total{job="my-app", status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High request latency detected on {{ $labels.instance }}"
          description: "The {{ $labels.job }} job on {{ $labels.instance }} is experiencing high latency (5xx errors)."
Once the condition has held for 5 minutes (the for clause), the alert moves from pending to firing and Prometheus sends HighRequestLatency to Alertmanager. Alertmanager receives the alert and walks its routing tree to decide which receivers should be notified.
Here’s a simplified Alertmanager configuration:
# alertmanager.yml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    # match and match_re still work but are deprecated in favor of matchers (Alertmanager 0.22+).
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match_re:
        team: '(backend|frontend)'
      receiver: 'dev-team-notifications'
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://localhost:5001/' # Default fallback
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#critical-alerts'
        api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
  - name: 'dev-team-notifications'
    email_configs:
      - to: 'dev-team@example.com'
        from: 'alertmanager@example.com' # Assumed sender; required unless global smtp_from is set
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'your_smtp_password'
In this configuration:
- The top-level route defines default behavior: group alerts by alertname and job, wait 30 seconds (group_wait) before sending the first notification for a new group, and wait 5 minutes (group_interval) before sending a notification about new alerts within an existing, already notified group. Alerts will repeat every 4 hours (repeat_interval) if they remain active.
- The routes section is where the magic happens. Alertmanager evaluates these sequentially.
  - The first route matches alerts with severity: critical. If an alert matches this, it’s sent to the critical-alerts receiver (which sends to Slack channel #critical-alerts). The continue: true means that even though it matched this route, Alertmanager continues to evaluate subsequent routes.
  - The second route uses match_re (a regex match) to look for alerts where the team label is either backend or frontend. If the alert matches this (and it will, because our HighRequestLatency alert has team: backend), it’s sent to the dev-team-notifications receiver (emailing dev-team@example.com).
Because our HighRequestLatency alert has severity: critical and team: backend, it will be routed to both the critical-alerts Slack channel and the dev-team@example.com email address due to continue: true. If continue were false (the default), it would only go to the first matching receiver.
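The sequential walk with continue can be sketched as a small Python toy model. This is not Alertmanager's implementation, only an illustration of the matching order for the two child routes in this configuration. The names ROUTES, route_matches, and receivers_for are made up for the sketch, and the flag is spelled continue_ because continue is a Python keyword.

```python
import re

# Toy model of the two child routes from alertmanager.yml (illustrative only).
ROUTES = [
    {"match": {"severity": "critical"}, "match_re": {},
     "receiver": "critical-alerts", "continue_": True},
    {"match": {}, "match_re": {"team": "(backend|frontend)"},
     "receiver": "dev-team-notifications", "continue_": False},
]
DEFAULT_RECEIVER = "default-receiver"


def route_matches(route, labels):
    """Return True if every matcher on the route is satisfied by the labels."""
    for name, value in route["match"].items():
        # A missing label is treated as the empty string.
        if labels.get(name, "") != value:
            return False
    for name, pattern in route["match_re"].items():
        # Regex matchers are anchored, so fullmatch is the closest analogue.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def receivers_for(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Walk the child routes in order and collect every receiver that is hit."""
    matched = []
    for route in routes:
        if route_matches(route, labels):
            matched.append(route["receiver"])
            if not route["continue_"]:
                break  # without continue, the first match ends the walk
    # If no child route matched, the alert falls back to the top-level receiver.
    return matched or [default]


alert = {"alertname": "HighRequestLatency", "job": "my-app",
         "severity": "critical", "team": "backend"}
print(receivers_for(alert))  # ['critical-alerts', 'dev-team-notifications']
```

Flipping continue_ to False on the first route makes the same call return only critical-alerts, which mirrors the default behavior described above.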
The group_by directive is crucial. Alertmanager collects alerts that share the same set of labels specified in group_by. For our HighRequestLatency alert, it will be grouped with other alerts having the same alertname and job. This prevents a flood of individual notifications for a single incident affecting multiple instances of the same service. The group_wait ensures that if multiple instances of the same alert fire within that wait period, they are bundled into a single notification.
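To make the grouping concrete, here is a minimal sketch of how alerts could be bucketed by the group_by labels. It is a simplified model, not Alertmanager's code; the instance names and the other-app job are hypothetical values chosen for the illustration.

```python
from collections import defaultdict

GROUP_BY = ["alertname", "job"]  # mirrors group_by in alertmanager.yml


def group_key(labels, group_by=GROUP_BY):
    """Build the grouping key from the values of the group_by labels."""
    return tuple((name, labels.get(name, "")) for name in group_by)


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts so that each bucket would become one notification."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)


# Hypothetical alerts: two instances of my-app and one of another job.
alerts = [
    {"alertname": "HighRequestLatency", "job": "my-app", "instance": "host-a:8080"},
    {"alertname": "HighRequestLatency", "job": "my-app", "instance": "host-b:8080"},
    {"alertname": "HighRequestLatency", "job": "other-app", "instance": "host-c:8080"},
]
groups = group_alerts(alerts)

for key, members in groups.items():
    print(key, [m["instance"] for m in members])
```

The two my-app instances share a key and would be bundled into one notification, while the other-app alert forms its own group because its job label differs.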
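The one thing most people miss is how continue: true interacts with group_by and group_wait. If an alert matches multiple routes with continue: true, it will be sent to all those receivers. However, Alertmanager tracks grouping per matching route, not globally: each route the alert lands on keeps its own aggregation groups and runs its own group_wait and group_interval timers, using the values inherited from the parent route unless that route overrides them. So, if the same alert is routed to two different receivers, the two notifications can go out at different times, and the alert can be batched with different neighbors on each route, depending on which other alerts matched that route. This can produce notification patterns that are hard to reason about unless each route’s grouping is planned deliberately.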
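The timing difference can be illustrated with a deliberately simplified sketch. It assumes that a group's first notification fires group_wait seconds after the group is opened, and it ignores group_interval, repeat_interval, and the fallback receiver. The earlier alert with team: platform is hypothetical, invented only to open the Slack group ahead of our alert.

```python
GROUP_BY = ("alertname", "job")
GROUP_WAIT = 30  # seconds, as in alertmanager.yml

# (receiver, predicate over labels, continue flag) for the two child routes.
ROUTES = [
    ("critical-alerts", lambda l: l.get("severity") == "critical", True),
    ("dev-team-notifications", lambda l: l.get("team") in ("backend", "frontend"), False),
]


def first_notification_times(arrivals):
    """arrivals: list of (arrival_seconds, labels), sorted by time.

    Returns {(receiver, group_values): time of the first notification}.
    """
    opened = {}
    for t, labels in arrivals:
        for receiver, matches, keep_going in ROUTES:
            if not matches(labels):
                continue
            key = (receiver, tuple(labels.get(n, "") for n in GROUP_BY))
            opened.setdefault(key, t)  # a group opens when its first alert arrives
            if not keep_going:
                break
    return {key: t0 + GROUP_WAIT for key, t0 in opened.items()}


arrivals = [
    # Hypothetical earlier alert that matches only the severity route.
    (0, {"alertname": "HighRequestLatency", "job": "my-app",
         "severity": "critical", "team": "platform"}),
    # Our alert, arriving 20 seconds later and matching both routes.
    (20, {"alertname": "HighRequestLatency", "job": "my-app",
          "severity": "critical", "team": "backend"}),
]
times = first_notification_times(arrivals)

for (receiver, group), when in times.items():
    print(f"{receiver}: first notification at t={when}s for group {group}")
```

Under these assumptions the Slack group was already open when our alert arrived, so its notification goes out at t=30s and carries both alerts, while the email group only opens at t=20s and notifies at t=50s with our alert alone.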
The next thing you’ll run into is managing notification inhibition, where one alert can suppress another.