Prometheus flapping alerts aren’t actually a problem with Prometheus itself, but rather a symptom of your alerting rules being too sensitive to transient network or application hiccups.
Here’s how to tame that notification storm.
The Problem: Transient Issues Triggering Alerts
Your alerting rules probably fire as soon as a condition is met instead of waiting for it to persist. As a result, brief, self-correcting issues (a momentary network blip, a pod restarting, a temporary spike in latency) are enough to trigger an alert, clear it, and trigger it again moments later. This "flapping" is noise, not signal, and it erodes trust in your alerting system.
Common Causes and Fixes
- Alerts Firing Too Soon (Low `for` Duration)
  - Diagnosis: Look at your `alerting_rules.yml` (or wherever your rules are defined). Most alerts have a `for` clause. If it is `0m` or `1m`, the rule is very sensitive.
  - Cause: A rule such as `HighLatency` with `for: 0m` fires the instant latency goes above the threshold.
  - Fix: Increase the `for` duration. For critical services, `5m` or `10m` is often a good starting point. For less critical ones, `15m` or even `30m` can work.

        groups:
          - name: latency
            rules:
              - alert: HighLatency
                # Fraction of requests that completed within 10s over the last 5 minutes
                expr: |
                  sum(rate(http_request_duration_seconds_bucket{le="10"}[5m]))
                    /
                  sum(rate(http_request_duration_seconds_count[5m])) < 0.95
                for: 5m  # Wait for 5 minutes of sustained high latency
                labels:
                  severity: warning
                annotations:
                  summary: "High request latency detected"
                  description: "More than 5% of requests have taken longer than 10s over the last 5 minutes."

  - Why it works: This tells Prometheus, "Don’t bother me unless this condition persists for at least 5 minutes." It filters out those fleeting spikes.
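One way to check that a `for` duration really filters out a short blip is a `promtool test rules` unit test. The sketch below is only an illustration under stated assumptions: it assumes a hypothetical rule file `flap_rules.yml` containing an `InstanceDown` alert (`expr: up == 0`, `for: 5m`, label `severity: warning`, no annotations). The file names, job name, and instance names are made up.

```yaml
# flap_test.yml (hypothetical); run with: promtool test rules flap_test.yml
# Assumes flap_rules.yml defines:
#   - alert: InstanceDown
#     expr: up == 0
#     for: 5m
#     labels:
#       severity: warning
rule_files:
  - flap_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Two-minute blip: down at 2m and 3m, healthy again from 4m.
      - series: 'up{job="api", instance="blip:9100"}'
        values: '1 1 0 0 1 1 1 1 1 1 1'
      # Sustained outage: down from 1m onward.
      - series: 'up{job="api", instance="down:9100"}'
        values: '1 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 4m the blip has cleared and the outage is still pending: nothing fires.
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts: []
      # At 8m only the sustained outage has outlasted the 5m `for` window.
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api
              instance: "down:9100"
```

The blip never produces a firing alert because the pending timer resets as soon as `up` returns to 1, while the sustained outage fires once it has been continuously down for 5 minutes.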
- Alerts Clearing Too Soon (No `keep_firing_for`)
  - Diagnosis: Flapping alerts often resolve the moment the condition stops being true, even though the underlying issue is still being worked on.
  - Cause: By default, Prometheus resolves an alert at the first evaluation where the condition is no longer met. If the condition is met, then not met, then met again within a short period, you get flapping.
  - Fix: On Prometheus 2.42 or later, add `keep_firing_for` to the rule so the alert stays firing for a set time after the condition clears. Older versions have no rule-level guard against early resolution. There, the practical options are the ones covered in the other items of this list: a threshold with more headroom, a longer query window, or Alertmanager grouping.
  - Why it works: `keep_firing_for` bridges short gaps in the condition, so a brief recovery followed by a relapse stays one continuous alert instead of a resolve and a new fire. A longer `for` does not do this. It only delays when an alert starts firing, and its pending timer resets whenever the condition clears.
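As a minimal sketch of holding an alert open through brief recoveries, Prometheus 2.42 and later accept a `keep_firing_for` field on alerting rules. The rule below reuses the `HighLatency` alert name from above. The group name, the ratio expression, and both durations are illustrative assumptions to tune for your own service.

```yaml
groups:
  - name: latency_alerts  # hypothetical group name
    rules:
      - alert: HighLatency
        # Fraction of requests that completed within 10s over the last 5 minutes
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="10"}[5m]))
            /
          sum(rate(http_request_duration_seconds_count[5m])) < 0.95
        for: 5m               # condition must hold for 5m before the alert fires
        keep_firing_for: 10m  # alert stays firing for 10m after the condition clears
        labels:
          severity: warning
```

With these values, a recovery shorter than 10 minutes does not resolve the alert, so a relapse continues the same alert rather than sending a resolve followed by a new notification. The trade-off is that genuine recoveries are reported up to 10 minutes late.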
- Thresholds Too Close to Normal Operation
  - Diagnosis: Examine the thresholds in your rule expressions. Are they only slightly above or below normal operating values?
  - Cause: If your CPU usage is normally 70% and you alert on `avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.3` (meaning more than 70% usage), a temporary spike to 75% triggers an alert that quickly resolves.
  - Fix: Widen the gap between your normal operating range and your alert threshold. For example, if normal CPU is 70%, alert when it hits 90% or 95%, not 71%.

        groups:
          - name: cpu
            rules:
              - alert: HighCPU
                expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
                for: 10m  # Only alert if CPU is above 90% for a sustained 10 minutes
                labels:
                  severity: critical
                annotations:
                  summary: "High CPU utilization"
                  description: "CPU usage on {{ $labels.instance }} has been above 90% for the last 10 minutes."

  - Why it works: This introduces a buffer zone, so only genuinely problematic, sustained deviations from normal behavior trigger alerts.
- `rate()` or `delta()` Window Too Small
  - Diagnosis: Alerts based on functions like `rate()` or `delta()` might be using a very short lookback window.
  - Cause: `rate(http_requests_total[1m])` calculates the per-second rate over only the last minute. A brief burst of requests dominates such a short window, so the value jumps above the threshold and drops back as soon as the burst leaves the window.
  - Fix: Increase the lookback window for your `rate()` or `delta()` functions. `5m` or `10m` is often more appropriate than `1m` or `30s` for detecting sustained changes.

        groups:
          - name: request_rate
            rules:
              - alert: HighRequestRate
                expr: rate(http_requests_total[5m]) > 1000
                for: 5m  # Alert if the average rate over 5 minutes exceeds 1000 requests/sec
                labels:
                  severity: warning
                annotations:
                  summary: "High request rate detected"
                  description: "The average request rate has been over 1000/sec for the last 5 minutes."

  - Why it works: A longer window smooths out short-term fluctuations, providing a more stable measure of the recent trend.
- Alerting on Metrics with High Cardinality or Volatility
  - Diagnosis: Are you alerting on metrics that have many unique label combinations (high cardinality) or metrics that naturally fluctuate wildly?
  - Cause: For example, alerting on `http_requests_total{handler="/user/{id}"}` where the `handler` label carries the actual user ID creates a separate alert for every user, which is usually undesirable. Alerting directly on a highly volatile metric like network packet loss is noisy for a different reason: the value itself keeps crossing the threshold.
  - Fix:
    - Reduce Cardinality: Use aggregation in your alert. Instead of alerting on individual handlers, alert on the aggregate for the service.

          groups:
            - name: service_request_rate
              rules:
                - alert: HighServiceRequestRate
                  expr: sum by (job) (rate(http_requests_total[5m])) > 10000
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "High request rate for service"
                    description: "The total request rate for job {{ $labels.job }} has been over 10000/sec for the last 5 minutes."

    - Choose Stable Metrics: If a metric is inherently volatile, consider whether it’s the right one to alert on, or whether you need to derive a more stable indicator from it (e.g., a rolling average, or a metric that represents a longer-term trend).
  - Why it works: Aggregation reduces the number of individual alerts. Choosing stable metrics or deriving stable indicators means you’re reacting to meaningful, sustained changes, not transient noise.
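A rolling average can be derived with a recording rule and then alerted on in place of the raw series. The sketch below is a hedged illustration only: `packet_loss_ratio` is a hypothetical gauge (0 to 1) standing in for whatever volatile metric you have, and the rule names, the 10-minute window, and the 2% threshold are assumptions.

```yaml
groups:
  - name: smoothed_indicators  # hypothetical group name
    rules:
      # Recording rule: 10-minute rolling average of the volatile gauge.
      - record: instance:packet_loss_ratio:avg10m
        expr: avg_over_time(packet_loss_ratio[10m])
      # Alert on the smoothed series rather than on the raw metric.
      - alert: SustainedPacketLoss
        expr: instance:packet_loss_ratio:avg10m > 0.02
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained packet loss on {{ $labels.instance }}"
```

Rules within a group are evaluated in order, so the alert sees the freshly recorded average. A single bad sample moves a 10-minute average only slightly, which keeps the alert from toggling on every spike.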
- Alertmanager Configuration Issues (Less Common for Flapping, More for Routing/Grouping)
  - Diagnosis: While less direct, a misconfigured Alertmanager `group_wait` or `group_interval` can sometimes look like flapping, because notifications for the same group arrive in quick succession.
  - Cause: With very short `group_wait` and `group_interval` values, Alertmanager sends each change to a group almost immediately, so every fire and resolve of a flapping alert can become its own notification.
  - Fix: Set `group_wait` to a reasonable value, typically between `30s` and `5m`, so that Prometheus can fire multiple related alerts before Alertmanager sends the first notification for the group. Keep `group_interval` at a few minutes.

        global:
          resolve_timeout: 5m
        route:
          group_by: ['alertname', 'job']
          group_wait: 30s      # Wait 30s for more alerts before sending the first notification for a new group
          group_interval: 5m   # Wait 5m before notifying about changes to a group that was already notified
          repeat_interval: 4h  # Re-send the notification for a still-firing group every 4h
          receiver: 'default-receiver'
        receivers:
          - name: 'default-receiver'
            webhook_configs:
              - url: 'http://your-webhook-receiver:9094'

  - Why it works: `group_wait` keeps Alertmanager from notifying on a small, incomplete set of alerts, so related alerts are bundled into one coherent notification. `group_interval` batches later changes to the same group instead of sending them one by one.
The Next Hurdle: Alert Fatigue
Once you’ve tamed flapping, the next challenge is "alert fatigue": having too many distinct, non-flapping alerts. This often leads to people ignoring alerts altogether. The solution involves rigorous pruning of noisy alerts, better grouping and routing in Alertmanager, and refining your alerting strategy to focus on actionable incidents, not just symptoms.