Prometheus flapping alerts aren’t actually a problem with Prometheus itself, but rather a symptom of your alerting rules being too sensitive to transient network or application hiccups.

Here’s how to tame that notification storm.

The Problem: Transient Issues Triggering Alerts

Your alerting rules probably fire too soon after a condition is first met, without waiting to see whether it persists. This means brief, self-correcting issues – a momentary network blip, a pod restarting, a temporary spike in latency – are enough to trigger an alert, clear it, and trigger it again moments later. This "flapping" is noise, not signal, and it erodes trust in your alerting system.

Common Causes and Fixes

  1. Alerts Firing Too Soon (Low for Duration)

    • Diagnosis: Look at your alerting_rules.yml (or wherever your rules are defined). Most alerts have a for clause. If this is 0m or 1m, it’s very sensitive.
    • Cause: A rule with for: 0m (or no for at all) fires on the first evaluation in which latency is above the threshold.
    • Fix: Increase the for duration. For critical services, 5m or 10m is often a good starting point. For less critical ones, 15m or even 30m.
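      groups:
        - name: latency
          rules:
            - alert: HighLatency
              # Fraction of requests that completed within 10s over the last 5 minutes
              expr: |
                sum(rate(http_request_duration_seconds_bucket{le="10"}[5m]))
                  / sum(rate(http_request_duration_seconds_count[5m])) < 0.95
              for: 5m  # Wait for 5 minutes of sustained high latency
              labels:
                severity: warning
              annotations:
                summary: "High request latency detected"
                description: "More than 5% of requests have taken longer than 10s for the last 5 minutes."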
      
    • Why it works: This tells Prometheus, "Don’t bother me unless this condition persists for at least 5 minutes." It filters out those fleeting spikes.
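To make the effect of the for duration concrete, here is a minimal Python sketch of the pending-then-firing behaviour. It is a simplified model, not Prometheus code: it assumes one rule evaluation per minute, and the names firing_states and count_notifications are made up for this illustration.

```python
def firing_states(condition, for_evals):
    """Return, per evaluation, whether the alert is firing.

    condition: list of booleans, one per rule evaluation.
    for_evals: number of evaluation intervals the condition must already
               have held (a stand-in for the rule's `for` duration).
    """
    states = []
    active_since = None  # index where the current run of True values started
    for i, cond in enumerate(condition):
        if not cond:
            active_since = None  # condition cleared: pending state is dropped
            states.append(False)
            continue
        if active_since is None:
            active_since = i
        states.append(i - active_since >= for_evals)
    return states


def count_notifications(states):
    """Count transitions from not firing to firing."""
    return sum(1 for prev, cur in zip([False] + states, states) if cur and not prev)


# One evaluation per minute: two short blips, then a sustained problem.
condition = [False, True, False, False, True, True, False] + [True] * 8

print(count_notifications(firing_states(condition, for_evals=0)))  # 3
print(count_notifications(firing_states(condition, for_evals=5)))  # 1
```

With no waiting period, each blip becomes its own alert. With a five-evaluation wait, only the sustained run fires.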
  2. Alerts Clearing Too Soon (No keep_firing_for)

    • Diagnosis: The alert fires, resolves, and fires again within minutes because the expression briefly dips back under the threshold while the underlying issue is still there.
    • Cause: By default, Prometheus resolves an alert at the first evaluation in which the condition is no longer met. If the condition is met, then not met, then met again within a short period, you get flapping.
    • Fix: On Prometheus 2.42 or later, add keep_firing_for to the rule (for example keep_firing_for: 10m alongside for: 5m) so the alert stays firing for that long after the condition clears. On older versions, make the expression itself less jumpy, for example with a longer range window as described in cause 4.
    • Why it works: for only delays firing; it does nothing to delay resolution. keep_firing_for covers the other direction, so a brief recovery inside the hold period does not resolve the alert and re-open it moments later.
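The difference between delaying firing and delaying resolution can be sketched in a few lines of Python. This is a rough model under stated assumptions, not Prometheus code: one evaluation per minute, and a hold period that loosely imitates the keep_firing_for rule field available in Prometheus 2.42 and later. The names with_hold and count_resolves are hypothetical.

```python
def with_hold(condition, hold_evals):
    """Keep the alert active for `hold_evals` evaluations after the
    condition was last true (a rough stand-in for keep_firing_for)."""
    states = []
    last_true = None  # index of the most recent evaluation where the condition held
    for i, cond in enumerate(condition):
        if cond:
            last_true = i
        states.append(last_true is not None and i - last_true <= hold_evals)
    return states


def count_resolves(states):
    """Count transitions from active to inactive."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev and not cur)


# A problem that briefly looks healthy twice before it really goes away.
condition = [True, True, False, True, False, False,
             True, True, False, False, False, False]

print(count_resolves(with_hold(condition, hold_evals=0)))  # 3
print(count_resolves(with_hold(condition, hold_evals=3)))  # 1
```

Without a hold period the alert resolves three times. With a hold of three evaluations it resolves once, after the condition has stayed clear.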
  3. Thresholds Too Close to Normal Operation

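    • Diagnosis: Examine the thresholds in your alert expressions. Are they only slightly above or below normal operating values?
    • Cause: If your CPU usage is normally 70% and you alert when the idle fraction drops below 0.3, for example avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.3 (meaning >70% usage), a temporary spike to 75% might trigger an alert that quickly resolves. Note that node_cpu_seconds_total is a counter, so it has to be wrapped in rate() before it can be compared with a fraction.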
    • Fix: Widen the gap between your normal operating range and your alert threshold. For example, if normal CPU is 70%, you might want to alert when it hits 90% or 95%, not 71%.
      groups:
        - name: cpu
          rules:
            - alert: HighCPU
              expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
              for: 10m  # Only alert if CPU is above 90% for a sustained 10 minutes
              labels:
                severity: critical
              annotations:
                summary: "High CPU utilization"
                description: "CPU usage on {{ $labels.instance }} has been above 90% for the last 10 minutes."
      
    • Why it works: This introduces a buffer zone, ensuring that only genuinely problematic, sustained deviations from normal behavior trigger alerts.
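As a small illustration of the buffer zone, the Python sketch below counts how often a series crosses a threshold. The CPU samples are invented for the example, and crossings is a hypothetical helper rather than anything Prometheus provides.

```python
# Invented CPU utilisation samples (%): normal load hovering around 70%,
# followed by a genuine incident above 90%.
cpu = [68, 72, 69, 71, 70, 73, 69, 74, 70, 92, 94, 93, 95, 91, 72]


def crossings(samples, threshold):
    """Number of times the series goes from at-or-below to above the threshold."""
    above = [s > threshold for s in samples]
    return sum(1 for prev, cur in zip([False] + above, above) if cur and not prev)


print(crossings(cpu, 71))  # 4
print(crossings(cpu, 90))  # 1
```

A threshold that sits just above normal load is crossed by ordinary jitter. The 90% threshold is crossed only by the incident.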
  4. rate() or delta() Window Too Small

    • Diagnosis: Alerts based on metrics like rate() or delta() might be using a very short lookback window.
    • Cause: rate(http_requests_total[1m]) calculates the per-second rate over the last minute only. With a window that short, a brief burst of requests dominates the average, so the rate jumps above the threshold and drops back as soon as the burst leaves the window.
    • Fix: Increase the lookback window for your rate() or delta() functions. 5m or 10m is often more appropriate than 1m or 30s for detecting sustained changes.
      groups:
        - name: traffic
          rules:
            - alert: HighRequestRate
              expr: rate(http_requests_total[5m]) > 1000
              for: 5m  # Alert if the 5-minute average rate stays above 1000 requests/sec for 5 minutes
              labels:
                severity: warning
              annotations:
                summary: "High request rate detected"
                description: "The average request rate has been over 1000/sec for the last 5 minutes."
      
    • Why it works: A longer window smooths out short-term fluctuations, providing a more stable measure of the recent trend.
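The smoothing effect can be sketched with a plain moving average in Python. This only approximates what rate() does (it ignores counter resets and extrapolation), it assumes one sample per minute, and the traffic numbers and the windowed_rate helper are invented for the illustration.

```python
def windowed_rate(per_minute_rates, window):
    """Average of the last `window` per-minute rates at each step, as a
    rough stand-in for rate(counter[window]) evaluated once a minute."""
    out = []
    for i in range(len(per_minute_rates)):
        chunk = per_minute_rates[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Invented traffic (requests/sec): steady load with two one-minute bursts.
rates = [400, 400, 1500, 400, 400, 400, 1400, 400, 400, 400]
threshold = 1000

narrow = [r > threshold for r in windowed_rate(rates, 1)]
wide = [r > threshold for r in windowed_rate(rates, 5)]

print(sum(narrow), sum(wide))  # 2 0
```

With a one-minute window both bursts exceed the threshold. Averaged over five minutes, neither does.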
  5. Alerting on Metrics with High Cardinality or Volatility

    • Diagnosis: Are you alerting on metrics that have many unique label combinations (high cardinality) or metrics that naturally fluctuate wildly?
    • Cause: For example, alerting per series on http_requests_total when the handler label embeds the user ID (handler="/user/12345" rather than a route template such as handler="/user/{id}") can create a separate alert for every user, which is usually undesirable. Alerting on a highly volatile metric such as network packet loss can be just as noisy.
    • Fix:
      • Reduce Cardinality: Use aggregation in your alert. Instead of alerting on individual handlers, alert on the aggregate for the service.
        groups:
          - name: traffic
            rules:
              - alert: HighServiceRequestRate
                expr: sum by (job) (rate(http_requests_total[5m])) > 10000
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "High request rate for service"
                  description: "The total request rate for job {{ $labels.job }} has been over 10000/sec for the last 5 minutes."
        
      • Choose Stable Metrics: If a metric is inherently volatile, consider if it’s the right one to alert on, or if you need to derive a more stable indicator from it (e.g., a rolling average, or a metric that represents a longer-term trend).
    • Why it works: Aggregation reduces the number of individual alerts. Choosing stable metrics or deriving stable indicators means you’re reacting to meaningful, sustained changes, not transient noise.
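The effect of aggregating before comparing can be sketched in Python. The series, label values, and thresholds below are made up for the example, and sum_by_job is a hypothetical helper that mimics sum by (job) (...).

```python
from collections import defaultdict

# Invented per-series request rates (req/s), keyed by (job, handler).
series = {
    ("api", "/user/1"): 120.0,
    ("api", "/user/2"): 130.0,
    ("api", "/user/3"): 90.0,
    ("api", "/user/4"): 110.0,
    ("web", "/"): 40.0,
    ("web", "/about"): 20.0,
}


def sum_by_job(series):
    """Collapse all series of a job into one total, like sum by (job)."""
    totals = defaultdict(float)
    for (job, _handler), value in series.items():
        totals[job] += value
    return dict(totals)


# Alerting per series: one alert for every handler over its threshold.
per_series_alerts = [key for key, value in series.items() if value > 100]

# Alerting on the aggregate: at most one alert per job.
per_job_alerts = [job for job, total in sum_by_job(series).items() if total > 400]

print(len(per_series_alerts), per_job_alerts)  # 3 ['api']
```

Three per-handler alerts collapse into a single alert for the api job, and the number of alerts no longer grows with the number of label values.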
  6. Alertmanager Configuration Issues (Less Common for Flapping, More for Routing/Grouping)

    • Diagnosis: Alertmanager does not cause flapping by itself, but very short group_wait or group_interval values can make it look worse, because more of the individual fire and resolve transitions end up as separate notifications.
    • Cause: When group_wait is very short, the first notification for a group goes out before related alerts have arrived. A short group_interval then sends a follow-up for each subsequent change to the group.
    • Fix: Set group_wait long enough to collect related alerts before the first notification (30s is a common starting point, up to a few minutes for noisy groups), and keep group_interval in the range of minutes.
      global:
        resolve_timeout: 5m
      
      route:
        group_by: ['alertname', 'job']
        group_wait: 30s     # Wait 30s for more alerts before sending a notification group
        group_interval: 5m  # Wait 5m before notifying about new alerts added to an already-notified group
        repeat_interval: 4h
        receiver: 'default-receiver'
      
      receivers:
      - name: 'default-receiver'
        webhook_configs:
        - url: 'http://your-webhook-receiver:9094'
      
    • Why it works: group_wait ensures that Alertmanager doesn’t send out notifications for a small, incomplete set of alerts, allowing related alerts to be bundled together for a more coherent notification.
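The bundling that group_wait provides can be sketched as a simple time window in Python. This is a simplified model of one group's first notification, not Alertmanager's implementation: the arrival times and instance names are invented, and initial_batch is a hypothetical helper.

```python
def initial_batch(arrivals, group_wait):
    """Alerts included in a group's first notification.

    arrivals: list of (seconds, instance) tuples for one group, sorted by time.
    group_wait: seconds to wait after the first alert before notifying.
    """
    if not arrivals:
        return []
    first = arrivals[0][0]
    return [instance for t, instance in arrivals if t <= first + group_wait]


# The same alert firing on several instances within a short time span.
arrivals = [(0, "node-1"), (10, "node-2"), (25, "node-3"), (90, "node-4")]

print(initial_batch(arrivals, group_wait=0))   # ['node-1']
print(initial_batch(arrivals, group_wait=30))  # ['node-1', 'node-2', 'node-3']
```

With no wait, the first notification covers a single instance and the others follow separately. With a 30-second wait, the three alerts that arrive close together are sent as one notification.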

The Next Hurdle: Alert Fatigue

Once you’ve tamed flapping, the next challenge is "alert fatigue"—having too many distinct, non-flapping alerts. This often leads to people ignoring alerts altogether. The solution involves rigorous pruning of noisy alerts, better grouping and routing in Alertmanager, and refining your alerting strategy to focus on actionable incidents, not just symptoms.
