On-Call Alert Fatigue from Monitors That Fire on Symptoms Instead of Customer Impact

devtools
Monitoring systems fire alerts on infrastructure-level symptoms (CPU > 80%, disk > 90%, pod restart count > 3) rather than actual customer-facing impact, generating dozens of non-actionable pages per on-call shift.

So what? On-call engineers spend 20-40 minutes triaging each alert only to discover no users were affected, consuming hours of cognitive energy on false positives.

So what? Engineers begin ignoring or snoozing alerts reflexively, developing 'alert blindness' where genuine incidents get the same dismissive response as noise.

So what? When a real customer-impacting incident occurs, response time degrades from minutes to tens of minutes because the signal is buried in noise, extending outage duration.

So what? Extended outages erode customer trust, trigger SLA violations with financial penalties, and create executive-level pressure on engineering leadership.

So what? Leadership responds by adding more monitors and stricter escalation policies, which paradoxically increases noise further, creating a vicious cycle that burns out on-call engineers and drives attrition.

The structural root cause is that monitoring systems are configured bottom-up from infrastructure metrics rather than top-down from service-level objectives (SLOs), because defining SLOs requires cross-functional agreement on what 'healthy' means for each user journey, which most organizations never formalize.
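To make the contrast concrete, here is a minimal Python sketch of SLO-based burn-rate alerting, the top-down approach, as opposed to paging on a raw CPU threshold. The burn-rate threshold and window pairing follow the commonly cited SRE Workbook example (14.4x over 1h/5m windows); the function names and the source of the request counts are hypothetical, so treat this as an illustration rather than a drop-in rule.

```python
# Sketch: page on error-budget burn rate (customer impact), not on CPU.
# Assumes a 99.9% success SLO over a 30-day window; request counts would
# come from your metrics backend (Prometheus, Datadog, etc.).

SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail


def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being consumed.

    1.0 means the budget is exhausted exactly at the end of the 30-day
    window; 14.4 means a 30-day budget burns in roughly two days.
    """
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET


def should_page(short_window: tuple[int, int], long_window: tuple[int, int]) -> bool:
    """Multiwindow check: page only when both a short window (fast signal)
    and a long window (sustained impact) exceed the burn-rate threshold,
    so a brief spike that users never notice does not page anyone."""
    FAST_BURN = 14.4  # SRE Workbook's suggested threshold for the 1h/5m pair
    return (burn_rate(*short_window) >= FAST_BURN
            and burn_rate(*long_window) >= FAST_BURN)


# Example: 120 failures out of 100,000 requests in the last 5 minutes,
# 900 failures out of 1,200,000 in the last hour. The host might be at
# 95% CPU, but users are well within the error budget, so nobody is paged.
if should_page((120, 100_000), (900, 1_200_000)):
    print("PAGE: error budget is burning fast; users are affected")
else:
    print("No page: within budget; file a ticket if it persists")
```

The point of the multiwindow condition is precisely the fix to the fatigue loop described above: an alert fires only when the customer-facing success rate, not a machine-level metric, is degrading fast enough to threaten the SLO.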

Evidence

PagerDuty's 2023 State of Digital Operations report found that 49% of alerts are non-actionable noise. Google's SRE book dedicates an entire chapter to alert philosophy, advocating for SLO-based alerting over threshold-based symptom alerts. Datadog's survey data shows the median on-call engineer handles 5-15 alerts per shift, with fewer than 30% requiring action. Burnout studies in DevOps (DORA reports) correlate high alert volume with lower deployment frequency and higher change failure rates.
