When Alerts Don’t Mean Downtime - Preventing SRE Fatigue

Published at
9/12/2024
Categories
devops
sre
monitoring
incidentresponse
Author
talonx

Introduction

A recent question in an SRE forum triggered this train of thought.

How do I deal with alerts that are triggered by internal patching/release activities but don't actually cause downtime? If we react to these alerts, we might not have time to react to actual alerts that are affecting customers.

I've paraphrased the question to reflect its essence. There is plenty to unravel here.

My first reaction to this question was that the SRE who posted this is in a difficult place with systemic issues.

Systemic Issues

Without knowing more about the org and their alerting policies, let's look at what we can dig out based on this question alone:

  • Patches/deployments trigger alerts
  • The team does not react to such alerts, to save time for resolving downtime that actually affects customers
  • There is cognitive overhead in selectively reacting to some alerts and ignoring others
  • The knowledge of which alerts to react to is something only the SRE team knows
  • Any MTTx data (MTTA, MTTR and so on) from such a setup are useless, because deliberately ignored alerts distort the acknowledgement and resolution times

The eventual impact is sub-optimal incident management that affects SLAs, and burnout among on-call folks.

Improving the SRE Experience

How would you approach fixing something like this?

Some thoughts, in no particular order:

  • Setting the correct priority for alerts - Anything that affects customer perception of uptime, or can lead to data loss, is a P1. In larger organizations with independent teams responsible for their own microservices, I would extend the definition of customer to any team in your org that depends on your service(s). If you are responsible for an API used by a downstream service, they are your customers too.

  • Zero-downtime deployments - This is not as hard as it sounds if you design your systems with this goal in mind. For stateless web applications it is trivial to switch to a new version behind a load balancer. For stateful applications it can take a bit more work.

  • Maintenance mode - This falls into two categories: maintenance that has to be communicated to the customer, and internal maintenance that affects other teams who consume your service. At the alerting level, you temporarily silence the specific alerts that the rollout will trigger.

  • Investigate all alerts and disable useless ones - Leaving an alert uninvestigated creates uncertainty about whether it matters, and can lead to alert fatigue. The alerting system should be the single source of truth.
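To make the priority and maintenance-mode ideas above more concrete, here is a minimal Python sketch of how they could fit together. Everything in it is hypothetical and not tied to any real alerting tool: the `Alert` and `MaintenanceWindow` types, the `priority` and `route` functions, the "P3" label for non-P1 alerts, and the service and alert names are all assumptions made for illustration. It only silences the alerts that a rollout is expected to trigger, pages on anything customer-facing or data-threatening, and sends everything else to a ticket so that it still gets investigated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    customer_facing: bool   # affects how customers (or dependent teams) perceive uptime
    risks_data_loss: bool
    fired_at: datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str
    alert_names: frozenset  # only the alerts the rollout is expected to trigger
    start: datetime
    end: datetime

    def silences(self, alert: Alert) -> bool:
        """True if this window covers exactly this alert on this service right now."""
        return (
            alert.service == self.service
            and alert.name in self.alert_names
            and self.start <= alert.fired_at < self.end
        )


def priority(alert: Alert) -> str:
    """P1 if customers are affected or data could be lost; "P3" is an assumed label."""
    return "P1" if alert.customer_facing or alert.risks_data_loss else "P3"


def route(alert: Alert, windows: list) -> str:
    """Return "silenced", "page", or "ticket" for an alert."""
    # Silence only alerts that were explicitly listed for an active window.
    if any(w.silences(alert) for w in windows):
        return "silenced"
    # Everything else is routed by priority; nothing is silently dropped.
    return "page" if priority(alert) == "P1" else "ticket"


if __name__ == "__main__":
    start = datetime(2024, 9, 12, 10, 0, tzinfo=timezone.utc)
    rollout = MaintenanceWindow(
        service="billing-api",                       # hypothetical service name
        alert_names=frozenset({"HighRestartCount"}),  # hypothetical alert name
        start=start,
        end=start + timedelta(minutes=30),
    )
    during = start + timedelta(minutes=5)
    expected = Alert("HighRestartCount", "billing-api", False, False, during)
    unexpected = Alert("HighErrorRate", "billing-api", True, False, during)
    print(route(expected, [rollout]))    # the rollout's own alert
    print(route(unexpected, [rollout]))  # a customer-facing alert during the rollout
```

One design choice worth noting: the window names individual alerts rather than muting the whole service, so an alert the rollout was not expected to cause still reaches the on-call engineer while the deployment is in progress.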

Solving such issues has to be a team effort that also involves the dev teams. You can start by recognizing customer-facing uptime and a sustainable on-call process as the priorities.

