Rolling Out a Robust On-Call Process to Your Team

Published at: 8/27/2024
Categories: incidentresponse, oncall, devops, sre
Author: talonx

What is the best way to roll out an on-call schedule to your team?

If it’s a seasoned team that has been on call before, your task is easier. Most of us are not that lucky. Your team is probably a mix of people with different levels of experience and familiarity with different ways of working.

Why On-Call?

I personally go by the philosophy of “You Build It, You Run It”, partly because it aligns strongly with the DevOps movement’s original ethos, but more so because this adage underlines a number of attributes that a good engineer should try to embody: accountability, ownership, shared mission, and seeing the bigger picture.

Adopting a team philosophy cannot be done by mandate. It can only be done by example. There is usually some resistance - sometimes unspoken - to rolling out an on-call schedule to a dev team. In this article I have a list of points to make this easier (and not just for devs). This list is by no means exhaustive. It’s based on my experience so there are bound to be gaps.

Get These Right

Set the Correct Expectations for Your People

Make sure people are sold on the idea and why it’s important. If your organization’s philosophy is “You Build It, You Run It”, that’s a great starting point.

You might have to explain what being on-call means, and what is expected from team members. Create a list of points to talk through, and list out typical questions that people might shy away from asking in public (“What if I am in the middle of dinner and I get paged?”). See below for a template.

It’s important to highlight that for folks who will be on call for the first time, there might be a learning curve, and that’s ok.

On-Call Template

Set the Correct Expectations for Your Systems

It’s not possible to get a perfect process in the first few days or even weeks.

Establish an iterative process where your team:

  • Fine-tunes and fixes noisy alerts
  • Keeps improving runbooks
  • Works out an on-call schedule that suits them

There is no single process that works for all teams. Your goal is to understand what works best for each team and guide them. Gather feedback from people on what’s working and what’s not. Look at weekly reports for tell-tale signals, such as too many alerts for one service or escalations that happened because nobody responded to a page.
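
As one way to spot those signals, a week of alerts could be reduced to per-service counts. The sketch below is illustrative only: the `Alert` record, its fields, and the `noisy_threshold` default of 10 are assumptions, not the export format of any particular paging tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical record; a real paging tool will export more fields.
    service: str
    escalated: bool  # True if the page was escalated because nobody responded


def weekly_summary(alerts, noisy_threshold=10):
    """Summarize one week of alerts into counts and warning signs."""
    per_service = Counter(a.service for a in alerts)
    escalations = Counter(a.service for a in alerts if a.escalated)
    # A service is flagged as noisy once it reaches the (assumed) threshold.
    noisy = sorted(s for s, n in per_service.items() if n >= noisy_threshold)
    return {
        "total": len(alerts),
        "per_service": dict(per_service),
        "escalated": dict(escalations),
        "noisy_services": noisy,
    }
```

The right threshold depends on the team; the point is to review the same few numbers every week so that trends become visible.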

Own the process and fix it as you go along.

Fix False Alerts

Hunt down and fix false alerts ruthlessly.

False alerts can occur for different reasons:

  • You might have configured only Critical alerts to be pushed into your paging system, and an alert was mistakenly marked Critical when it should have been of lower severity.
  • You use a metrics-based system like Prometheus and there was an error in your metric query.
  • Your uptime monitor pings an HTTPS endpoint which has authentication and somebody changed the credentials.
  • Your alert checks for a threshold value which has changed and there’s a new “normal” for it.
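
The last case, a threshold that no longer matches the “normal” range, can sometimes be caught with a simple check against recent data. The following is a rough heuristic, not a standard technique: the function name and the 20% `max_breach_ratio` default are assumptions chosen for illustration.

```python
def threshold_looks_stale(recent_values, threshold, max_breach_ratio=0.2):
    """Return True when ordinary readings breach the threshold too often.

    recent_values: metric samples from a period known to be healthy.
    threshold: the value above which the alert currently fires.
    max_breach_ratio: assumed tolerance; tune it per alert.
    """
    if not recent_values:
        return False  # No data, so nothing can be concluded.
    breaches = sum(1 for value in recent_values if value > threshold)
    return breaches / len(recent_values) > max_breach_ratio
```

If healthy-period samples regularly sit above the threshold, the alert is likely describing the old normal and the threshold needs revisiting.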

Don’t delay fixing false alerts. I’ve failed at this in the past, and the results were not pretty.

Your paging system loses credibility with each false alert. Unfortunately, it might not always be possible to fix a false alert that feeds into the paging system. The only option then is to disable it.

Choose Your Tools Carefully

Most teams end up with multiple monitoring systems with each system generating incidents. Irrespective of how many incident generation systems you have, make sure they feed into a single paging/on-call system. Choose the on-call system based on the following:

  • Flexibility of scheduling - Different teams might have different on-call needs.
  • Ability to escalate easily.
  • Multiple options for sending notifications.
  • An easy way to override a team member’s schedule when somebody else has to take over temporarily.
  • Reporting abilities - How many alerts were fired during the week? How many were escalated? Which services had the most alerts? These will help you to find patterns and fix problem areas.
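
To make the scheduling and override points concrete, here is a minimal sketch of a weekly rotation with temporary overrides. The function name, the weekly hand-off, and the names in the rotation are all hypothetical; real paging tools provide this through their own scheduling features.

```python
from datetime import date


def on_call_for(day, rotation, start, overrides=None):
    """Return who is on call on `day`.

    rotation: ordered list of people, one week each (assumed hand-off period).
    start: the date on which the first person in the rotation begins.
    overrides: optional dict mapping a date to the person covering that day.
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]  # A temporary override wins over the rotation.
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["asha", "ben", "chen"]
start = date(2024, 9, 2)
overrides = {date(2024, 9, 10): "chen"}  # chen covers for ben on this day

print(on_call_for(date(2024, 9, 4), rotation, start))              # asha
print(on_call_for(date(2024, 9, 10), rotation, start, overrides))  # chen
```

Keeping overrides separate from the base rotation means a one-off swap never disturbs the regular schedule.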

Your knowledge base is equally important. It’s where you will store your runbooks and post-mortem reports. Ideally it integrates easily with your issue tracking: every post-mortem produces a to-do list, and those items should be tracked in your issue tracking system.

The number of such tickets fixed is an important metric to measure.

Having an easy way for an on-call engineer to know who is on-call for other services/teams makes it convenient for people to reach out when an incident needs cross-team intervention. Your paging tool will have a roster view for this.

Lead by Example

Take the lead in:

  • Updating runbooks.
  • Sharing status updates for incidents in team communication channels.
  • Doing root cause analysis. The 5-whys method is good enough for most cases.
  • Fixing post-mortem tickets.
  • Staying open to feedback.
  • Leading blameless post-mortems. Blame the process, system, software - not people.

Expect the Unknown To Happen

Systems are imperfect, and people more so.

What if a first-time on-call engineer forgets to take their laptop along while on call? Murphy’s Law will kick in at some point, so be prepared for it.

Practice Empathy

Be in touch with your team’s needs. Talk to them regularly to understand their challenges with the on-call process.

Integrate the human factor into the process. A team member has a new baby? They are already losing sleep - take that into account when scheduling them for on-call. If you schedule them for less, you can create a process where they can take on more on-call duties later within a certain period of time so that it’s fair to others.

Celebrate Wins

Last but not least, talk about how your on-call process is helping your team stay on top of incidents and thus helping the overall business. Encourage team members to share when runbooks helped them.

It’s a shared system and when it works, everybody wins.

The End Goal

There is no end goal here, because creating and maintaining a healthy on-call culture is an ongoing effort. It’s a journey, not a destination. When you invest in on-call, you are investing in the overall reliability of your systems, and in creating a more fulfilled engineering team.
