Rolling Out a Robust On-Call Process to Your Team

Published at 8/27/2024

Why On-Call?

I personally go by the philosophy of “You Build It, You Run It”, partly because it aligns very strongly with the DevOps movement’s original ethos, but more so because this adage underlines a number of attributes that a good engineer should try to imbibe - accountability, ownership, shared mission, and seeing the bigger picture.

Adopting a team philosophy cannot be done by mandate. It can only be done by example. There is usually some resistance - sometimes unspoken - to rolling out an on-call schedule to a dev team. In this article I list points that make this easier (and not just for devs). The list is by no means exhaustive. It’s based on my experience, so there are bound to be gaps.

Get These Right

Set the Correct Expectations for Your People

Make sure people are sold on the idea and why it’s important. If your organization’s philosophy is “You Build It, You Run It”, that’s a great starting point.

You might have to explain what being on-call means, and what is expected from team members. Create a list of points to talk through, and list out typical questions that people might shy away from asking in public (“What if I am in the middle of dinner and I get paged?”). See below for a template.

It’s important to highlight that folks who will be on call for the first time might face a learning curve, and that’s OK.

Set the Correct Expectations for Your Systems

It’s not possible to get a perfect process in the first few days or even weeks.

Establish an iterative process where your team:

Can fine-tune and fix noisy alerts
Can keep improving runbooks
Can work out a schedule that works for them

There is no single process that works for all teams. Your goal is to understand what works best for each team and guide them. Gather feedback from people on what’s working and what’s not. Look at weekly reports for tell-tale signals, such as too many alerts for one service or escalations caused by pages that went unanswered.
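If your paging tool can export the week's alerts, a small script is enough to surface those signals. The following is a minimal sketch, not any particular tool's API: the `Alert` record, its fields, and the `noisy_threshold` of 10 are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical export record; real paging tools will have more fields."""
    service: str
    escalated: bool  # True if the page went unanswered and moved up the chain


def weekly_summary(alerts, noisy_threshold=10):
    """Summarise one week of alerts per service.

    noisy_threshold is an assumed cut-off; tune it to your own baseline.
    """
    per_service = Counter(a.service for a in alerts)
    escalations = Counter(a.service for a in alerts if a.escalated)
    # A service is flagged as noisy when it fired more alerts than the threshold.
    noisy = sorted(s for s, n in per_service.items() if n > noisy_threshold)
    return {
        "per_service": dict(per_service),
        "escalations": dict(escalations),
        "noisy": noisy,
    }
```

Reviewing the `noisy` and `escalations` entries with the team each week gives the iterative process a concrete starting point.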

Own the process and fix it as you go along.

Fix False Alerts

Hunt down and fix false alerts ruthlessly.

False alerts can occur for different reasons:

You might have configured only Critical alerts to be pushed into your paging system, and an alert was mistakenly marked Critical when it should have been of lower severity.
You use a metrics-based system like Prometheus and there was an error in your metric query.
Your uptime monitor pings an HTTPS endpoint which has authentication and somebody changed the credentials.
Your alert checks for a threshold value which has changed and there’s a new “normal” for it.

Don’t delay fixing false alerts. I’ve failed at this in the past, and the results were not pretty.

Your paging system loses credibility with each false alert. Unfortunately, it might not always be possible to fix a false alert that feeds into the paging system. The only option then is to disable it.
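Both ideas, paging only on Critical alerts and disabling an alert that cannot be fixed, can be expressed as a small filter in front of the paging system. This is a rough sketch under assumptions: the alert dictionary shape, the `PAGING_SEVERITIES` set, and the `muted_alerts` set are hypothetical, and most paging tools offer their own routing rules for this.

```python
# Assumption: only Critical alerts should page a human.
PAGING_SEVERITIES = {"critical"}


def should_page(alert, muted_alerts):
    """Decide whether an alert should reach the on-call engineer.

    alert is a dict with "name" and "severity" keys (hypothetical shape).
    muted_alerts is a set of names for known false alerts that could not
    be fixed and were therefore disabled.
    """
    if alert["name"] in muted_alerts:
        return False
    return alert["severity"].lower() in PAGING_SEVERITIES
```

Keeping the muted list short and reviewed means a disabled alert stays a visible decision rather than a forgotten one.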

Choose Your Tools Carefully

Most teams end up with multiple monitoring systems, each generating incidents. Irrespective of how many incident generation systems you have, make sure they feed into a single paging/on-call system. Choose the on-call system based on the following:

Flexibility of scheduling - Different teams might have different on-call needs.
Ability to escalate easily.
Multiple options for sending notifications.
An easy way to override a team member’s schedule when somebody else has to take over temporarily.
Reporting abilities - How many alerts were fired during the week? How many were escalated? Which services had the most alerts? These will help you to find patterns and fix problem areas.

Your knowledge base is equally important. It’s where you will store your runbooks and post-mortem reports. It’s ideal if it integrates easily with your issue tracking - any post-mortem will result in a ToDo list, and its items should be tracked in your issue tracking system.

The number of such tickets fixed is an important metric to measure.

Having an easy way for an on-call engineer to know who is on-call for other services/teams makes it convenient for people to reach out when an incident needs cross-team intervention. Your paging tool will have a roster view for this.

Lead by Example

Take the lead in:

Updating runbooks.
Sharing status updates for incidents in team communication channels.
Doing root cause analysis. The 5-whys method is good enough for most cases.
Fixing post-mortem tickets.
Staying open to feedback.
Leading blameless post-mortems. Blame the process, system, software - not people.

Expect the Unknown To Happen

Systems are imperfect, and people more so.

A first-time on-call engineer forgets to take their laptop along somewhere while on call? Murphy’s Law will kick in at some point, so be prepared for it.

Practice Empathy

Be in touch with your team’s needs. Talk to them regularly to understand their challenges with the on-call process.

Integrate the human factor into the process. A team member has a new baby? They are already losing sleep - take that into account when scheduling them for on-call. If you schedule them for less, you can create a process where they can take on more on-call duties later within a certain period of time so that it’s fair to others.
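One way to keep this fair without manual bookkeeping is to always give the next shift to the available person who has done the fewest so far. The sketch below assumes a simple in-memory tally; the names and the `unavailable` set are hypothetical, and it does not enforce the "within a certain period of time" limit, which you would need to add for your own team.

```python
def next_on_call(shift_counts, unavailable=frozenset()):
    """Pick the available person with the fewest shifts so far.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [p for p in shift_counts if p not in unavailable]
    if not candidates:
        raise ValueError("nobody is available for this shift")
    return min(candidates, key=lambda p: (shift_counts[p], p))


def assign_shift(shift_counts, unavailable=frozenset()):
    """Assign the next shift and update the tally in place."""
    person = next_on_call(shift_counts, unavailable)
    shift_counts[person] += 1
    return person
```

Someone who sits out a few rotations ends up with the lowest count, so they are picked first once they are available again and the load evens out over time.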

Celebrate Wins

Last but not least - talk about how your on-call process is helping your team stay on top of incidents and thus helping the overall business. Encourage team members to talk about when runbooks helped them.

It’s a shared system and when it works, everybody wins.

The End Goal

There is no end goal here, because creating and maintaining a healthy on-call culture is an ongoing effort. It’s a journey, not a destination. When you invest in on-call, you are investing in the overall reliability of your systems, and in creating a more fulfilled engineering team.
