On-call Manual: Measuring the quality of the on-call

Published at
7/4/2024
Categories
oncall
softwareengineering
career
devops
Author
moozzyk

Reasonable on-call is no accident. Getting there requires a lot of hard work. But how can you tell if you’re on the right track if the experience can completely change from one shift to another? One answer to this question is monitoring.

How does monitoring help?

At a high level, monitoring can tell you whether the on-call duty is improving, staying the same, or deteriorating over a longer period. Understanding the trend is important for deciding whether the current investment in keeping the on-call reasonable is sufficient.

At a more granular level, monitoring helps identify the areas that need the most attention, such as:

  • noisy alerts
  • problematic dependencies
  • features causing customer complaints
  • repetitive tasks

Continuously addressing the top issues will gradually improve the overall on-call experience.

What metrics to monitor

There is no single correct answer to which metrics to monitor. It depends a lot on what the team does. For example, frontend teams may choose to monitor the number of tickets opened by customers, while backend teams may want to focus more on the time spent fixing broken builds or failing tests. Here are some metrics to consider:

  • outages of the products the team owns
  • external incidents impacting the products the team owns
  • the number of alerts, broken down by urgency
  • the number of alerts acted on versus ignored
  • the number of alerts outside working hours
  • time to acknowledge alerts
  • the number of tickets opened by customers
  • the number of internal tasks
  • build breaks
  • test failures

How to monitor?

On-call monitoring is difficult because no single metric can reflect the health of the on-call. My team uses both quantitative metrics (data) and qualitative metrics (opinions).

Quantitative metrics

Quantitative metrics can usually be collected from alerting systems, bug trackers, and task management systems. Here are a few examples of quantitative metrics we are tracking on our team:

  • the number of alerts
  • the number of tasks
  • the number of alerts outside working hours
  • the noisiest alerts, tracked by alert ID
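
As a rough sketch of how such numbers could be derived from exported alert events, the Python below counts alerts per shift, counts those outside working hours, and ranks the noisiest alert IDs. The event format, the 9:00 to 17:00 weekday working hours, and the function names are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours: weekdays, 9:00 to 17:00 local time.
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def is_outside_working_hours(ts):
    """True for weekends and for weekday times before 9:00 or from 17:00 on."""
    return ts.weekday() >= 5 or not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def summarize_shift(alert_events):
    """Summarize a shift given a list of (alert_id, ISO timestamp) pairs."""
    timestamps = [datetime.fromisoformat(ts) for _, ts in alert_events]
    return {
        "total_alerts": len(alert_events),
        "outside_working_hours": sum(is_outside_working_hours(t) for t in timestamps),
        # The three alert IDs that fired most often during the shift.
        "noisiest": Counter(alert_id for alert_id, _ in alert_events).most_common(3),
    }


# Hypothetical events for one shift.
shift = [
    ("queue-backlog", "2024-07-01T10:15:00"),  # Monday, working hours
    ("queue-backlog", "2024-07-01T22:40:00"),  # Monday night
    ("cert-expiry", "2024-07-02T11:00:00"),    # Tuesday, working hours
    ("queue-backlog", "2024-07-06T08:00:00"),  # Saturday
]
summary = summarize_shift(shift)
print(summary)
```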

As quantitative metrics are collected automatically, we built a dashboard to show them in an easy-to-understand way. Keeping historical data allows us to track trends.
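
One simple way to turn historical data into a trend is to compare the average of the most recent shifts with the average of the shifts before them. The sketch below assumes one number per shift (for example, the alert count); the window of four shifts and the 10% tolerance are arbitrary choices for illustration, not values our dashboard uses.

```python
from statistics import mean


def classify_trend(values, window=4, tolerance=0.10):
    """Compare the mean of the last `window` values with the preceding `window`.

    Lower values are assumed to be better (fewer alerts, fewer tasks).
    """
    if len(values) < 2 * window:
        return "not enough data"
    previous = mean(values[-2 * window:-window])
    recent = mean(values[-window:])
    if previous == 0:
        return "staying the same" if recent == 0 else "deteriorating"
    change = (recent - previous) / previous
    if change > tolerance:
        return "deteriorating"
    if change < -tolerance:
        return "improving"
    return "staying the same"


# Hypothetical alert counts for the last eight shifts, oldest first.
alerts_per_shift = [42, 38, 45, 40, 31, 29, 33, 27]
trend = classify_trend(alerts_per_shift)
print(trend)
```

Averaging over several shifts smooths out the shift-to-shift swings, which is what makes the longer-term direction visible.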

Qualitative metrics

Qualitative metrics are opinions about the shift from the person ending the shift. Using qualitative metrics in addition to quantitative metrics is necessary because numbers are sometimes misleading. Here is an example: handling a dozen tasks that can be closed almost immediately without much effort is easier than collaborating with a few teams to investigate a hard-to-reproduce customer report. However, considering only how many tasks each on-call got during their shift, the first shift appears heavier than the second.

On our team, each person going off-call fills out an On-call survey that is part of the On-call report. Here are some of the questions from the survey:

  • Rate your on-call experience from 1 to 10 (1: easy, 10: horrible)
  • Rate your experience with resources available for resolving on-call issues (e.g., runbooks, documentation, tools, etc.) from 1 to 10 (1: no resources or very poor resources, 10: excellent resources that helped solve issues quickly)
  • How much time did you spend on urgent activities like alerts, fire fighting, etc. (0%-100%)?
  • How much time did you spend on non-urgent activities like non-urgent tasks, noise, etc. (0%-100%)?
  • Additional comments (free flow)
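
To show how survey answers might be stored next to the task counts, here is a minimal Python sketch. The `SurveyResponse` type, its field names, and the sample values are hypothetical; they mirror the questions above but are not our actual survey tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SurveyResponse:
    """One off-call survey; fields mirror the questions above (names are assumed)."""
    engineer: str
    experience: int      # 1: easy, 10: horrible
    resources: int       # 1: very poor resources, 10: excellent resources
    urgent_pct: int      # 0-100
    non_urgent_pct: int  # 0-100
    tasks_handled: int   # quantitative count, assumed to come from the task tracker
    comments: str = ""

    def __post_init__(self):
        # Reject answers outside the ranges stated in the survey.
        if not 1 <= self.experience <= 10 or not 1 <= self.resources <= 10:
            raise ValueError("ratings must be between 1 and 10")
        if not 0 <= self.urgent_pct <= 100 or not 0 <= self.non_urgent_pct <= 100:
            raise ValueError("percentages must be between 0 and 100")


# Hypothetical responses: many quick tasks versus a few hard ones.
responses = [
    SurveyResponse("A", experience=3, resources=8, urgent_pct=10, non_urgent_pct=30, tasks_handled=12),
    SurveyResponse("B", experience=8, resources=4, urgent_pct=60, non_urgent_pct=10, tasks_handled=3),
]

avg_experience = mean(r.experience for r in responses)
hardest = max(responses, key=lambda r: r.experience)
busiest = max(responses, key=lambda r: r.tasks_handled)

print(f"average experience rating: {avg_experience}")
print(f"hardest shift by rating: {hardest.engineer}, busiest by task count: {busiest.engineer}")
```

In this made-up data, the shift with the most tasks is not the one rated hardest, which is the kind of mismatch the survey is meant to surface.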

We’ve been conducting this survey for a couple of years now. One interesting observation I made is that a shift one person finds horrible is often decent for someone else. Experienced on-call engineers usually rate their shifts as easier than developers who have just finished their first shift do. This is understandable. We still treat all opinions equally: improving the on-call quality for one person improves it for everyone.

The Additional comments question is my favorite as it provides insights no other metric can capture.

Call to Action

If being on-call is part of your team’s responsibilities and you don’t monitor it, I highly encourage you to start doing so. Even a simple monitoring system will tell you a lot about your on-call and allow you to improve it by addressing the most annoying issues.


💙 If you liked this article...

I publish a weekly newsletter for software engineers who want to grow their careers. I share mistakes I’ve made and lessons I’ve learned over the past 20 years as a software engineer.

Sign up here to get articles like this delivered to your inbox:
https://www.growingdev.net/
