Ask Austin: Putting The IR into ObseRvabIlity

Published: 4/29/2022
Categories: observability, oncall, devops, tooling
Author: austinlparker

If you've listened to me on a podcast, or spoken to me online, you might have heard me claim that 'observability is the foundation that you build on.' What does that mean, though? In this edition of 'Ask Austin,' I'll share some insights I've picked up talking to our customers about how observability is a foundational component of DevOps practices, such as incident response.

Schrödinger's 🐈

The most crucial currency of a startup is time: you're under immense pressure to ship features and write reliable, resilient code. Into every life a little rain must fall, though, and failure seems inevitable. Some people might tell you that failure can be boxed away where it won't ever happen, but it really is impossible to create something perfect. We have to understand, tolerate, and ameliorate failure rather than shy away from it.

Observability is absolutely critical to this effort. Without observability, you can't quantify failure or know how to prioritize it. More specifically, without observability, you can't connect system state to business outcomes using SLOs, which means you lack the ability to understand failure in the context of what you're actually trying to do.
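
To make that concrete, here's a minimal sketch, in Python, of what connecting system state to an SLO can look like: measure how much of the error budget implied by your target has been spent in a window. The 99.9% target and the request counts are hypothetical numbers for illustration, not a recommendation.

```python
# Minimal sketch: turn raw request counts (system state) into an error-budget
# figure you can reason about against business goals. All numbers here are
# hypothetical examples.

def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget left for this window (1.0 = untouched)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - (failed_requests / allowed_failures)

# 1,000,000 requests with 400 failures against a 99.9% target:
# 1,000 failures were allowed, so 60% of the budget remains.
remaining = error_budget_remaining(1_000_000, 400)
print(f"error budget remaining: {remaining:.0%}")  # error budget remaining: 60%

# A simple SLO-driven alert condition: page only when the budget is nearly gone.
if remaining < 0.1:
    print("page the on-call engineer")
```

Real SLO tooling does this over rolling windows and burn rates rather than a single snapshot, but even this toy version forces the useful question: did this failure actually spend budget that matters to users?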

This is where things get tricky, though: your SLOs tell you when performance is impacting users (https://go.lightstep.com/register-the-secret-to-high-quality-slos-on-demand-webinar.html), but how do you start to tolerate and ameliorate that failure? This is where observability alone can't help; you need to adopt best practices and tools that can aid in communication and recovery efforts.

Building a Culture of Resiliency

To respond to and mitigate failure in your system, it's not enough to be aware of it. Your ability to respond starts before failures even happen, before you even write the first line of code in a service. You need to start out the way you intend to keep going. Here are three strategies you should adopt to build a culture of resiliency in your engineering team:

  • Monitoring and observability should be 'day zero' problems. Ensure that every new feature or service emits appropriate tracing, metrics, and logging telemetry, captured through OpenTelemetry tools, and test those telemetry outputs as part of your CI/CD process (there's a sketch of what such a test can look like after this list).

  • Effective incident response relies on documentation and communication. Create a defined process for building playbooks that include links to dashboards, explanations of alerts and SLOs, and service ownership. Keep them up to date as part of your recurring engineering workload, especially as the team grows or contracts. If you're writing everything down as you go along, you're more resilient to unexpected changes in team composition.

  • Understand alerts before you build them. Ideally, you should only be alerting off of SLOs for application behavior! I've seen too many people burn out by blindly setting up alerts on their infrastructure resources, then spending hours or days chasing down failures that didn't actually matter much to their customers. Alerts need tending and grooming, just like anything else; you can try it yourself with your existing alerts: spend a week on-call noting whether each triggered alert actually impacted the customer experience in some way.
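
Here's a minimal sketch of the 'day zero' idea from the first bullet: instrument a code path with OpenTelemetry and assert in CI that the telemetry you'll alert on actually shows up. It assumes the Python opentelemetry-sdk package; the `checkout` function and the `order.id` attribute are hypothetical stand-ins for your own service code.

```python
# Sketch of testing telemetry output in CI, assuming the opentelemetry-sdk
# package. Service names and attributes here are hypothetical examples.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Keep finished spans in memory so a test can inspect them directly.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def checkout(order_id: str) -> None:
    # The code path under test emits a span carrying the attributes that
    # dashboards and SLO alerts will rely on.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)

def test_checkout_emits_telemetry() -> None:
    # Run this in CI: the build fails if the span or its attributes go missing.
    checkout("order-123")
    spans = exporter.get_finished_spans()
    assert any(
        s.name == "checkout" and s.attributes.get("order.id") == "order-123"
        for s in spans
    )
```

In a real pipeline you'd swap the in-memory exporter for your production exporter configuration, but the point stands: treat missing telemetry as a failing build, not something you discover during your first incident.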

Incidents Happen

It's inevitable that things will break, so we need to build systems that tolerate failure. Observability and incident response tools are helpful in this quest. Observability tells you where, and why, a failure occurred. Incident response tools help you manage and respond to those failures. The real question you need to ask yourself is how much time you want to spend on the management of your observability and incident response platforms.

While tools such as Prometheus and Upptime will let you monitor, alert, and communicate incidents to your users, there's a lot more to really understanding your system than tracking metrics and pinging your page. Stitching together your own bespoke solution can cost weeks or months of engineering time alone, which strains already over-extended engineers. Moreover, wouldn't you rather work on things that provide value to your end users than spin up Grafana again?

Traditional paid services are an option, but many of them incur a management cost. Seat-based pricing that controls how many people can access the tool is a pain, especially since most companies make single sign-on an 'enterprise' feature. This isn't universally true; Lightstep has no limits on the number of users of our tools, because your data is useful to all of your engineers, not just a few. If you're already feeling underwater managing alert fatigue, try us out and see the difference for yourself.

If this all seems overwhelming, I understand. Especially if you're at a smaller company or on a smaller team, some of this probably feels like stuff you haven't started to deal with in earnest yet. The best advice I can give you is to find your peers and build community with them. All of the advice in the world is great, but finding someone who's been where you are and can help direct you along the way is even better. If you're interested in being a part of ours, you can join us on our Discord.

I hope this answered a few of y'all's questions about observability and incident response. If you've got more burning questions, feel free to hit me up on our Discord server or on Twitter and I'll answer them next time!
