Ask Austin: Putting The IR into ObseRvabIlity

Published: 4/29/2022
Categories: observability, oncall, devops, tooling
Author: austinlparker

If you've listened to me on a podcast, or spoken to me online, you might have heard me claim that 'observability is the foundation that you build on.' What does that mean, though? In this edition of 'Ask Austin,' I'll share some insights I've picked up talking to our customers about how observability is a foundational component of DevOps practices, such as incident response.

Schrödinger's 🐈

The most crucial currency of a startup is time: you're under immense pressure to ship features and write reliable, resilient code. Into every life a little rain must fall, though, and failure seems inevitable. Some people might tell you that failure can be boxed away where it won't ever happen, but it really is impossible to create something perfect. We have to understand, tolerate, and ameliorate failure rather than shy away from it.

Observability is absolutely critical to this effort. Without observability, you can't quantify failure or know how to prioritize it. More specifically, without observability, you can't connect system state to business outcomes using SLOs, which means you lack the ability to understand failure in the context of what you're actually trying to do.
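
To make that concrete, here's a minimal sketch, in Python, of what connecting system state to an SLO can look like: measure how much of the error budget implied by your target has been spent in a window. The 99.9% target and the request counts are hypothetical numbers for illustration, not a recommendation.

```python
# Minimal sketch: turn raw request counts (system state) into an error-budget
# figure you can reason about against business goals. All numbers here are
# hypothetical examples.

def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget left for this window (1.0 = untouched)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - (failed_requests / allowed_failures)

# 1,000,000 requests with 400 failures against a 99.9% target:
# 1,000 failures were allowed, so 60% of the budget remains.
remaining = error_budget_remaining(1_000_000, 400)
print(f"error budget remaining: {remaining:.0%}")  # error budget remaining: 60%

# A simple SLO-driven alert condition: page only when the budget is nearly gone.
if remaining < 0.1:
    print("page the on-call engineer")
```

Real SLO tooling does this over rolling windows and burn rates rather than a single snapshot, but even this toy version forces the useful question: did this failure actually spend budget that matters to users?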

This is where things get tricky, though: your SLOs tell you when performance is impacting users (https://go.lightstep.com/register-the-secret-to-high-quality-slos-on-demand-webinar.html), but how do you start to tolerate and ameliorate that failure? This is where observability alone can't help; you need to adopt best practices and tools that can aid in communication and recovery efforts.

Building a Culture of Resiliency

To respond to and mitigate failure in your system, it's not enough to be aware of it. Your ability to respond starts before failures even happen, before you even write the first line of code in a service. You need to start out the way you intend to keep going. Here are three strategies you should adopt to build a culture of resiliency in your engineering team:

  • Monitoring and observability should be 'day zero' problems. Ensure that every new feature or service emits appropriate tracing, metrics, and logging telemetry, captured through OpenTelemetry tools, and test those telemetry outputs as part of your CI/CD process (there's a sketch of what such a test can look like after this list).

  • Effective incident response relies on documentation and communication. Create a defined process for building playbooks that include links to dashboards, explanations of alerts and SLOs, and service ownership. Keep them up to date as part of your recurring engineering workload, especially as the team grows or contracts. If you're writing everything down as you go along, you're more resilient to unexpected changes in team composition.

  • Understand alerts before you build them. Ideally, you should only be alerting off of SLOs for application behavior! I've seen too many people burn out by blindly setting up alerts on their infrastructure resources, then spending hours or days chasing down failures that didn't actually matter much to their customers. Alerts need tending and grooming, just like anything else; you can try it yourself with your existing alerts: spend a week on-call noting whether each triggered alert actually impacted the customer experience in some way.
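
Here's a minimal sketch of the 'day zero' idea from the first bullet: instrument a code path with OpenTelemetry and assert in CI that the telemetry you'll alert on actually shows up. It assumes the Python opentelemetry-sdk package; the `checkout` function and the `order.id` attribute are hypothetical stand-ins for your own service code.

```python
# Sketch of testing telemetry output in CI, assuming the opentelemetry-sdk
# package. Service names and attributes here are hypothetical examples.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Keep finished spans in memory so a test can inspect them directly.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def checkout(order_id: str) -> None:
    # The code path under test emits a span carrying the attributes that
    # dashboards and SLO alerts will rely on.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)

def test_checkout_emits_telemetry() -> None:
    # Run this in CI: the build fails if the span or its attributes go missing.
    checkout("order-123")
    spans = exporter.get_finished_spans()
    assert any(
        s.name == "checkout" and s.attributes.get("order.id") == "order-123"
        for s in spans
    )
```

In a real pipeline you'd swap the in-memory exporter for your production exporter configuration, but the point stands: treat missing telemetry as a failing build, not something you discover during your first incident.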

Incidents Happen

It's inevitable that things will break, so we need to build systems that tolerate failure. Observability and incident response tools are helpful in this quest. Observability tells you where, and why, a failure occurred. Incident response tools help you manage and respond to those failures. The real question you need to ask yourself is how much time you want to spend on the management of your observability and incident response platforms.

While tools such as Prometheus and Upptime will let you monitor, alert, and communicate incidents to your users, there's a lot more to really understanding your system than tracking metrics and pinging your page. Stitching together your own bespoke solution can cost weeks or months of engineering time alone, which strains already over-extended engineers. Moreover, wouldn't you rather work on things that provide value to your end users than spin up Grafana again?

Traditional paid services are an option, but many of them incur a management cost. Seat-based pricing that controls how many people can access the tool is a pain, especially since most companies make single sign-on an 'enterprise' feature. This isn't universally true; Lightstep has no limits on the number of users of our tools, because your data is useful to all of your engineers, not just a few. If you're already feeling underwater managing alert fatigue, try us out and see the difference for yourself.

If this all seems overwhelming, I understand. Especially if you're at a smaller company or on a smaller team, some of this probably feels like stuff you haven't started to deal with in earnest yet. The best advice I can give you is to find your peers and build community with them. All of the advice in the world is great, but finding someone who's been where you are and can help direct you along the way is even better. If you're interested in being a part of ours, you can join us on our Discord.

I hope this answered a few of y'all's questions about observability and incident response. If you've got more burning questions, feel free to hit me up on our Discord server or on Twitter and I'll answer them next time!
