
Demystifying Observability 2.0

Published: 6/10/2024
Categories: observability, opentelemetry, cloudnative, apm
Author: avillela

Staircase in the Sagrada Família, Barcelona, Spain. Photo by Adriana Villela.

Our systems have gotten complex. Like really complex. Organizations have mostly shifted from monoliths to microservices. They've embraced the Cloud, and with it, Kubernetes (PS: happy 10th b-day to Kubernetes!) and all sorts of other cloud native tools that help run the things that we've grown accustomed to having in our tech-dependent lives: access to government services, social media, airline booking, shopping, streaming services, and so on.

As our systems get more and more complex, engineers need a way to understand them when things go 💩, so that services can be restored in a timely manner.

Enter Observability, which helps with just that. Observability has been around for a while now, and it's been really exciting to see so many organizations embarking on their respective observability journeys.

Now, if you've been following the interwebs, you may have heard some rumblings about Observability 2.0. Cool. But what is it really, and how does it differ from Observability 1.0? Well, you've come to the right place. Sit back, relax, and let me take you on a journey.

Defining Observability

Before we get into Observability 1.0 vs 2.0, let's start with a definition of Observability, also known as o11y to us folks who sometimes get lazy and don't want to write out the whole word. 🙃 (For the uninitiated: o11y == the 11 letters between "o" and "y" in "Observability".)

The ā€œclassicā€ definition of Observability comes from control theory:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

– Rudolf E. Kálmán

This definition was popularized by Charity Majors.

I love this definition, and I've used it for many years, including in my very first blog post on Observability, and more recently, in my O'Reilly Observability video course.

That being said, there's a refinement to the definition of Observability that I've been embracing of late, which was coined by my good friend, Hazel Weakly, who has an amazing blog post on redefining Observability. (Hazel is also incredibly smart and super astute and you should totally follow her on LinkedIn):

Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.

– Hazel Weakly

It's so simple, and so elegant, and I love it. Also, it applies to both Observability 1.0 and 2.0, and does not hold us back from continuing to refine Observability.

Okay, now that we've gotten the basics out of the way, let's tackle this 1.0 vs. 2.0 business.

I set out to write this piece because I've found myself talking a lot about Observability 2.0 recently, including last week on Whitney Lee's Enlightning show, and in an upcoming episode of The Cloud Gambit. After all this talking about it, I wanted a place to jot down my thoughts, and to also share them with y'all. I honestly thought it would be a straight regurgitation of what I'd already said. But then I asked Hazel to look over this piece, and her feedback encouraged me to think about this further, thereby refining some of my understanding and thoughts around this. Which is awesome, because it's so fitting, given that I'm talking about the evolution of our understanding of Observability!

Summary of Observability 1.0 vs 2.0 from my appearance on Enlightning with Whitney Lee.

Observability 1.0

When Observability burst onto the scene, it was still a very APM-dominated world. Many APM vendors, sensing that Observability was becoming an Actual Thing, pivoted to Observability. This pivot, however, was mostly in name only, in much the same way that many organizations pivoted from Ops to DevOps (or SRE or Platform Engineering) in name only. New name, but business as usual. And perhaps we can't blame them for that. These are paradigm shifts, and paradigm shifts are often hard to swallow. You've gotta start somewhere, and maybe a name change is as good a place as any.

So, time for the big reveal… Observability 1.0 is APM. But more specifically, what is Observability 1.0? Observability 1.0 is focused on:

1- How you operate your code

This means that it's more of an Ops concern, and not so much of an Everyone concern.

2- Known unknowns

Also known as "predictable shit happens". We know the usual things that go wrong with our systems, and we put dashboards in place to represent all of the things that we know can go wrong (and for which we know the fixes), so that we can keep an eye on things if they go sideways.

3- Multiple sources of truth

These "sources of truth" are traces, metrics, and logs, also known as "The Three Pillars". I actually hate that term, because it implies that these things are siloed from one another (more on that later). I much prefer the term "signal". A signal is anything that gives you data.

I suppose that the whole Three Pillars thing kind of makes sense for Observability 1.0, where traces, metrics, and logs were often not correlated. This is especially true since, in the early days of Observability, we didn't really have a common language for even talking about these signals. Each vendor had their own standard, and that may or may not have included a way to correlate the three signals.

I also want to add that there was much more of an emphasis on logs and metrics, because that's just something that developers and operators are familiar with. Traces had been around, but were not very widely used.

Observability 2.0

So now that we know what Observability 1.0 is all about, let's look at how it differs from Observability 2.0.

First things first. Credit where credit is due. The term "Observability 2.0" was coined by Charity Majors. Observability 2.0 is the acknowledgment that Observability, like all things tech and non-tech, continues to evolve. The evolution to Observability 2.0 is the recognition that we made a decent stab at Observability (i.e. 1.0), but unfortunately, it didn't really fulfill the promise of the definition of Observability that we saw earlier on. No problem, because things are constantly evolving.

So what makes Observability 2.0 different from 1.0? It has the following characteristics:

1- It's focused not only on how you operate your code, but also on how you develop your code

This means that Observability is part of the systems development lifecycle (SDLC), and is therefore a concern of developers, QAs, and SREs. How?

Developers instrument their code so that they can troubleshoot it during development. 🤯 Instrumentation is the process of adding code to software to generate telemetry signals for Observability purposes. Software engineers already rely on logging for troubleshooting (hello, "print" statements?), so why not add traces and metrics into the mix?
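If you're wondering what that looks like in practice, here's a minimal sketch using the OpenTelemetry Python SDK with a console exporter, so spans show up right in your terminal while you develop. The service, span, and attribute names are made up for illustration.

```python
# A minimal sketch of manual instrumentation with the OpenTelemetry Python SDK.
# Spans print to the console here; in a real setup you'd export to a backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider with a console exporter (handy for local troubleshooting).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def place_order(order_id: str, item_count: int) -> None:
    # Wrap the unit of work in a span, and attach attributes that will help
    # whoever (including future you) has to troubleshoot this code path.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", item_count)
        # ... business logic goes here ...


if __name__ == "__main__":
    place_order("order-123", 3)
```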

Quality Assurance (QA) analysts leverage instrumented code during testing. When they encounter a bug, QAs can use telemetry data to troubleshoot the code and file more detailed bug reports for developers. Or, if they're unable to troubleshoot the code with the telemetry provided, it means that the system has not been sufficiently instrumented. Again, they go back to the developers with that information so that the developers can add more instrumentation to the code.

QAs further take advantage of instrumented code by creating trace-based tests (TBT) for integration testing. In a nutshell, TBT leverages traces to create integration tests. For anyone interested in seeing TBT in action, the OpenTelemetry Demo leverages TBT using the open source version of Tracetest.
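To make the idea a bit more concrete, here's a rough, conceptual sketch of what a trace-based test boils down to (this is not the actual Tracetest API; the helper, span names, and trace ID are all hypothetical): trigger the system, fetch the spans from the resulting trace, then assert on what the trace contains.

```python
# A conceptual sketch of trace-based testing (not the real Tracetest API):
# fetch the spans produced by an end-to-end action and assert on the trace.

def fetch_spans(trace_id: str) -> list[dict]:
    # Hypothetical helper: a real TBT tool would query your trace backend here.
    return [
        {"name": "POST /checkout", "duration_ms": 120, "status": "OK"},
        {"name": "charge_payment", "duration_ms": 80, "status": "OK"},
        {"name": "reserve_inventory", "duration_ms": 25, "status": "OK"},
    ]


def test_checkout_trace():
    spans = fetch_spans("trace-abc123")  # hypothetical trace ID
    names = {span["name"] for span in spans}

    # The integration test passes only if the expected operations actually
    # happened, succeeded, and stayed within a latency budget.
    assert "charge_payment" in names
    assert "reserve_inventory" in names
    assert all(span["status"] == "OK" for span in spans)
    assert all(span["duration_ms"] < 500 for span in spans)
```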

SREs leverage instrumented code to create service-level objectives (SLOs). SLOs help us answer the question, "What is the reliability goal of this service?" SLOs are based on Service Level Indicators (SLIs), which are themselves based on metrics. Metrics that were instrumented by your developer! 🤯 SREs can create alerts based on these SLOs, so that when an SLO is breached, they're notified right away. Furthermore, since the SLO is ultimately tied to a metric (via an SLI), which was correlated to a trace (more on signal correlation shortly), the SRE knows where to start looking when an issue arises in production.
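As a back-of-the-napkin illustration (the request counts and the 99.9% target are made up), here's the arithmetic behind an availability SLI, its SLO, and the error budget an SRE would alert on:

```python
# A back-of-the-napkin SLI/SLO calculation with made-up numbers.
total_requests = 1_000_000
failed_requests = 1_200

# SLI: the proportion of good requests, derived from metrics.
sli = (total_requests - failed_requests) / total_requests  # 0.9988

# SLO: the reliability target for this service.
slo = 0.999  # 99.9% availability

# Error budget: how much unreliability the SLO allows vs. how much we've spent.
allowed_failures = total_requests * (1 - slo)          # 1,000 requests
budget_remaining = allowed_failures - failed_requests  # -200 -> budget blown

print(f"SLI={sli:.4f}, SLO={slo}, breached={sli < slo}, "
      f"error budget remaining={budget_remaining:.0f} requests")
```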

CI/CD pipelines are instrumented. CI/CD pipelines are the backbone of the modern SDLC. They are responsible for packaging and delivering code to production in a timely manner. When they fail, we can't get code into production, which means angry users. Nobody likes angry users. Ever. Therefore, having observable CI/CD pipelines allows us to address pipeline failures in a more timely manner to help alleviate software delivery bottlenecks.

2- It's focused on unknown unknowns

Also known as "unpredictable shit happens". Let's face it, you can't know every problem that there's ever going to be. This is especially true in the world of microservices, where services interact with each other in such weird and unpredictable ways because…well, we users tend to use systems in very weird and unpredictable ways! 🤯 Traditional dashboards can't save you, but SLO-based alerts can.

3- It's focused on a single source of truth: events

Wait…what? What about traces, metrics, and logs? Well, traces, metrics, and logs are all types of events. An event is information about a thing that happened. Events are structured (think JSON-like) and timestamped. Traces, metrics, and logs are therefore different types of events that serve different and important purposes, each contributing to the Observability story. Furthermore, they're all correlated. Instead of Three Pillars, they're more like the three strands that make up a braid (shoutout to my teammate Ted Young for this analogy).
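To put a (completely made-up) example behind that, here's what a log, a metric data point, and a span might look like as structured, timestamped records, all tied together by the same trace ID. The field names and IDs are illustrative, not any particular vendor's schema.

```python
# Made-up examples of a log, a metric data point, and a span, each expressed
# as a structured, timestamped event and correlated via a shared trace_id.
log_event = {
    "timestamp": "2024-06-10T14:03:07.120Z",
    "severity": "ERROR",
    "body": "payment declined",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}

metric_event = {
    "timestamp": "2024-06-10T14:03:07.500Z",
    "name": "checkout.payment.failures",
    "value": 1,
    "attributes": {"payment.provider": "acme-pay"},
    "exemplar_trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}

span_event = {
    "timestamp": "2024-06-10T14:03:07.100Z",
    "name": "charge_payment",
    "duration_ms": 85,
    "status": "ERROR",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}
```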

In addition, we now have a common standard for defining and correlating traces, metrics, and logs: OpenTelemetry. Most Observability vendors are all in on OpenTelemetry, which means that it has become the de facto standard for instrumenting code (and also the second most popular CNCF project in terms of contributions 🎉). It also means that these vendors all ingest the same data, and it's how they render that data that differentiates them from one another.

I also want to add that in this Observability story, we place traces front and center, since they help give us that end-to-end picture of what happens when someone does a thing to a system, with metrics and logs serving as supporting actors that add useful details to that picture. And of course, everything is correlated.

Final thoughts

Observability has come a long way from its early days, and Observability 2.0 is the acknowledgement that Observability is evolving, and most importantly, that weā€™re getting closer and closer to fulfilling the promise of Observability itself.

I canā€™t wait to see what the future has in store!

Now, please enjoy this photo of my rat Katie, enjoying some hangtime in the pocket of my husband's bathrobe. 💜

Katie rat enjoying some pocket hangtime. Photo by Adriana Villela

Until next time, peace, love, and code. ✌️💜👩‍💻
