Demystifying Observability 2.0
Our systems have gotten complex. Like really complex. Organizations have mostly shifted from monoliths to microservices. They’ve embraced the Cloud, and with it, Kubernetes (PS: happy 10th b-day to Kubernetes!) and all sorts of other cloud native tools that help run the things that we’ve grown accustomed to having in our tech-dependent lives: access to government services, social media, airline booking, shopping, streaming services, and so on.
As our systems get more and more complex, engineers need a way to understand them when things go 💩, so that services can be restored in a timely manner.
Enter Observability, which helps with just that. Observability has been around for a while now, and it’s been really exciting to see so many organizations embarking on their respective observability journeys.
Now, if you’ve been following the interwebs, you may have heard some rumblings about Observability 2.0. Cool. But what is it really, and how does it differ from Observability 1.0? Well, you’ve come to the right place. Sit back, relax, and let me take you on a journey.
Defining Observability
Before we get into Observability 1.0 vs 2.0, let’s start with a definition of Observability, also known as o11y to us folks who sometimes get lazy and don’t want to write out the whole word. (For the uninitiated: o11y == the 11 letters between “o” and “y” in “Observability”.)
The “classic” definition of Observability comes from control theory:
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
- Rudolf E. Kálmán
This definition was popularized by Charity Majors.
I love this definition, and I’ve used it for many years, including in my very first blog post on Observability, and more recently, in my O’Reilly Observability video course.
That being said, there’s a refinement to the definition of Observability that I’ve been embracing of late, which was coined by my good friend, Hazel Weakly, who has an amazing blog post on redefining Observability. (Hazel is also incredibly smart and super astute and you should totally follow her on LinkedIn):
Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.
- Hazel Weakly
It’s so simple, and so elegant, and I love it. Also, it applies to both Observability 1.0 and 2.0, and does not hold us back from continuing to refine Observability.
Okay, now that we’ve gotten the basics out of the way, let’s tackle this 1.0 vs. 2.0 business.
I set out to write this piece because I’ve found myself talking a lot about Observability 2.0 recently, including last week on Whitney Lee’s Enlightning show, and in an upcoming episode of The Cloud Gambit. After all this talking about it, I wanted a place to jot down my thoughts, and to also share them with y’all. I honestly thought it would be a straight regurgitation of what I’d already said. But then I asked Hazel to look over this piece, and her feedback encouraged me to think about this further, thereby refining some of my understanding and thoughts around this. Which is awesome, because it’s so fitting, given that I’m talking about the evolution of our understanding of Observability!
Observability 1.0
When Observability burst onto the scene, it was still a very APM-dominated world. Many APM vendors, sensing that Observability was becoming an Actual Thing, pivoted to Observability. This pivot, however, was mostly in name only, in much the same way that many organizations pivoted from Ops to DevOps (or SRE or Platform Engineering) in name only. New name, but business as usual. And perhaps we can’t blame them for that. These are paradigm shifts and paradigm shifts are often hard to swallow. You’ve gotta start somewhere, and maybe a name change is as good a place as any.
So, time for the big reveal… Observability 1.0 is APM. But more specifically, what is Observability 1.0? Observability 1.0 is focused on:
1- How you operate your code
This means that it’s more of an Ops concern, and not so much of an Everyone concern.
2- Known unknowns
Also known as “predictable shit happens”. We know the usual things that go wrong with our systems, and we put dashboards in place to represent all of the things that we know can go wrong (and for which we know the fixes), so that we can keep an eye on things if they go sideways.
3- Multiple sources of truth
These “sources of truth” are traces, metrics, and logs, also known as “The Three Pillars”. I actually hate that term, because it implies that these things are siloed from one another (more on that later). I much prefer the term “signal”. A signal is anything that gives you data.
I suppose that the whole Three Pillars thing kind of makes sense for Observability 1.0, where traces, metrics, and logs were often not correlated. This is especially true since, in the early days of Observability, we didn’t really have a common language for even talking about these signals. Each vendor had their own standard, and that may or may not have included a way to correlate the three signals.
I also want to add that there was much more of an emphasis on logs and metrics, because those are the signals that developers and operators are most familiar with. Traces have been around, but were not very widely used.
Observability 2.0
So now that we know what Observability 1.0 is all about, let’s look at how it differs from Observability 2.0.
First things first. Credit where credit is due. The term “Observability 2.0” was coined by Charity Majors. Observability 2.0 is the acknowledgment that Observability, like all things tech and non-tech, continues to evolve. The evolution to Observability 2.0 is the recognition that we made a decent stab at Observability (i.e. 1.0), but unfortunately, it didn’t really fulfill the promise of the definition of Observability that we saw earlier on. No problem, because things are constantly evolving.
So what makes Observability 2.0 different from 1.0? It has the following characteristics:
1- It’s focused not only on how you operate your code, but also on how you develop your code
This means that Observability is part of the systems development lifecycle (SDLC), and is therefore a concern of developers, QAs, and SREs. How?
Developers instrument their code so that they can troubleshoot it during development. 🤯 Instrumentation is the process of adding code to software to generate telemetry signals for Observability purposes. Software engineers already rely on logging for troubleshooting (hello, “print” statements?), so why not add traces and metrics into the mix?
Quality Assurance (QA) analysts leverage instrumented code during testing. When they encounter a bug, QAs can use telemetry data to enable them to troubleshoot code and file more detailed bug reports to developers. Or, if they’re unable to troubleshoot the code with the telemetry provided, it means that the system has not been sufficiently instrumented. Again, they go back to developers with that information so that developers can add more instrumentation to the code.
QAs further take advantage of instrumented code by creating trace-based tests (TBT) for integration testing. In a nutshell, TBT leverages traces to create integration tests. For anyone interested in seeing TBT in action, the OpenTelemetry Demo leverages TBT using the open source version of Tracetest.
SREs leverage instrumented code to create service-level objectives (SLOs). SLOs help us answer the question, “What is the reliability goal of this service?” SLOs are based on Service Level Indicators (SLIs), which are themselves based on metrics. Metrics that were instrumented by your developers! 🤯 SREs can create alerts based on these SLOs, so that when an SLO is breached, they’re notified right away. Furthermore, since the SLO is ultimately tied to a metric (via an SLI), which is correlated to a trace (more on signal correlation shortly), the SRE knows where to start looking when an issue arises in production.
CI/CD pipelines are instrumented. CI/CD pipelines are the backbone of the modern SDLC. They are responsible for packaging and delivering code to production in a timely manner. When they fail, we can’t get code into production, which means angry users. Nobody likes angry users. Ever. Therefore, having observable CI/CD pipelines allows us to address pipeline failures in a more timely manner to help alleviate software delivery bottlenecks.
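To make the metric → SLI → SLO chain a little more concrete, here’s a rough sketch in plain Python. It is not taken from any particular Observability tool: the `RequestCounts` type, the function names, and the 99.9% target are all made up for illustration. The idea is simply that a metric emitted by instrumented code (request totals and failures) feeds an SLI (the success ratio), which is then compared against an SLO target.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical metric totals emitted by instrumented code over a time window."""
    total: int
    failed: int


def availability_sli(counts: RequestCounts) -> float:
    """SLI: the fraction of requests in the window that succeeded."""
    if counts.total == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return (counts.total - counts.failed) / counts.total


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


def should_alert(sli: float, slo_target: float) -> bool:
    """A deliberately simple alert rule: fire when the SLI drops below the target."""
    return sli < slo_target


if __name__ == "__main__":
    slo_target = 0.999  # assumed goal: 99.9% of requests succeed
    window = RequestCounts(total=10_000, failed=5)
    sli = availability_sli(window)
    print(f"SLI={sli:.4f} alert={should_alert(sli, slo_target)}")
    print(f"budget remaining={error_budget_remaining(sli, slo_target):.2f}")
```

Real SLO tooling usually alerts on how fast the error budget is burning rather than on a single threshold, so treat the `should_alert` rule as a placeholder.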
2- It’s focused on unknown unknowns
Also known as “unpredictable shit happens”. Let’s face it, you can’t know every problem that there’s ever going to be. This is especially true in the world of microservices, where services interact with each other in such weird and unpredictable ways because… well, we users tend to use systems in very weird and unpredictable ways! 🤯 Traditional dashboards can’t save you, but SLO-based alerts can.
3- It’s focused on a single source of truth: events
Wait… what? What about traces, metrics, and logs? Well, traces, metrics, and logs are all types of events. An event is information about a thing that happened. Events are structured (think JSON-like) and timestamped. Traces, metrics, and logs are therefore different types of events that serve different and important purposes, each contributing to the Observability story. Furthermore, they’re all correlated. Instead of Three Pillars, they’re more like the three strands that make up a braid (shoutout to my teammate Ted Young for this analogy).
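If it helps to picture what “structured, timestamped, and correlated” might look like, here’s a small sketch in plain Python. The `Event` shape and its field names are invented for illustration and are much simpler than any real telemetry data model; the only point is that a span, a metric, and a log can all be the same kind of JSON-like record, braided together by a shared `trace_id`.

```python
import json
import time
from collections import defaultdict
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """A hypothetical, simplified event: structured, timestamped, correlatable."""
    kind: str       # "span", "metric", or "log"
    name: str
    trace_id: str   # shared ID that ties the three signals together
    timestamp: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)


def to_json(event: Event) -> str:
    """Serialize an event to a JSON string (the 'JSON-like' part)."""
    return json.dumps(asdict(event))


def group_by_trace(events):
    """Correlate events by collecting everything that shares a trace_id."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.trace_id].append(event)
    return dict(grouped)


if __name__ == "__main__":
    events = [
        Event("span", "GET /cart", "abc123"),
        Event("metric", "http.server.duration", "abc123", attributes={"ms": 42}),
        Event("log", "cart loaded", "abc123"),
        Event("log", "unrelated request", "def456"),
    ]
    for trace_id, related in group_by_trace(events).items():
        print(trace_id, [e.kind for e in related])
```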
In addition, we now have a common standard for defining and correlating traces, metrics, and logs: OpenTelemetry. Most Observability vendors are all in on OpenTelemetry, which means that it has become the de facto standard for instrumenting code (and also the second most popular CNCF project in terms of contributions). It also means that these vendors all ingest the same data, and it’s how those vendors render the data that differentiates them from one another.
I also want to add that in this Observability story, we place traces front and center, since they help give us that end-to-end picture of what happens when someone does a thing to a system, with metrics and logs serving as supporting actors which add useful details to that picture. And of course, everything is correlated.
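Here’s a toy illustration of that “traces front and center” idea, again in plain Python. To be clear, `ToyTracer` is a made-up stand-in and not the OpenTelemetry SDK; a real project would use an actual instrumentation library. The sketch just shows how nested spans give you the end-to-end picture of one request, with a log line hanging off the span it belongs to.

```python
import time
import uuid
from contextlib import contextmanager


class ToyTracer:
    """A hypothetical, minimal tracer: one trace, nested spans, logs attached to spans."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # every span record, in the order it was started
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            # The parent is whichever span is open when this one starts.
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
            "start": time.time(),
            "logs": [],
        }
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            record["end"] = time.time()
            self._stack.pop()

    def log(self, message):
        """Attach a log message to the innermost open span, if there is one."""
        if self._stack:
            self._stack[-1]["logs"].append(message)


def checkout(tracer):
    """An invented request handler with two nested steps."""
    with tracer.span("checkout"):
        with tracer.span("charge-card"):
            tracer.log("card authorized")
        with tracer.span("send-receipt"):
            pass


if __name__ == "__main__":
    tracer = ToyTracer()
    checkout(tracer)
    for s in tracer.spans:
        print(s["name"], "parent:", s["parent_id"], "logs:", s["logs"])
```

Because every span carries the same `trace_id` and points at its parent, you can rebuild the whole request as a tree, which is the picture that metrics and logs then add detail to.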
Final thoughts
Observability has come a long way from its early days, and Observability 2.0 is the acknowledgement that Observability is evolving, and most importantly, that we’re getting closer and closer to fulfilling the promise of Observability itself.
I can’t wait to see what the future has in store!
Now, please enjoy this photo of my rat Katie, enjoying some hangtime in the pocket of my husband’s bathrobe.
Until next time, peace, love, and code. ✌️👩‍💻