Reliability in Legacy Software

Published at: 3/18/2024
Categories: legacy, reliability, sre
Author: jfhbrook

Consider this hypothetical situation:

You have a service. It's a high-traffic service that most requests to your application touch. For example, maybe it involves user logins and sessions.

Because your organization takes reliability seriously, your service has SLOs. One SLO is on round-trip response time - after all, it represents an overhead on most requests. Another is uptime - after all, if this service goes down, so does everything else. These SLOs meet high standards, and are healthy.
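
As a rough illustration of what "meeting its SLOs" could mean for a service like this, here is a minimal sketch in Python. Everything in it is hypothetical: the `Request` type, the `check_slos` function, and the targets (99th-percentile latency under 50 ms, 99.9% of requests succeeding) are assumptions made for the example, and uptime is approximated by the fraction of successful requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # round-trip response time of one request
    ok: bool           # True if the request was served successfully


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slos(requests, latency_target_ms=50.0, latency_pct=99,
               availability_target=0.999):
    """Compare observed latency and availability against hypothetical targets."""
    observed_latency = percentile([r.latency_ms for r in requests], latency_pct)
    # Assumption: availability is measured as the share of successful requests.
    availability = sum(r.ok for r in requests) / len(requests)
    return {
        "latency_ms": observed_latency,
        "latency_ok": observed_latency <= latency_target_ms,
        "availability": availability,
        "availability_ok": availability >= availability_target,
    }
```

Note what a check like this measures: speed and availability, and nothing else. It says nothing about whether the service's behavior is what anyone currently wants from it.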

But this service is also legacy software. As suggested previously, the team which owns the service doesn't have a good understanding of the service, and therefore struggles to make changes.

But in fact, suppose the owning team isn't interested in learning about the software or making the codebase mobile. Perhaps they dream of sunsetting the service in favor of a full rewrite. Maybe that rewrite is just a dream - or if it's in-flight, it's stuck in a quagmire of endless redesign or excess ambition. But they don't feel any particular urgency. They may feel the rewrite is worth "getting right", or that there are bigger fish to fry.

But regardless of their reasons, they'll note that the service is easily meeting its SLOs. It was written in a highly performant, if idiosyncratic, language, and uses patterns which give it a high level of resilience and the ability to recover from many situations automatically. The service is steady as a rock, and left to its own devices will more or less chug along indefinitely once deployed.

However, it must be noted that the service doesn't do what the product team wishes it did. Its behavior made sense years ago, when the service was living and breathing. But as it calcified into its current form, it stopped keeping up with new requirements. The owning team justifies this by insisting that these requirements aren't really requirements (just nice-to-haves) or that the full rewrite will meet those requirements - and more. Even as these contradictions steadily heighten, the team insists that it's worth waiting for the full rewrite.

My question to you is this: Is this service reliable?

The answer is not clear-cut. If you define reliability solely in terms of providing what's expected of it, and the SLOs are an accurate representation of "providing", then this sure seems to be the case. After all, the service is meeting its SLOs handily.

But is the service really providing what's expected? In a sense, it is - it's expected to do what it was originally written to do years ago. But SREs are supposed to be sensitive to product requirements. After all, if a service isn't available, then it by definition can't be delivering product value. Moreover, we're meant to choose SLOs which balance reliability targets with velocity.

I would suggest that reliability means not just "working", but doing so at velocity. Delivering product value doesn't just mean being available. It also means being able to change, in order to add new features, change old features, or otherwise meet the current needs of the product. But this may be a matter of semantics. If reliability does simply mean "working", then the argument for velocity still stands - it simply means that reliability isn't the entire picture.

Regardless of whether or not velocity is a requirement for reliability, I would encourage practitioners of software reliability to treat the discipline holistically. Reliability is surely important for software, especially as the product becomes ever more complex and is depended on by more and more customers with increasingly steep reliability requirements. But the product needs to evolve with the needs of these customers and the business as well.

A service which can't change is one which can't resolve its internal tensions. Left on their own, those tensions will inevitably become worse with time, regardless of whether or not the service is stable. The solution is to invest in making the codebase mobile, and to believe that better things truly are possible.
