
Reliability in Legacy Software

Published at

3/18/2024

Author

jfhbrook

Consider this hypothetical situation:

You have a service. It's a high-traffic service that most requests to your application touch. For example, maybe it handles user logins and sessions.

Because your organization takes reliability seriously, your service has SLOs. One SLO is on round-trip response time - after all, it represents an overhead on most requests. Another is uptime - after all, if this service goes down, so does everything else. These SLOs meet high standards, and are healthy.
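To make the two SLOs concrete, here is a minimal sketch of how they might be evaluated over a window of requests. Everything in it is an assumption for illustration: the `Request` record, the `slo_report` function, the 200 ms threshold, and the 99% and 99.9% goals are hypothetical, and uptime is approximated as the fraction of requests that succeed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def slo_report(requests, latency_target_ms=200.0,
               latency_goal=0.99, availability_goal=0.999):
    """Compare latency and availability indicators against their goals.

    All thresholds are illustrative defaults, not values from the article.
    """
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")

    # Availability: share of requests that succeeded at all.
    succeeded = sum(1 for r in requests if r.ok)
    # Latency: share of requests that succeeded within the target time.
    fast = sum(1 for r in requests
               if r.ok and r.latency_ms <= latency_target_ms)

    availability_sli = succeeded / total
    latency_sli = fast / total
    return {
        "availability_sli": availability_sli,
        "latency_sli": latency_sli,
        "availability_met": availability_sli >= availability_goal,
        "latency_met": latency_sli >= latency_goal,
    }
```

Calling `slo_report` on a window where every request succeeds quickly reports both SLOs as met; a window with many failures reports both as missed.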

But this service is also legacy software. As suggested previously, the team which owns the service doesn't have a good understanding of the service, and therefore struggles to make changes.

But in fact, suppose the owning team isn't interested in learning about the software or making the codebase mobile. Perhaps they dream of sunsetting the service in favor of a full rewrite. Maybe that rewrite is just a dream - or if it's in-flight, it's stuck in a quagmire of endless redesign or excess ambition. But they don't feel any particular urgency. They may feel the rewrite is worth "getting right", or that there are bigger fish to fry.

But regardless of their reasons, they'll note that the service is easily meeting its SLOs. It was written in a highly performant, if idiosyncratic, language, and uses patterns which make it highly resilient and able to recover from many failures automatically. The service is steady as a rock, and left to its own devices will more or less chug along indefinitely once deployed.

However, it must be noted that the service doesn't do what the product team wishes it did. Its behavior made sense years ago, when the service was living and breathing. But as it calcified into its current form, it stopped keeping up with new requirements. The team justifies this by insisting that these requirements aren't really requirements (just nice-to-haves), or that the full rewrite will meet those requirements - and more. Even as these contradictions keep sharpening, the team insists that it's worth waiting for the full rewrite.

My question to you is this: Is this service reliable?

The answer is not clear-cut. If you define reliability solely in terms of providing what's expected of the service, and the SLOs are an accurate representation of "providing", then the service sure seems to be reliable. After all, it's meeting its SLOs handily.

But is the service really providing what's expected? In a sense, it is - it's expected to do what it was originally written to do years ago. But SREs are supposed to be sensitive to product requirements. After all, if a service isn't available, then it by definition can't be delivering product value. Moreover, we're meant to choose SLOs which balance reliability targets with velocity.

I would suggest that reliability means not just "working", but doing so at velocity. Delivering product value doesn't just mean being available. It also means being able to change, in order to add new features, change old features, or otherwise meet the current needs of the product. But this may be a matter of semantics. If reliability does simply mean "working", then the argument for velocity still stands - it simply means that reliability isn't the entire picture.

Regardless of whether or not velocity is a requirement for reliability, I would encourage practitioners of software reliability to treat the discipline holistically. Reliability is surely important for software, especially as the product becomes ever more complex and depended on by more and more customers with increasingly steep reliability requirements. But the product needs to evolve with the needs of these customers and the business as well.

A service which can't change is one which can't resolve its internal tensions. Left on their own, those tensions will inevitably become worse with time, regardless of whether or not the service is stable. The solution is to invest in making the codebase mobile, and to believe that better things truly are possible.
