Reliability Restaurant – How to approach software reliability as a mindset
Photograph by Life Of Pix from Pexels.
What is software reliability?
ScienceDirect defines software reliability as follows:
Software reliability is the probability of failure-free operation of a computer program for a specified period in a specified environment. Reliability is a customer-oriented view of software quality. It relates to operation rather than design of the program, and hence it is dynamic rather than static.
Another closely related term is resiliency, as defined by Microsoft:
Resiliency is the ability of a [...] service to withstand certain types of failure and yet remain functional from the customer perspective. In other words, reliability is the outcome and resilience is the way you achieve the outcome.
So for the sake of simplicity, let's discuss reliability without mentioning resiliency explicitly. We can assume it to be implicitly the other side of the coin.
What about a restaurant?
Reliability, like security, has much less to do with using the correct frameworks and libraries than with asking the right questions and having the right mindset. For the sake of discussing the principles rather than specifics, let's use a metaphor of a pancake restaurant rather than some computer program. Everyone knows what a restaurant does, and can imagine what failure-free operation looks like for one. The same mindset can then be applied to software systems.
Why care about reliability?
Looking at the definition of reliability, it seems like the thing to do to achieve "failure-free operation" is to avoid failure at any cost. But anyone who's ever written software knows that even with careful writing, reviewing and testing, bugs still manage to find their way to production regularly. More generally speaking, unexpected things and mistakes of different kinds are always bound to happen regardless of the activity, especially at scale. It's safe to assume that our pancake restaurant will run into some problems and failures in its operations, no matter the amount of planning and trying to avoid them.
If failure is inevitable, why talk about failure-free operation then? The definition of failure matters. Reliability is all about constructing the bigger system so that individual failures can be tolerated and recovered from, without failure to the overall system.
From a business owner's perspective, you want your system to be able to provide business value even if some component fails, instead of the whole system being rendered useless. If something fails, you want to be able to work around it. In the restaurant, if someone drops and breaks a plate – a failure – you don't want to have to close the restaurant as a result. Instead you want to be able to tolerate and work around such failures, keep the doors open, serve paying customers, and have your business running. Likewise you don't want every error to take your whole software system down.
Defining "failure-free operation" for a pancake restaurant
The properties of an operational restaurant are pretty much obvious, but worth listing explicitly for the sake of the example. So let's say that the restaurant must be able to:
- Take in customers and seat them at tables
- Take orders from them
- Prepare the dishes as ordered
- Serve the food to the customers
- Charge money from the customers
Let's keep it simple and limit the requirements to these five things. Looks kind of like a user story, doesn't it?
Now it's fairly easy to map out the best case scenario, or happy path:
- A customer walks into the restaurant and sits at a table 🪑
- The waiter goes and gets their order – pancakes 🥞 – and takes that info to the kitchen
- The chef prepares a plate of delicious pancakes, with all the ingredients readily available 🧑‍🍳
- The waiter takes the pancakes to the customer to eat 🍽
- After dining, the customer orders the check, pays for the food, and leaves happy 😋
To serve the one customer, only one waiter and one chef are needed, so let's say there are no more employees around. All seems good, right?
Deviating from the happy path – remaining operational under unexpected circumstances
The happy path looks simple enough. But what if something different happens? Let's say that Elon Musk tweets about the best pancakes he ever had at your restaurant (the customer was Elon Musk the whole time, bet you didn't see that coming! 😳), and the next day 10,000 people want to come try those pancakes.
Now what if:
- All the people rush in at the same time?
- Someone orders something that the chef doesn't know how to make?
- Some ingredient runs out in the kitchen?
- Some ingredient deliveries get cancelled due to the supplier?
- The waiter or chef is out sick?
- A batch of pancakes falls on the floor in the kitchen, requiring cleaning before cooking can resume?
- People try to pay, but your card reader device isn't working?
It's hard to imagine the happy path working under these circumstances, so let's look at the things one by one.
Overflow of customers
The restaurant can hardly operate in a satisfactory way if people march in and seat themselves uncontrollably, especially when the number of people greatly exceeds the restaurant's seating capacity. What can be done?
A simple reliability mechanism that the restaurant can apply is instructing the customers to wait to be seated, and to queue for that. This way the customers wait at the door or in the lobby in an ordered queue based on when they arrived, and the restaurant (in this example probably the waiter) controls the way and the rate at which they get seated at tables. Some delay can be added to make time for cleaning the tables after the previous customers, resulting in a much better customer experience. This is analogous to throttling and queuing in software.
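As a rough illustration of the queuing idea, here is a minimal Python sketch. The class and method names (`SeatingQueue`, `arrive`, `seat_next`, `table_cleaned`) are made up for this example and are not from any particular library. Arrivals always join a first-in-first-out queue, and the consumer side decides when the next one is taken in.

```python
from collections import deque
from typing import Optional


class SeatingQueue:
    """Customers wait in arrival order; the waiter decides when to seat them."""

    def __init__(self, tables: int) -> None:
        self.free_tables = tables
        self.waiting = deque()

    def arrive(self, customer: str) -> None:
        # Nobody seats themselves: every arrival joins the back of the queue.
        self.waiting.append(customer)

    def seat_next(self) -> Optional[str]:
        # Seat the longest-waiting customer, but only if a table is free.
        if self.free_tables == 0 or not self.waiting:
            return None
        self.free_tables -= 1
        return self.waiting.popleft()

    def table_cleaned(self) -> None:
        # A table becomes available again only after it has been cleaned.
        self.free_tables += 1


queue = SeatingQueue(tables=2)
for name in ["Ann", "Bob", "Cy"]:
    queue.arrive(name)

# Two tables, three customers: the third has to wait until a table is cleaned.
seated = [queue.seat_next(), queue.seat_next(), queue.seat_next()]
queue.table_cleaned()
seated.append(queue.seat_next())
```

The point of the design is that the pace is set by the side doing the work (free, cleaned tables), not by the side producing the load.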
If there are simply too many customers coming in for the restaurant to handle, the waiter can turn them away at the door and ask them to come back later. Not the best customer experience for sure, but this kind of expectation management is better for everyone than seating them and not being able to serve them. By only taking in as many customers as the restaurant can handle, we can make sure to remain operational for those customers that we do take in. This is analogous to rate limiting in software – don't take in more requests than you can properly handle.
How to avoid having to turn people away at the door? Enter table reservations. This way the restaurant can control the number and rate of customers ahead of time, and expectation management is even better for the customers, as it's nicer to be turned away on the phone than after already walking to the restaurant. This is analogous to planned capacity, or scheduling, in software.
Note that it is possible and maybe even desirable to apply these same mechanisms also later on in the flow. For example the kitchen can take in orders as a queue and work on the dishes one at a time to control the work pressure for the chef. In general however it's good to apply these mechanisms as early in the flow as possible. It's better to have the whole restaurant serve a manageable number of customers at once, than to have the dining room overflowing with hungry people while just the kitchen can work at a controlled pace.
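To make the rate limiting analogy a bit more concrete, below is a small sketch of a fixed-window rate limiter in Python. This is only one of several common ways to limit rates, and the names (`FixedWindowRateLimiter`, `allow`) and the numbers are assumptions for illustration. The current time is passed in explicitly so that the behaviour is easy to follow.

```python
class FixedWindowRateLimiter:
    """Admit at most `limit` requests per window of `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        # Over the limit: turn the request away instead of accepting work
        # that cannot be served properly.
        return False


# Hypothetical numbers: at most 3 customers admitted per 60 seconds.
door = FixedWindowRateLimiter(limit=3, window_seconds=60.0)
decisions = [door.allow(now=t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
# The fourth arrival is turned away; the fifth arrives in a new window.
```

In a real service the rejected caller would typically get a clear "come back later" answer rather than silence, which is the same expectation management as at the restaurant door.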
Unexpected orders
Now that we have limited the number of customers to serve at any given time, all seems to be good. But what if someone orders waffles instead of pancakes, and our kitchen lacks the required gear, ingredients or expertise to make waffles?
You can imagine what would happen to the kitchen's ability to produce dishes if the chef needs to start by looking up a recipe and going out to buy the needed ingredients and a waffle iron. Contrast that to having the chef know the recipe already, and having all that's needed for cooking it readily available in the kitchen.
So what's the reliability mechanism that can be applied here? Having a menu and only serving dishes from the menu. This is a way of managing the expectations for the chef and kitchen. It's much easier to prepare the necessary know-how, devices and ingredients when the dishes that can be ordered are listed ahead of time and no surprises can happen. The software analogy here is a bit more vague, but think of allowlisting inputs to limit what kind of requests are taken in for handling.
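A menu-style allowlist could look roughly like the following Python sketch. The menu contents and the names `MENU`, `NotOnMenuError` and `take_order` are hypothetical. The idea is simply that anything not explicitly listed is rejected before it reaches the part of the system that does the work.

```python
# Hypothetical menu: the only inputs the kitchen is prepared to handle.
MENU = frozenset({"pancakes", "coffee", "orange juice"})


class NotOnMenuError(ValueError):
    """Raised when an order asks for something the kitchen cannot make."""


def take_order(dish: str) -> str:
    # Normalize first so " Pancakes " and "pancakes" are treated the same.
    normalized = dish.strip().lower()
    if normalized not in MENU:
        # Reject at the edge, before the request reaches the kitchen.
        raise NotOnMenuError(f"{dish!r} is not on the menu")
    return normalized


accepted = take_order(" Pancakes ")
```

An order for waffles would raise `NotOnMenuError` here, so the waiter can tell the customer right away instead of surprising the chef.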
Ingredients running out
The number of customers and unexpected orders are handled, but what happens if the kitchen runs out of flour during opening hours? Surely that will limit the capability of making pancakes.
In the happy path example it might be good enough to go to the supermarket next door to buy the ingredients for a single serving of pancakes only when the order arrives (the equivalent of lazy evaluation, just in time computation, or on demand fetching in software). Let's make an assumption that this won't take more than a couple of minutes. Not that this sounds like a reasonable thing for a restaurant's kitchen, but it would be doable. It gets less tolerable as the number of customers and the expected throughput of pancakes in the kitchen grow.
At full service it might be catastrophic to the restaurant's operation (remember, being able to serve pancakes to paying customers in reasonable time) to run out of an important ingredient. So it makes sense to stock up with all the necessary ingredients in preparation for a day's work.
The software analogy is again not 100% accurate, but think caching. Instead of having to repeatedly read data from a database or API, or perform a repeated intensive computation, caching the results in memory allows much faster serving of responses back downstream.
Supplier cancelling
Due to various reasons such as limited storage space and ingredients expiring, it's not possible to stock up indefinitely. So regardless of preparations, new ingredients are continuously needed, maybe even during the opening hours.
What happens if the supplier calls to say that their delivery truck has a flat tire, and they cannot deliver the much needed flour? The kitchen will run out in one hour if new flour isn't received somehow.
To solve this problem the restaurant needs some kind of backup. Maybe the chef will make a run to the supermarket next door, as some delay in the kitchen is better than not being able to serve pancakes at all. Or maybe the restaurant has a backup supplier that they can call in a situation like this.
The software analogy is backups, replication and failover, all parts of the wider concept of high availability (HA). If you only have one database instance and it shuts down, your service cannot remain operational. But if you have a system that can automatically fail over to a working database instance, the system can remain operational with minimal downtime. Note that you would need data replication for this to be feasible. Likewise you can run multiple parallel instances of your stateless application servers and balance the load between those, so that one application server failing won't disrupt the system's overall operation.
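A very simplified failover could be sketched as below. All names (`order_with_failover`, `primary_supplier`, `backup_supplier`, `AllSuppliersFailed`) are hypothetical, and real failover for databases involves much more, such as health checks and replication. The sketch only shows the core idea: try the primary first, and move on to a backup when it fails.

```python
from typing import Callable, Sequence


class AllSuppliersFailed(Exception):
    """Raised when neither the primary nor any backup could deliver."""


def order_with_failover(suppliers: Sequence[Callable[[], str]]) -> str:
    # Try each supplier in priority order; the first success wins.
    errors = []
    for supplier in suppliers:
        try:
            return supplier()
        except ConnectionError as exc:
            errors.append(exc)
    raise AllSuppliersFailed(errors)


def primary_supplier() -> str:
    # Simulated outage of the primary (hypothetical).
    raise ConnectionError("delivery truck has a flat tire")


def backup_supplier() -> str:
    return "flour from backup supplier"


delivery = order_with_failover([primary_supplier, backup_supplier])
```

Only `ConnectionError` is treated as a reason to fail over in this sketch. Which errors should trigger a failover is a design decision that depends on the system.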
Waiter or chef being out sick
If there's just one waiter and one chef, either of them being out sick would be a major blocker for the restaurant's operation. Without a waiter no pancakes can be served to customers, and without a chef there are no pancakes to serve. We can safely identify both the waiter and the chef as single points of failure (SPOF) for the restaurant's operation.
The obvious solution is to have multiple waiters and chefs working, or active backups for them that you can call in to work when needed – think high availability. We'll look at the financial implications at the end of this post, but it's obvious that there's a limit to how many waiters and chefs the restaurant can afford to have in rotation. So let's look at other potential solutions as well.
If the waiter is out sick, maybe the restaurant can try a form of self-service instead. Have the customers come make orders at the kitchen counter, and fetch the dishes from the kitchen when they are done, instead of waiting to be served. The payments can also be handled at the kitchen counter. Looks a little like how most fast food restaurants operate.
If the chef is out sick, maybe the restaurant can serve pre-made pancakes that only need heating, which the waiter can handle with a microwave. Definitely not as good as fresh ones, but perhaps better than nothing. Or maybe serve only drinks if pancakes cannot be cooked. Again, not what the customers expect, but maybe better than turning them away at the door, both for thirsty customers and the business.
Both of these examples are naive and not something that would make for a good experience, if the customers expect to be served fresh pancakes directly to the tables. In reality both of these examples are also such that they would require planning and implementation ahead of time to be feasible. But they both represent a way of working that's somehow "less" than the optimal, but still working to some extent, instead of being completely shut down.
The software analogy is graceful degradation, or partial failure. How far this can be implemented depends on the case, but in general you want to limit the blast radius of any single point of failure so that a failure won't render the whole system unusable. If nothing else, you probably want to catch errors and show the users something human readable, instead of blank screens or stack traces. Or maybe you want to show cached data if fresh data cannot be fetched. Or if your website cannot load images, maybe you want to have some placeholders instead.
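The "show cached data or something human readable" idea could be sketched like this in Python. The function `load_menu`, its parameters and the shape of the returned dictionary are assumptions made for the example.

```python
from typing import Callable, List, Optional


def load_menu(fetch_fresh: Callable[[], List[str]],
              last_known: Optional[List[str]]) -> dict:
    try:
        # Best case: fresh data is available.
        return {"items": fetch_fresh(), "degraded": False}
    except Exception:
        # Broad catch on purpose in this sketch: any failure of the
        # dependency should degrade the response, not crash the page.
        if last_known is not None:
            # Second best: stale but usable data.
            return {"items": last_known, "degraded": True}
        # Last resort: a readable message instead of a stack trace.
        return {"items": [], "degraded": True,
                "message": "The menu is temporarily unavailable."}


def broken_backend() -> List[str]:
    # Simulated failing dependency (hypothetical).
    raise TimeoutError("menu service did not respond")


stale = load_menu(broken_backend, last_known=["pancakes"])
empty = load_menu(broken_backend, last_known=None)
```

The `degraded` flag lets the caller decide whether to tell users that they are seeing older data.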
Pancakes falling on the kitchen floor requiring cleaning
No matter how skilled and careful the chef is, accidents can and will happen. During the busiest hour, a fresh batch of pancakes falls on the kitchen floor, causing a big mess. Nothing catastrophic, but those pancakes obviously cannot be served to customers, and the mess needs to be cleaned up so that nobody slips. This means a delay for pancakes being served, and the chef needs to clean the kitchen before continuing cooking – time to recover.
Let's say the chef needs 10 minutes for cleaning. It's best that the waiter doesn't take in any new orders during that time. This way the chef gets to focus on cleaning instead of trying to multitask, and can start cooking again after that is done. As the customers will experience a delay in getting food, maybe the waiter can serve them drinks in the meantime instead. Once the 10 minutes needed for cleaning have passed, the waiter can check if the chef is ready, and if so, start taking in orders again.
The same tactic can be applied also if the kitchen gets a large batch of orders that will take a while to cook – maybe it's better not to take in more orders before those already placed have been finished. Easing the built-up pressure in the short term can be a lot more effective than constantly working under high pressure. Sometimes it's necessary to get some time dedicated fully for recovery.
This is analogous to circuit breakers in software. In this example the waiter represents a downstream client, and the chef an upstream server. The pattern is about stopping the client (waiter) from making new requests (orders) to the server (chef) when a certain error condition (needing to clean the floor) is met. In circuit breaker terminology the circuit "opens" when requests are blocked, and "closes" again once they are allowed through.
Note that a circuit breaker is implemented specifically on the client's side. So in this example, if the chef has told the waiter they need 10 minutes of downtime for cleaning up, the waiter will not go to the kitchen counter with new orders for the next 10 minutes at all, but knows to wait. If a customer wants to place an order, the waiter can instantly inform them of the delay without asking the chef, and can offer drinks instead, promising to come back to the food subject in 10 minutes.
Compare this to server side patterns such as timeouts and queues, where the waiter would go to the kitchen counter for every new customer order, and be told to wait by the chef. The advantage of the circuit breaker pattern is that the chef doesn't need to respond to the queries for new orders at all during the 10 minutes, and can fully focus on recovering the situation, that is, cleaning the mess in this example.
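Below is a minimal, single-threaded sketch of a client-side circuit breaker in Python. The names (`CircuitBreaker`, `CircuitOpenError`) and the thresholds are assumptions for illustration, and the clock is injectable so the example doesn't have to wait for real time to pass. Production-grade implementations usually add thread safety and more careful handling of the trial ("half-open") state.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the server."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, recovery_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                # Open: fail fast on the client side, the server is left alone.
                raise CircuitOpenError("kitchen is recovering, try again later")
            # Recovery time has passed: let a trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demo with a fake clock (hypothetical values): one failure opens the
# circuit for 600 seconds, that is, the 10 minutes of cleaning.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=1, recovery_seconds=600.0,
                         clock=lambda: now[0])


def kitchen_with_mess() -> str:
    raise RuntimeError("pancakes on the floor")


try:
    breaker.call(kitchen_with_mess)
except RuntimeError:
    pass  # the failure is recorded and the circuit is now open
```

While the circuit is open, `breaker.call(...)` raises `CircuitOpenError` immediately without invoking the function at all, which is the waiter not even walking to the kitchen counter.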
Payments not working
Everything has gone well for the customers: the mood was on point, service top notch and pancakes absolutely delicious. All that's left is paying the bill, but alas, the card reader doesn't work.
While the customers might enjoy getting their food for free, that would certainly be undesirable for the restaurant. Remember, the happy path requires being able to charge money from the customers, otherwise the restaurant is hardly a feasible business.
As for the actual card reader device malfunctioning, maybe you can have a backup reader available. Or if the problem is batteries running out, at least have spare batteries at hand. High availability again.
But maybe the problem is at the card processor's end, meaning that no amount of actions from the restaurant's side could avoid or solve that failure. In such a case you most likely want to have some backup way of collecting payments from the customers. Maybe you can take payments in cash, or even open a tab if you trust the customers to come back later to pay the bill.
In this example the card processor is a dependency, and specifically an external dependency. External, meaning that you cannot affect the reliability of the dependency itself, only how you build your system to depend on it. In this case you can mitigate the risk of the dependency failing by introducing backup mechanisms (cash or tabs). The same concepts of high availability apply as described earlier.
Prepare for the unexpected... by expecting it
When something unexpected happens, the restaurant staff can creatively improvise and adapt to the situation, work around it, and keep the business running. Computers however, despite some promising developments, tend to be less capable of creativity or improvising, and more capable of following precise orders, given as the source code and other artefacts by the programmers. So the burden of creativity must remain mostly on the programmer.
In order for your software to be able to handle unexpected situations, you have to build it as such. And in order to be able to build in the necessary reliability mechanisms for different situations, you must be able to creatively think about the different situations that might occur. Instead of trying to avoid failures altogether, build your system to be able to tolerate failures and recover from them. Avoid single points of failure, reduce the blast radius of any potential pieces failing, and don't forget high availability where it makes sense. Keep the happy path in mind.
I hope that the restaurant metaphor can help to think about reliability holistically when designing your software systems.
Reliability is a spectrum, SRE makes sense of the spectrum
Reliability is not a binary thing that you either have or don't, but rather a spectrum. It's largely a business decision to decide how reliable your system should be. A pacemaker can't really afford to fail, nor can a plane's in-flight systems, but most systems don't work under such high pressure.
Maybe it makes more financial sense to occasionally close our restaurant for a day than to keep extra staff on the payroll. Or if there is high demand for pancakes, maybe it makes sense to expand to a second location, creating a high availability setup for the whole business. Now if one restaurant needs to be closed due to, say, a fire in the kitchen, the business as a whole can still operate, as the other restaurant can serve customers.
The Site Reliability Engineering (SRE) methodology aims to make reliability an explicit goal for the business, and puts emphasis on the "reliable enough" part. When explicit Service Level Objectives (SLOs) are set, those can be used to drive the development efforts in a way that's reliable enough for the business, but not more than that. As discussed it doesn't make sense to invest more money in reliability than it, well, makes sense.
For our pancake restaurant, we defined earlier that we want to be able to serve our customers their pancakes in a reasonable time. Let's make "reasonable time" explicit and set an SLO: we want 99% of our customers to receive their pancakes within 15 minutes of ordering. Let's track this by measuring the time from ordering to the pancakes being served for every order.
If we are lagging badly behind our target, we need to do something about it (or re-evaluate the target itself, but let's assume the target is good for now). We can speed up the orders in many ways, for example by having two chefs and two waiters working at all times. If this brings our measurements up to the SLO target, we know that this is good enough. By employing even more chefs and waiters we might be able to make the orders even faster, but since we have reached our reliability target (and assuming our target is sensible), it doesn't make sense to invest more.
Wrapping up
Building a reliable system is not about following a step by step guide, but rather about being able to think creatively about your system and the different conditions it can face. There's no clear, generalizable one-to-one mapping between the examples given here and your system, but with creative thinking you should be able to identify the "chefs and waiters" of your system, and more. The examples also serve as direction on how to approach applying the various kinds of reliability mechanisms.
Hopefully this helps you build reliable and resilient systems.