Error Budgets and their Dependencies
Does your team struggle with an unbalanced error budget that impacts your reliability and pace of innovation? In his latest blog, Adam Hammond talks about error budgets - which account for the planned and unplanned outages your systems may encounter - and how teams can calculate them effectively.
In our last few articles, we've discussed SLOs and how picking them correctly can make or break your application's performance. Today we're going to cover error budgets, which are used to account for planned and unplanned outages that your systems may encounter. In essence, error budgets exist to cover you when your systems fail and to allow time for upgrades and feature improvement. No system can be expected to be 100% performant, and even if it were, you would still need time available for maintenance. Activities like major database version upgrades can cause significant downtime when they occur. Error budgets allow you to plan ahead and set aside time for your team to manage their services, while providing customers with lead time so that they can plan for the downstream impacts of your service going offline.
An introduction to service calculations
There is an easy trap to fall into when it comes to determining your error budgets. Calculating them - as with everything in process improvement - is a journey. Most people would say "well, my error budget is simply the time left over once my SLO is taken away", and for them that formula might look like this:
Error Budget = 100% - Service SLO
However, this is incorrect: it is "starting at the end". This is certainly your aspirational error budget, but it doesn't take into account your service's current performance or the current state of its error budget. The initial equation for your error budget is as follows:
Error Budget = Projected Downtime + Projected Maintenance
If you remember from our previous article on SLOs, we need to do a lot of research into factors like how performant our customers expect our system to be; another part of that is understanding maintenance and existing application error rates. The projections will most likely track very closely to your past performance, unless your service's performance has been widely variable. When you first define your error budget, it is acceptable to baseline it against what your service can currently provide. If you can only deliver an SLO of 85%, there is no point promising 90%. However, once you have established your baseline error budget, you must never allow it to move below your starting point. Error budgets decrease; they do not increase.

The first port of call for most organisations when implementing their error budget is to focus on maintenance, as this usually gives the best "bang for buck": there are usually processes that can be improved or better software versions to be installed. This is where your SRE teams come in, helping to deliver streamlined, automated, and focused software pipelines that minimise application downtime. Move away from manual, labour-intensive processes and toward single-click developer experiences to minimise intentional error budget usage.
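To make the baseline calculation concrete, here is a minimal Python sketch, assuming hypothetical projections measured in minutes per month (the figures and variable names are illustrative, not taken from any real service):

```python
# Hypothetical monthly projections, in minutes (all figures are illustrative).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

projected_downtime_min = 132      # projected unexpected downtime, from past performance
projected_maintenance_min = 300   # projected scheduled maintenance windows

# Error Budget = Projected Downtime + Projected Maintenance,
# expressed as a percentage of the total time in the month.
error_budget_pct = 100 * (projected_downtime_min + projected_maintenance_min) / MINUTES_PER_MONTH

print(f"Baseline error budget: {error_budget_pct:.2f}% of the month")  # -> 1.00%
```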
The point of error budgets is to allow you to focus where your product improvement hours are spent. You can implement new features if you have not utilised your error budget, should consider service improvement if it is nearly consumed, and absolutely must focus all resources on stabilising your service if your error budget is in deficit. Ultimately, an error budget is designed to help you understand where to focus your engineering resources to ensure your SLOs are met. The final stage of our error budget baseline is to compare it against the SLO that we intend to maintain for our service. We can do this by simply reverting to our calculation from the beginning:
Expected Service SLO = 100% - Error Budget
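Continuing the sketch above with the same hypothetical figures, this comparison is a single subtraction:

```python
# Expected Service SLO = 100% - Error Budget, using the 1.00% baseline
# error budget from the sketch above (figures remain hypothetical).
error_budget_pct = 1.0
expected_slo_pct = 100 - error_budget_pct  # 99.00%

target_slo_pct = 99.5  # hypothetical SLO we intend to maintain
if expected_slo_pct < target_slo_pct:
    print("Baseline cannot support the target SLO: service improvement required")
```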
It is at this point that you can determine the immediate direction you need to take in regards to service improvement: if your error budget is running higher than expected, you should focus on reducing it. Once you've completed any initial service improvements needed to bring your error budget into line, you can finally use the "simple" calculation to determine your error budget:
Error Budget = 100% - Service SLO
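The guidance from earlier (ship new features, invest in service improvement, or stop and stabilise) can then be expressed as a simple policy. A minimal sketch, with the threshold chosen purely for illustration:

```python
def engineering_focus(budget_pct: float, consumed_pct: float) -> str:
    """Suggest where to spend engineering hours, given how much of the
    error budget has been consumed. The 25% threshold is illustrative."""
    remaining = budget_pct - consumed_pct
    if remaining < 0:
        return "stabilise: budget in deficit, focus all resources on reliability"
    if remaining < 0.25 * budget_pct:
        return "service improvement: budget nearly consumed"
    return "new features: budget still available"

print(engineering_focus(budget_pct=1.0, consumed_pct=1.2))
# -> stabilise: budget in deficit, focus all resources on reliability
```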
The important thing to note is that things like customer expectations serve as minimums in terms of SLOs, so we don't include them in our initial calculations. At the beginning of our error budget journey, we are establishing our current state, and in many cases it will be less than the desired target. Another key aspect to keep in mind is that if our SLO performance ever drops below the minimum, we need to reduce our actual error budget via service improvement as soon as possible.
What is downtime, really?
In our calculations, we separated our downtime into two categories: unexpected and maintenance. To properly calculate our error budgets, we need a definition of what "downtime" is in general, and then we need to differentiate between the two categories. For our purposes, a suitable definition of downtime is "systems are not in a state to meet the required metric". This specifically targets the SLO and its associated metric.
We then further define our two categories, with "maintenance downtime" being "downtime caused by an intentional disruption due to system maintenance" and "unexpected downtime" simply being "all other downtime". We differentiate between these two types of downtime not specifically to build the error budget, but to guide how we can improve them. For example, if we want to reduce maintenance downtime we need process improvement, but if we want to reduce unexpected downtime we probably need to fix bugs or errors within our services. These categories provide strategic guidance on where to look for potential error budget savings when we need to deliver better service to our customers.
Calculating our error budgets
Now that we have all of our required definitions and formulas, it's a simple process to actually calculate our error budgets. In fact, a quick visit to our maintenance procedures and our metrics dashboard should suffice:
- Determine our total downtime by retrieving our current monthly error rates from our metric dashboards.
- Find out how much downtime is scheduled for our maintenance each month.
- Calculate our unexpected downtime amount by subtracting scheduled downtime from actual error rates.
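As a rough sketch of those three steps in Python, assuming your dashboards can export monthly downtime figures (all names and numbers here are hypothetical):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Step 1: total downtime for the month, from the metrics dashboard.
total_downtime_min = 510.0        # hypothetical export from monitoring

# Step 2: scheduled maintenance downtime, from the maintenance calendar.
maintenance_downtime_min = 360.0  # hypothetical: nightly restarts, patching

# Step 3: unexpected downtime is whatever remains.
unexpected_downtime_min = total_downtime_min - maintenance_downtime_min

for label, minutes in [("total", total_downtime_min),
                       ("maintenance", maintenance_downtime_min),
                       ("unexpected", unexpected_downtime_min)]:
    pct = 100 * minutes / MINUTES_PER_MONTH
    print(f"{label} downtime: {minutes:.0f} min ({pct:.2f}% of the month)")
```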
Now we have our three metrics: total downtime, maintenance downtime, and unexpected downtime. Let's return to Bill Palmer at Acme Interfaces, Inc. for a practical look at how effective error budgets can be, and how we can use all of this information to calculate them appropriately.
"Help, Bill! The system is too slow!"
Bill Palmer sat at his desk, exasperated. Acme Interfaces had been putting off their database upgrade for years, and today he had received an email from their cloud provider advising that the database would be forcibly upgraded if no action was taken in the next four months. Coming in at 15 TB and feeding into over 500 interfaces, the database was at the heart of the business. As part of the upgrade, everything would need to be tested along with the upgrade itself. Bill needed hours of downtime for the upgrade, but it looked like Acme Interfaces was already going over their error budget by a few minutes every month. Now that their cloud provider had forced their hand, something needed to be done.
He pulled up an Excel spreadsheet with service metrics and began looking for a quick win.
Within a few minutes, he'd found what looked like the root of their error budget deficit. Looking at the error reporting for HTTP requests over the last year at Acme Interfaces, relatively simple requests were returning HTTP 50X errors in quite large volumes and for unknown reasons. He'd made a promise to Dan that he'd get the error rate below 10% to bring the error budget back into surplus for the upgrade; it was time to get to work. He looked at the detailed statistics and noticed that about half of the errors were 502s and 503s, and the other half were 500s. He just didn't understand how there could be so many gateway errors.
He picked up his phone and dialled the NOC.
"Ring, ring… Ring, ring… Hello, Acme NOC, this is Charlie."
"G'day Charlie, this is Bill, the CTO. Do you have a few minutes to discuss some statistics I'm reviewing?"
"Sure, Bill."
"Excellent. I'm just taking a look at our HTTP error codes for the past year, and for some reason we return a lot of bad gateway and service unavailable errors. Do you know why that would be the case?"
"Sure do, Bill. Our load balancer software is on a really old version. It's got a bug where it hits a memory leak and can't parse responses back from the backend servers; that's what throws the 502s. After a few minutes the server will restart, but because it's our load balancer we can't easily take it out of service, so we return 503s in the meantime. We used to have to manually restart the servers, but we implemented a script that checks for health and can reboot within a few minutes."
Bill paused for a few moments. "...Is there a reason why the infrastructure team hasn't upgraded to a new version of the load balancer?"
"Well, that's the problem: we don't really have anyone dedicated to the load balancers. They were set up a few years ago as part of a project, and now the NOC just fixes them up when they go a bit crazy. The vendor has confirmed that the newer version of the software doesn't have the bug, but we just don't have the expertise to manage that at the moment. We also restart them all at night, which takes about an hour and would also cause 503s."
"Okay, well, thanks for the information, Charlie. I'll see what we can do." Click.
Bill started to write up all the information he had gained from the phone call with Charlie.
After he was done, he called Jenny.
"Jenny, can you do me a favour and find out how much a System Administration course for our Load Balancing software would cost, please?"
"Sure, is this about those HTTP errors?"
"You know it!"
Bill continued to look at the whiteboard. He knew the fastest way to improve performance would be to bring the load balancer up to scratch and get the NOC team up-skilled to handle these systems. They'd been improving these systems despite having no official training, so they were clearly great operators.
Bill's phone rang. "Bill, it's Jenny. I just got off the phone with them, and they said they could do a 20% discount on the training for a group larger than 10 people, and that it would be $10,000 a head."
"Okay, get back to them and book two sessions of 15 people each. I want the whole NOC up-skilled on the Load Balancer immediately. Draw up a project proposal for shift-left knowledge transfer from some of the application teams, as well as SRE development for the NOC team. Their skills are wasted waiting for fires to break out; I know they can get this environment up to where it needs to be."
"Sounds good, I'll get onto it now!"
Bill surveyed the room, taking in the hundreds of leaders from across Acme Interfaces, as he prepared to talk about his team's development over the last six months.
"Hi everyone, I'm sure most of you know me by now, but I'm Bill, the new-ish CTO. Today I'm going to be talking about how we were able to eliminate a major barrier to our database upgrade by analysing and refining the error budgets for our HTTP requests."
"Six months ago, we were seeing an error rate on HTTP requests of up to 15% per month, well above our expected error budget of 10%. About 5% of these were caused by application errors, but 8.5% were being seen at the load balancer and were due to availability issues. We wanted our error budget to be 10% or fewer request errors, but we were tracking 5% above that. We had to improve something if we wanted to meet that target."
"I got onto the NOC and spoke with Charlie, who enlightened me about some issues we were having with our load balancer: it hadn't been updated for a few years, and a bug was causing all of these errors. Further exacerbating the issue, no one at the company had the skills to actually upgrade the load balancer, so that wasn't an immediate option."
"Jenny got onto the vendor and arranged training for the entire NOC. Within three weeks they were all skilled up, and we began our project to upgrade the load balancers. With everyone trained, it took only two weeks to upgrade all of the servers, and we were able to do this during downtime that was previously reserved for maintenance (otherwise known as restarting servers because of the bug). We've also begun transitioning all of the existing NOC operators to new SRE-based roles that will allow them to assume greater responsibility for the improvement of our core infrastructure."
"Within two months of defining our current-state error budget, we had used it to identify where our issues were coming from and resolved those issues, and now we've been able to meet (and exceed) our target of less than 10% HTTP request errors. We've also used the experience to refine our NOC and give our staff greater responsibility."
"I'd heartily recommend everyone take a look at the internal error budgets you are responsible for, as I am very sure it can only have positive outcomes for the business. Thanks for attending my session, and I hope the rest of the retreat goes well."
Squadcast is an incident management tool that's purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using virtual incident war rooms, and use automated tools like runbooks to eliminate toil.