Software SLA, SLOs and SLIs

Published at
4/3/2021
Categories
observability
sla
slo
sli
Author
kylefoo

In today's world, people expect a lot from both free and paid software services: speed, uptime, and a useful UX. Your user base therefore has the right to understand your SaaS availability, quality, and response plans in case disaster strikes. No one likes a dispute, but a Service Level Agreement provides cover when something goes wrong. Moreover, with system observability in place, the derived service metrics can be used as a baseline for setting higher service excellence targets or OKRs.

Now, let's talk briefly about what software SLAs, SLOs, and SLIs are.

SLA

An SLA (service level agreement) describes what must happen if an SLO is not met. Generally, it is a legal agreement between a provider and its customers, and it might even include terms of compensation.

Example:

If the service does not provide 99% availability over 1 month, the service provider compensates the customer for every minute out of compliance.

SLA = SLOs + Written & Signed Consequences

See AWS S3's SLA for instance: https://aws.amazon.com/s3/sla/
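
To make the 99% example above concrete, here is a minimal Python sketch of the arithmetic. It assumes a 30-day month and reads "out of compliance" as downtime beyond the allowance; both are assumptions for illustration, not terms from any real SLA, and the function names are hypothetical.

```python
# Assumption: a 30-day month; real SLAs define their own billing period.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(target: float,
                             period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Downtime the provider can accrue before breaching the availability target."""
    return (1.0 - target) * period_minutes


def minutes_out_of_compliance(downtime_minutes: float,
                              target: float,
                              period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Downtime beyond the allowance (assumed here to be what gets compensated)."""
    allowance = allowed_downtime_minutes(target, period_minutes)
    return max(0.0, downtime_minutes - allowance)


if __name__ == "__main__":
    # Rounded for display because the inputs are floats.
    print(round(allowed_downtime_minutes(0.99), 2))
    print(round(minutes_out_of_compliance(500, 0.99), 2))
```

Under these assumptions, a 99% monthly target leaves roughly 432 minutes (about 7.2 hours) of downtime before compensation would start.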

SLO

An SLO (service level objective) is a scoped objective that the engineering team must hit in order to meet the agreement. Here are some considerations when setting one:

  • Identify key metrics (service level indicators — SLIs) from the user perspective, such as availability and latency.

  • Make it measurable, such as 300 ms latency.

  • Allow some room (an error budget), such as 300 ms for 99% of the time.

  • Be clear about what you promise, for example: 99% of the time (averaged over 1 month), HTTP calls that return status 200 complete in under 300 ms.
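
The error budget idea in the list above can be sketched as a simple count: with a 99% target, 1% of requests in the window may miss the objective. The snippet below is an illustrative sketch only; the function names and the sample request counts are made up, and it uses `Fraction` so the budget is computed exactly rather than with floating point.

```python
from fractions import Fraction
from math import floor


def error_budget(total_requests: int, target: Fraction) -> int:
    """Number of requests that may miss the objective within the window."""
    return floor(total_requests * (1 - target))


def budget_remaining(total_requests: int, bad_requests: int, target: Fraction) -> int:
    """Budget left after the bad requests seen so far (negative means overspent)."""
    return error_budget(total_requests, target) - bad_requests


if __name__ == "__main__":
    target = Fraction(99, 100)  # the 99% target from the example
    print(error_budget(1_000_000, target))             # 10000
    print(budget_remaining(1_000_000, 2_500, target))  # 7500
```

A negative remaining budget is a signal that the objective for the window has already been missed.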

Example, combining the 2 SLIs:

Service responses shall be available 99% of the time and faster than 300 ms for 99% of all valid requests, measured over 1 month.

SLO = Availability SLI + Satisfying Latency SLI 
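
As a rough sketch of how the combined objective might be checked, the snippet below treats the SLO as met only when both SLIs reach their targets. The `Slo` class and `slo_met` function are hypothetical names for illustration; the latency SLI is assumed to be the fraction of valid requests faster than 300 ms over the month.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Targets from the example: 99% available, 99% of requests under 300 ms."""
    availability_target: float = 0.99
    latency_target: float = 0.99


def slo_met(availability_sli: float, latency_sli: float, slo: Slo = Slo()) -> bool:
    """The SLO holds only if both indicators reach their targets."""
    return (availability_sli >= slo.availability_target
            and latency_sli >= slo.latency_target)


if __name__ == "__main__":
    print(slo_met(0.995, 0.992))  # True: both SLIs are at or above 99%
    print(slo_met(0.995, 0.980))  # False: the latency SLI is below 99%
```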

SLI

An SLI (service level indicator) is a carefully defined, measurable performance metric, usually an aggregation of events.

Given the agreed-upon SLO we promised to users, we measure multiple service indicators that contribute to user happiness while they use our app.

Example: possible definitions of SLIs for the “search” interaction might be as follows:

[Image: table of example SLI definitions for the “search” interaction]

Notice that we only cover the Availability and Latency types of measurable SLIs here; there are others, depending on the nature of your endpoints. See: https://sre.google/workbook/implementing-slos/#slis-for-different-types-of-services
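
Since an SLI is usually an aggregation of events, here is a minimal sketch of how availability and latency SLIs might be computed from request records. The `Request` type, the "status below 500 counts as available" rule, and the sample data are assumptions for illustration; your own definitions of good and valid events may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests not answered with a server error (assumed: status < 500)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of status-200 requests that completed under the threshold."""
    ok = [r for r in requests if r.status == 200]
    if not ok:
        return 1.0  # assumption: nothing to measure counts as meeting the target
    fast = sum(1 for r in ok if r.latency_ms < threshold_ms)
    return fast / len(ok)


if __name__ == "__main__":
    sample = [Request(200, 120.0), Request(200, 450.0),
              Request(500, 30.0), Request(200, 280.0)]
    print(availability_sli(sample))  # 0.75 (3 of 4 requests were not server errors)
    print(latency_sli(sample))       # 2 of the 3 successful requests were under 300 ms
```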

Summary

Measuring your software quality is an ongoing process, simply because your software evolves over time. Find time to sit down with the stakeholders at your company to go over the numbers, know where you are with the metrics, and explore how to improve those indicators. Instead of pointing fingers when things go south, why not be data-driven from the start, making sure the software stays reliable while engineers continuously ship features?

References:

https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli

https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos
