SRE book notes: Introduction to Site Reliability Engineering

Published at: 1/10/2023
Categories: sre, books, notes, reliability
Author: bitmaybewise

Encouraged by my manager at GitLab, Rachel Nienaber, I’m taking notes on the book Site Reliability Engineering: How Google Runs Production Systems. I’ve decided to share the quotes I find most interesting here, along with occasional comments on my own thoughts and perspectives.

This is the first post of a series, so stay tuned. You’re welcome to join in via the comments; I’d love to know your thoughts.

Without further ado, here are the notes from the first chapters:


when systems are “reliable enough,” we instead invest our efforts in adding features or building new products.


even though a small organization has many pressing concerns and the software choices you make may differ from those Google made, it’s still worth putting lightweight reliability support in place early on, because it’s less costly to expand a structure later on than it is to introduce one that is not present.


the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.

In my experience, companies seldom worry about maintenance, whether that means the quality of their systems or the cost of keeping everything running.

Do they need a cultural shift? Someone to defy the status quo? Better-prepared professionals? More knowledge? Courage? A bit of all of the above?


please bear the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it.


Hope is not a strategy!

I love this one!

As Murphy’s law states: “If anything can go wrong, it will.”


SRE is what happens when you ask a software engineer to design an operations team.

In practice, SREs are also engineers: they don’t just maintain systems and keep them running, they also build them.

More below.


By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.

Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc.

Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.
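To make the 50% cap concrete, here is a minimal sketch of how a team might check its own split between ops work and development. The category names, the weekly hours, and the helper functions are all hypothetical, made up for illustration; the book does not prescribe any particular way of tracking this.

```python
OPS_CAP = 0.5  # the 50% cap on aggregate "ops" work quoted above

# Hypothetical labels for what counts as "ops" work on this team.
OPS_CATEGORIES = {"tickets", "on_call", "manual_tasks"}


def ops_fraction(hours_by_category: dict[str, float], ops_categories: set[str]) -> float:
    """Return the share of total tracked hours that went to ops work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing tracked, so nothing to report
    ops = sum(h for name, h in hours_by_category.items() if name in ops_categories)
    return ops / total


def exceeds_ops_cap(hours_by_category: dict[str, float], ops_categories: set[str]) -> bool:
    """True when ops work takes more than the capped share of the team's time."""
    return ops_fraction(hours_by_category, ops_categories) > OPS_CAP


# Made-up numbers for one week of one team.
week = {"tickets": 10, "on_call": 8, "manual_tasks": 6, "development": 16}
print(f"ops share: {ops_fraction(week, OPS_CATEGORIES):.0%}")
print("over the cap" if exceeds_ops_cap(week, OPS_CATEGORIES) else "within the cap")
```

In this made-up week, 24 of 40 hours went to ops work, which is 60% and therefore over the cap.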


In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.

Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
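To get a feel for how small that last 0.001% is, the availability targets can be turned into a yearly downtime budget. The sketch below is my own arithmetic, not something from the book, and it assumes a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Under that assumption, 99.999% availability leaves room for about 5.26 minutes of downtime per year, while 100% leaves none at all.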


Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
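As a rough illustration of software doing the interpreting, the sketch below routes a signal to one of three outcomes and only involves a human when action is needed. The page/ticket/log split loosely follows the kinds of monitoring output the same chapter describes; the `Signal` type, the `route` function, and both thresholds are assumptions of mine, not values from the book.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"      # a human needs to act right away
    TICKET = "ticket"  # a human needs to act, but not immediately
    LOG = "log"        # recorded for later analysis, nobody is notified


@dataclass(frozen=True)
class Signal:
    name: str
    error_rate: float  # fraction of failed requests, between 0.0 and 1.0


def route(signal: Signal, page_threshold: float = 0.05, ticket_threshold: float = 0.01) -> Action:
    """Decide what to do with a signal; the thresholds are arbitrary examples."""
    if signal.error_rate >= page_threshold:
        return Action.PAGE
    if signal.error_rate >= ticket_threshold:
        return Action.TICKET
    return Action.LOG


for signal in (Signal("checkout", 0.10), Signal("search", 0.02), Signal("profile", 0.001)):
    print(f"{signal.name}: {route(signal).value}")
```

The point is that nobody has to stare at a dashboard and decide whether 2% errors matters: the rule is written down once, and a person hears about it only when the rule says they should act.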


The book is full of food for thought. It’s not just about Google: the ideas and practices it presents are valuable to all software engineers out there.

I’m enjoying every single chapter. Keep an eye out for new posts, because more will follow regularly as I progress in my reading.


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.
