What can we learn from the Facebook outage?

Published at
10/11/2021
Categories
facebook
postmortem
outage
humanerror
Author
jhall

If you’re like me, you may not have noticed the Facebook/Instagram/WhatsApp outage first-hand. But you’re probably not like me, and you probably found this outage to be a personal nuisance.

Now that things are returning to normal, Facebook has given us a small glimpse into what happened:

Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.

Ah! Human error. Those pesky humans. I hope they learned their lesson.

Not so fast.

It may be easy and tempting to point to the human (or group of humans) who made a configuration error, and call it a day. But the vast majority of technical failures, whether in IT systems, aircraft accidents, automobile accidents, or burned cupcakes, come down to human error. If we end our investigation there, we’ll never really improve.

So if the human who made the configuration mistake is not to blame, who or what is?

Here is a series of questions you can ask the next time you’re faced with this dilemma, to put you on track to a more “human-proof” system:

  • Why did the system allow a human to make an erroneous configuration change?
  • Why was a human error able to have such a broad impact?
  • What safeguards can we put in place to prevent such errors from occurring?
  • What systems can we put in place to detect such errors before they cause a catastrophic failure?
  • What backups can we put in place so that when there’s a similar failure, we can continue to operate?
  • How can we improve the system so that we can detect such failures more quickly in the future?
  • How can we recover from such failures more quickly next time?
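To make the “safeguards” question a bit more concrete, here is a minimal Python sketch of one possible guard: a pre-apply check that refuses any change removing too large a share of the current configuration. Everything in it (the names `ChangeVerdict` and `review_change`, the `max_removed_fraction` threshold, and the idea of modelling a configuration as a set of entries) is a hypothetical illustration. It is not a description of Facebook’s tooling or of what actually went wrong there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeVerdict:
    """Outcome of reviewing a proposed change (hypothetical type)."""
    allowed: bool
    reason: str


def review_change(current: set[str], proposed: set[str],
                  max_removed_fraction: float = 0.1) -> ChangeVerdict:
    """Reject a change that removes too much of the current configuration.

    The 10% default is an arbitrary, assumed threshold; a real system
    would tune it and likely add many other checks.
    """
    # An empty configuration is almost never what the operator intended.
    if not proposed:
        return ChangeVerdict(False, "proposed configuration is empty")

    # Entries that exist today but would disappear after the change.
    removed = current - proposed
    if current and len(removed) / len(current) > max_removed_fraction:
        return ChangeVerdict(
            False,
            f"change removes {len(removed)} of {len(current)} entries",
        )
    return ChangeVerdict(True, "within the allowed limit")
```

With ten entries in `current`, a proposal that drops one of them passes the check, while a proposal that keeps only one is rejected. A guard like this does not prevent human error. It only limits how far a single mistake can spread, which is the point of the second and third questions above.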

I’m sure you can use your imagination to double or triple this list. The point is: even when human error is involved (and it usually is), that should never end your investigation or be considered the root cause. Do a blameless postmortem, and solve every problem twice.


If you enjoyed this message, subscribe to The Daily Commit to get future messages to your inbox.
