Zuri Booking Engine Outage - Incident Report and Recovery Analysis

Published at
1/23/2024
Categories
postmortem
Author
injili

On 20 December 2023, the Zuri Booking Engine experienced downtime. Here is the postmortem. We sincerely apologize to all our valued guests who were inconvenienced by this disruption. We empathize with the challenges you faced during this period and assure you that we are committed to implementing measures to prevent such occurrences in the future.

Issue Summary

Commencing at 14:47 hours EAT, 100% of guests attempting to make reservations and complete transactions through the website were redirected to a 500 Internal Server Error page and never received the SDK push required for transaction authorization. Users retained access to other site components and could still view information unrelated to reservations. The root cause traces back to an update deployed a few minutes earlier, which inadvertently modified how incoming requests were routed and handled. We extend our sincere apologies for the inconvenience caused and are committed to both rectifying the issue promptly and preventing such occurrences in the future. By 16:15 hours, the issue had been resolved and the booking flow was fully operational.
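
The failure was scoped to the reservation flow: requests to the booking routes returned a 500, while the rest of the site still returned 200. A quick probe along these lines (the host and paths here are hypothetical placeholders, not the real site routes) reproduces what guests were seeing:

```python
import requests

# Hypothetical host and paths; the real routes are internal to the Zuri Booking Engine.
BASE_URL = "https://example-zuri-hotel.com"
PATHS = ["/", "/rooms", "/booking/checkout"]

for path in PATHS:
    resp = requests.get(BASE_URL + path, timeout=10)
    # During the incident: "/" and "/rooms" returned 200,
    # while "/booking/checkout" returned 500 Internal Server Error.
    print(f"{path}: {resp.status_code}")
```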

Timeline (East African Time)

  • 14:34 hours - An update was pushed to production.
  • 14:47 hours - A customer sent a complaint.
  • 14:47 hours - The engineering team was notified.
  • 15:34 hours - The configuration change was successfully rolled back, reverting the booking engine to its previous version.
  • 15:35 hours - The engineering team focused on rectifying the update pushed earlier.
  • 16:13 hours - A rectified update was pushed.
  • 16:15 hours - Status page updated to "Resolved".

Root Cause

The incident stemmed from the deployment of untested code to the production environment. After completing the final component of the booking engine, the responsible engineer tested it, then introduced further modifications before pushing the code, so the changes went out untested. The deployed code generated incomplete requests to the backend, causing the affected route to respond with a 500 error: the database was updated, but the SDK push never obtained authorization. Once the issue was recognized, an immediate system rollback was executed. The engineering team then rolled back the database and patched the code so that valid requests were submitted. Rigorous testing across various scenarios followed to ensure the security and correctness of the revised code, and once validated, the corrected code was pushed as the definitive update.
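
To make the failure mode concrete, here is a minimal sketch, assuming a Flask-style backend, of the kind of input validation that would have turned the incomplete requests into an explicit 400 response instead of an unhandled 500. The route, field names, and the SDK call are hypothetical and are not the actual Zuri code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

REQUIRED_FIELDS = ("guest_name", "phone_number", "room_id", "check_in", "check_out")

@app.route("/booking/checkout", methods=["POST"])
def checkout():
    payload = request.get_json(silent=True) or {}

    # Reject incomplete requests explicitly instead of letting them
    # propagate into the payment step and surface as a 500.
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return jsonify({"error": "missing fields", "fields": missing}), 400

    # Only a complete payload reaches the transaction-authorization push.
    # initiate_sdk_push(payload)  # hypothetical call to the payment SDK
    return jsonify({"status": "push initiated"}), 202
```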

Corrective and Preventative Measures

A comprehensive system analysis conducted the following morning allowed us to formulate the following steps to address the issue and prevent a recurrence:

  • Execute a rollback to revert to the originally deployed code.
  • Address the underlying issue in a systematic, section-by-section manner.
  • Thoroughly test the patched code for validation.
  • Deploy the updated website with the implemented patches.
  • Integrate monitoring of requests on the most critical routes to improve proactive issue detection (a rough sketch follows this list).
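
As a rough illustration of the last point, the sketch below counts 5xx responses on critical routes and raises an alert when the error rate crosses a threshold. The framework, route prefixes, and thresholds are assumptions rather than the production setup:

```python
import logging
from collections import deque

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

CRITICAL_PREFIXES = ("/booking",)   # hypothetical critical routes
WINDOW = 50                         # last N requests considered
ERROR_RATE_THRESHOLD = 0.2          # alert above a 20% server-error rate

recent = deque(maxlen=WINDOW)

@app.after_request
def track_errors(response):
    if request.path.startswith(CRITICAL_PREFIXES):
        recent.append(1 if response.status_code >= 500 else 0)
        if len(recent) == WINDOW and sum(recent) / WINDOW > ERROR_RATE_THRESHOLD:
            # In production this would page the on-call engineer
            # rather than only log a warning.
            logging.warning("High 5xx rate on critical routes: %.0f%%",
                            100 * sum(recent) / WINDOW)
    return response
```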