Production incidents - 7 practical tips to help you through your next incident

Published at
9/18/2022
Categories
monitoring
tutorial
production
incidents
Author
liorhalfon

When your product or service is down, you lose more than just money. You also lose the trust of your users and partners.
That is why being proactive and doing what you can to prepare for production incidents is essential.

Here are 7 pieces of advice to help you and your team prepare for production incidents and deal with them when they happen.

Before the incident:

#1 Don't let error logs fall through the cracks
You probably use some kind of logger in your service to log errors and other informative data. If not, it's time to start!
For example, when an exception is raised, a lot of developers write something like this:

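try {
  // Block of code to try
} catch (Exception e) {
  // Assuming an SLF4J-style logger: {} is the placeholder, and the exception
  // goes last so its stack trace is logged together with the context.
  logger.error("Got an exception, context: {}", context, e);
}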

If your service log is filling up with unknown exceptions, you should be worried, because bad things are happening and users may be impacted.
Therefore, you should monitor the error logs in your service and fire alerts on them.
Some errors are harmless and some are brutal, so decide for yourself which ones you want to be alerted on.

The ELK Stack is one of the most popular log management and log analysis platforms used for monitoring.
You can run a periodic query on top of Elasticsearch to find undesired errors, and fire alerts when they happen.
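
To make the periodic check above more concrete, here is a minimal sketch in plain Java. It is not an Elasticsearch client: `LogEntry`, `ErrorLogMonitor`, the time window, and the threshold are all hypothetical names and values, and the list of entries stands in for whatever your log query returns.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical log record; in practice it would be built from your log store's query result.
final class LogEntry {
    final Instant timestamp;
    final String level;
    final String message;

    LogEntry(Instant timestamp, String level, String message) {
        this.timestamp = timestamp;
        this.level = level;
        this.message = message;
    }
}

// Hypothetical periodic check: alert when too many ERROR entries appear in a recent window.
class ErrorLogMonitor {
    private final Duration window;
    private final long threshold;

    ErrorLogMonitor(Duration window, long threshold) {
        this.window = window;
        this.threshold = threshold;
    }

    // Counts ERROR entries whose timestamp falls inside the window ending at 'now'.
    long countRecentErrors(List<LogEntry> entries, Instant now) {
        Instant cutoff = now.minus(window);
        return entries.stream()
                .filter(e -> "ERROR".equals(e.level))
                .filter(e -> !e.timestamp.isBefore(cutoff) && !e.timestamp.isAfter(now))
                .count();
    }

    // True when the number of recent errors reaches the configured threshold.
    boolean shouldAlert(List<LogEntry> entries, Instant now) {
        return countRecentErrors(entries, now) >= threshold;
    }
}
```

You would call `shouldAlert` from a scheduled job and hand the result to whatever alerting channel you use. The threshold is the place where you decide which errors are worth waking someone up for.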

#2 Catch bad behaviors early by adding statistics to your service
Understanding the "status" of your service is essential for ensuring its reliability and stability. Detailed information and high visibility of the processes in your service not only help your team react to issues but also give them the confidence to make changes.

One of the best ways to gain this "status" insight is with a robust monitoring system that gathers metrics, visualizes data, and fires an alert when things appear to be broken.
A robust monitoring and alerting system will help you solve issues sooner and will minimize the damage done by the incident.

As with logs, you can configure alerts on your service based on statistics such as latency, queue lengths, CPU, memory usage, and so on.
Prometheus and Grafana are among the most popular monitoring and alerting tools.
You can also set up PagerDuty to make sure there is always an owner who handles incoming alerts.

When configuring metrics and alerts, focus on the service's purpose. What is its job? For example, if it's serving HTTP requests, then an important metric to set up is the HTTP return status code count.
If the service's job is to write something to a database, then an important metric to have can be the rate of writes to the database.
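
As an illustration of the status code metric mentioned above, here is a small in-memory sketch. `StatusCodeCounter` is a hypothetical class written for this example. In a real service you would more likely export such counters through your metrics library (a Prometheus client, for instance) instead of keeping them by hand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counter of HTTP response status codes.
class StatusCodeCounter {
    private final Map<Integer, LongAdder> counts = new ConcurrentHashMap<>();

    // Call once per response, for example from a request filter.
    void record(int statusCode) {
        counts.computeIfAbsent(statusCode, code -> new LongAdder()).increment();
    }

    // Number of responses recorded with the given status code.
    long count(int statusCode) {
        LongAdder adder = counts.get(statusCode);
        return adder == null ? 0 : adder.sum();
    }

    // Fraction of recorded responses with a 5xx status; 0.0 when nothing was recorded.
    double serverErrorRatio() {
        long total = 0;
        long serverErrors = 0;
        for (Map.Entry<Integer, LongAdder> entry : counts.entrySet()) {
            long n = entry.getValue().sum();
            total += n;
            if (entry.getKey() >= 500 && entry.getKey() <= 599) {
                serverErrors += n;
            }
        }
        return total == 0 ? 0.0 : (double) serverErrors / total;
    }
}
```

A ratio like `serverErrorRatio()` is usually a better alert condition than a raw count, because it stays meaningful when traffic goes up or down.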

#3 Prepare runbooks and invest in service documentation
It's possible that when things break, the owner or expert of the broken service will not be available, or that the person handling the incident will have no idea how to deal with it. Therefore, it's a good idea to have a prepared response plan for emergencies. (Here are some best practices to write one)

The benefit of these prepared response plans (or runbooks) is that you don't need an expert to be available every time there is a problem. This in turn reduces the burnout that builds up when the same person has to deal with the same problems over and over again.
If you can automate these plans, even better. For example, restart the service automatically if the health check is failing.
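
Here is one way the automated restart could look, as a rough sketch. `HealthWatchdog` is a hypothetical class, and both the health check and the restart action are passed in by the caller, since how you probe and restart a service depends entirely on your environment. Requiring several consecutive failures is an assumption added here to avoid restarting on a single flaky probe.

```java
import java.util.function.BooleanSupplier;

// Hypothetical watchdog: triggers a restart after N consecutive failed health checks.
class HealthWatchdog {
    private final BooleanSupplier healthCheck;
    private final Runnable restartAction;
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    HealthWatchdog(BooleanSupplier healthCheck, Runnable restartAction, int maxConsecutiveFailures) {
        this.healthCheck = healthCheck;
        this.restartAction = restartAction;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Runs one health probe; returns true if the restart action was triggered.
    boolean probe() {
        if (healthCheck.getAsBoolean()) {
            consecutiveFailures = 0; // a healthy probe resets the failure streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            consecutiveFailures = 0; // start counting again after the restart
            restartAction.run();
            return true;
        }
        return false;
    }
}
```

You would call `probe()` on a fixed schedule. Many platforms offer this kind of behavior out of the box, so treat this as an illustration of the idea and not as something you need to build yourself.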

The downside of response plans is that each one is written for a specific case, and you can never cover all the cases.
So it is wise to also train your incident responders in generic actions, like finding a log or rolling back a deployment.

Another important aspect of this is "service documentation". It doesn't have to be super detailed, but just enough to give the incident responder entry points to the service and critical things to know about it.

During an incident:

#4 "Read the damn error message"
The first thing you should do when you want to solve the issue at hand is to understand what exactly went wrong.
For example, if you get an error message from the log, read it carefully and look for details like "what just happened" and "where it happened", and only then try to think about "why it happened" and what you should do about it.

So, the error messages in your log are valuable. This is why they should be as detailed as possible and include important data, so anybody who reads them will know the "what" and the "where" immediately. (Stack traces are great for the "where" part)
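
As a small illustration of putting the "what" into the message itself, here is a sketch of a helper that attaches key details to an error description. `ErrorContext` and the field names in the usage example are hypothetical; the point is only that the message states the failed operation and the data needed to investigate it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds a detailed, consistent error message.
class ErrorContext {
    private final String operation;
    private final Map<String, Object> details = new LinkedHashMap<>();

    ErrorContext(String operation) {
        this.operation = operation;
    }

    // Adds one detail; insertion order is preserved in the final message.
    ErrorContext with(String key, Object value) {
        details.put(key, value);
        return this;
    }

    // Builds a message of the form: Failed to <operation> [key1=value1, key2=value2]
    String describe() {
        StringBuilder sb = new StringBuilder("Failed to ").append(operation).append(" [");
        boolean first = true;
        for (Map.Entry<String, Object> entry : details.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(entry.getKey()).append('=').append(entry.getValue());
            first = false;
        }
        return sb.append(']').toString();
    }
}
```

For example, `new ErrorContext("charge order").with("orderId", 42).with("attempt", 3).describe()` produces `Failed to charge order [orderId=42, attempt=3]`. Logging that string together with the exception gives the reader both the "what" and, through the stack trace, the "where".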

#5 Use data to understand the issue
When you encounter a production issue, your first instinct shouldn't be to guess, jump to a conclusion, and immediately try a fix. That can waste time and resources. Instead, use the data you have at hand to devise a hypothesis, and then validate that hypothesis with hard data. Think about it like playing detective.

There are two ways to go about this: induction or deduction.
With induction, you locate the relevant data, organize it, and then devise a hypothesis. Once you have a hypothesis, you can use data to prove it and then fix the problem.
Deduction works similarly but involves enumerating the possible causes or hypotheses first and then eliminating the ones that don't fit with the data. This allows you to refine your remaining hypothesis until you arrive at a conclusion that can be proved with data.
Either way, data should be at the heart of your efforts to solve the issue.

#6 Aviate, Navigate, Communicate
One of the first things pilots learn at ground school is what to do in case something goes wrong with the aircraft - "Aviate, Navigate, Communicate".
Aviate means keeping the plane in the air. Navigate - to a safe location. And lastly, Communicate - let ground control know you are in trouble so they can help.

We can apply the same principles used by pilots to the software domain too.
For example, during a complete outage of our system (though not only then):

  1. Aviate - Do everything you can to bring the system back up first, while saving data (a thread dump, for example) so you can analyze the root cause later.
  2. Navigate - Keep making decisions even when you lack complete information (avoid analysis paralysis).
  3. Communicate - Inform everyone that there is an issue. Maybe they can help.
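
For the "saving data" part of step 1, a JVM service can capture its own thread dump before it is restarted. The sketch below uses the standard `java.lang.management` API; the `ThreadDumper` class name and the output format are choices made for this example. From outside the process, the JDK's `jstack <pid>` tool gives similar information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper that captures a thread dump of the current JVM as text.
class ThreadDumper {
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Lock details are requested only when this JVM supports reporting them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // Print every frame; ThreadInfo.toString() may truncate long stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Writing the result of `ThreadDumper.dump()` to a file or to your log before restarting keeps the evidence around for the root cause analysis later.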

Credit to Barak Luzon and Ariel Pizatsky, who mentioned these principles in an excellent presentation they gave - When The Firefighters Come Knocking

#7 Avoid psychological biases and pitfalls
Our brain is wired to make decision-making simpler. In doing so, it exposes itself to biases, heuristics, and other quirks that lead to what look like "bad decisions" in hindsight.
One such example is the "simulation heuristic". The simulation heuristic is a psychological heuristic, or simplified mental strategy, according to which people determine the likelihood of an event based on how easy it is to picture it mentally.
That is, you may think you know the incident's root cause simply because it's the one that is easiest for you to imagine.
Another example is the "confirmation bias". The confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one's prior beliefs or values.
For instance, you are more likely to find evidence that supports your existing hypothesis, and ignore evidence that disproves it.
So what can you do to avoid these human biases? It's hard to say, and I don't think there is a silver bullet here. But being aware of them is the first step to mitigating them.
I encourage you to watch a great talk by Boris Cherkasky that dives into how psychological biases affect incident response.

Wrapping up

Investing in a resilient monitoring and alerting system will give you more confidence when performing actions and deploying features on the system.
Training your team to react to disasters will remove the fear of breaking things.
As a result, you may increase the development velocity of the now "fearless" developers.
