
The Saga of the Replica Lag

Published at: 10/31/2023
Categories: replicalag, rds, aws, datadog
Author: viveknayyar

Picture this: You’ve just started a new role as a senior backend engineer, and during your first week on call, you’re abruptly awakened at 5 AM by a barrage of alerts. Some scream “Quotation Creation Failure,” others whisper “No More Duplicate Policies,” and another ominously mentions “Severe RDS Lag.” This is the story of an on-call incident, a late-night puzzle that demands your attention.

Step 1: Unraveling the Mystery

  • Upon glancing at your phone, you see an alert description pointing to an error in the quotation service.

  • Oddly enough, you notice that this issue occurs like clockwork every day in the same early-morning window, around 3–5 AM.

Step 2: The Search for Clues

  • As you sleepily reach for your phone, your first move is to head to the Datadog dashboard, hoping to find insights in APM traces.
  • Within the traces, you stumble upon an enigmatic error message, “invalid character ‘i’”, which hints at a potential gateway timeout.

Step 3: Unearthing the Truth

  • Armed with the timestamp from the initial alert, you dive into Datadog’s APM to pinpoint the exact error trace on quotation-service around 5:22 AM.

  • You decide to dig further and check whether any other functions are invoked within the “create quotation” function. To your surprise, you come across a call to the policy service to create a policy, which might have timed out.

Step 4: Digging Deeper

  • The next step in your thought process is to determine where to locate the complete request and response to examine the unique identifier for a policy.
  • After narrowing down the CloudWatch logs to the error’s timestamp, you strike gold.

Step 5: The Deep Dive

  • Amid the policy service log entries, one catches your attention: it is directly linked to the timeout.
  • Surprisingly, the API call didn’t fail outright. Its response was so delayed that the caller timed out. The timeout caused the customer to retry the operation, and the retry returned an error indicating a duplicate policy number. This occurred because the policy had already been created in our master database, but the response to the customer had timed out.
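
The timeout-then-retry sequence described above can be sketched in a few lines. This is a minimal simulation, not the actual service code: the class and function names, the policy number "POL-001", the 6-second response time, and the 5-second client timeout are all illustrative assumptions.

```python
class DuplicatePolicyError(Exception):
    """Raised when the policy number already exists on the primary."""


class PolicyService:
    """Hypothetical stand-in for the policy service (all names are illustrative)."""

    def __init__(self, response_time_s):
        self.primary = {}                        # policy_number -> payload
        self.response_time_s = response_time_s   # simulated end-to-end latency

    def create_policy(self, policy_number, payload):
        if policy_number in self.primary:
            raise DuplicatePolicyError(policy_number)
        self.primary[policy_number] = payload    # the write itself succeeds
        return self.response_time_s              # how long the caller had to wait


def create_with_client_timeout(service, policy_number, payload, timeout_s):
    """Return the outcome as the calling service would observe it."""
    try:
        waited_s = service.create_policy(policy_number, payload)
    except DuplicatePolicyError:
        return "duplicate policy number"
    if waited_s > timeout_s:
        return "timeout"   # the caller gives up, but the row is already stored
    return "created"


def simulate_retry():
    """First attempt times out after writing; the retry hits the duplicate check."""
    service = PolicyService(response_time_s=6.0)   # assumed slow response
    payload = {"plan": "basic"}                    # assumed payload
    first = create_with_client_timeout(service, "POL-001", payload, timeout_s=5.0)
    retry = create_with_client_timeout(service, "POL-001", payload, timeout_s=5.0)
    return [first, retry]


if __name__ == "__main__":
    print(simulate_retry())
```

The point of the sketch is that the write and the response are decoupled: a slow response does not undo the insert, so a client that retries after a timeout collides with its own earlier request.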

Step 6: Unmasking the Core Issue

  • To unravel the mystery of the sluggish API call, you decide to dig deeper into Datadog’s APM, now focusing entirely on the policy service.
  • Finally, within one of the traces, you pinpoint the culprit: a “select *” query responsible for idempotency checks.
  • But now you’re puzzled about why a read query would affect your “create policy” call. You examine your master database, and everything seems fine: CPU and memory usage are within acceptable ranges, below 20–30%.
  • Perplexed, you scrutinize the policy service code and discover that the idempotency check is, in fact, reading from the replica database. You check the CPU usage for the replica, and there it is: the replica database meant to assist is struggling under heavy load.
  • As a result, the “select *” query took nearly 6 seconds to execute.
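
A small timing breakdown shows why a healthy master did not rule out a slow request. This is a sketch under stated assumptions: the `Store` type and function names are hypothetical, the 0.02-second primary latency is made up, and only the roughly 6-second replica read comes from the trace described above.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Hypothetical database endpoint with a simulated per-query latency."""
    name: str
    query_latency_s: float


def create_policy_timings(primary: Store, replica: Store) -> dict:
    """Return the time spent in each step of one create-policy request."""
    return {
        f"idempotency check ({replica.name})": replica.query_latency_s,
        f"insert policy ({primary.name})": primary.query_latency_s,
    }


def slowest_step(timings: dict) -> str:
    """Name of the step that contributes the most latency."""
    return max(timings, key=timings.get)


primary = Store("primary", query_latency_s=0.02)  # assumed: healthy master
replica = Store("replica", query_latency_s=6.0)   # nearly 6 s, per the trace

timings = create_policy_timings(primary, replica)
total_s = sum(timings.values())
print(f"total: {total_s:.2f}s, slowest: {slowest_step(timings)}")
```

Because the read on the replica sits in front of the write on the master, the request is only as fast as the slower of the two databases, regardless of how idle the master looks.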

Step 7: The Eureka Moment

  • While you’ve now determined why the policy creation timed out, the reason the read replica was lagging remains a mystery.
  • Driven by curiosity, you decide to investigate AWS RDS monitoring. There, you stumble upon the culprit: a Metabase query was running at 5 AM, overwhelming the read replica.
  • This query induced replication lag, ultimately triggering the alert.

Step 8: Resolution and Wisdom Gained

  • Armed with newfound understanding, you take decisive action. First, you stop the troublesome Metabase query, putting an end to its havoc.
  • Next, you make a crucial architectural change. You shift the logic for creating policies, which previously relied on the replica database for idempotency checks, to the master database.
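
The architectural change can be illustrated with a toy primary/replica pair. This is a minimal sketch, not the team’s implementation: `ReplicatedStore`, `policy_exists`, `create_policy`, and the `read_from` parameter are hypothetical names. It also highlights a side benefit the incident write-up does not mention explicitly: while a replica lags, it may not yet contain a row that the primary already has, so a check against the primary sees the latest writes.

```python
class ReplicatedStore:
    """Toy primary/replica pair; the replica only sees writes after replicate()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value            # writes always go to the primary

    def replicate(self):
        self.replica = dict(self.primary)    # simulate the replica catching up


def policy_exists(store, policy_number, read_from="primary"):
    """Idempotency check; read_from selects which copy answers the query."""
    source = store.primary if read_from == "primary" else store.replica
    return policy_number in source


def create_policy(store, policy_number, payload, read_from="primary"):
    """Create the policy once; report 'exists' on a repeated request."""
    if policy_exists(store, policy_number, read_from):
        return "exists"
    store.write(policy_number, payload)
    return "created"


store = ReplicatedStore()
first = create_policy(store, "POL-001", {"plan": "basic"})

# The replica has not caught up yet, so it cannot see the new policy.
seen_by_lagging_replica = policy_exists(store, "POL-001", read_from="replica")

# The retry checks the primary and recognises the earlier request.
retry = create_policy(store, "POL-001", {"plan": "basic"}, read_from="primary")

store.replicate()
seen_after_catch_up = policy_exists(store, "POL-001", read_from="replica")

print(first, seen_by_lagging_replica, retry, seen_after_catch_up)
```

The trade-off is that the primary now serves one extra read per create request, which is usually acceptable for a lookup on a unique key but worth confirming against the primary’s own load.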

Conclusion
This story highlights the kind of challenges that arise at scale. We anticipate sharing more of these challenges in the future.
