2x Faster, 40% less RAM: The Cloud Run stdout logging hack

Published: 11/24/2024
Categories: cloud, googlecloud, performance, sre
Author: siddhantkcode

Sometimes, the simplest solutions yield the most dramatic improvements. In a recent private project, I discovered that changing how I handled logging in Cloud Run led to significant performance gains and memory savings. Here's the full story of what happened, what I learned, and how you can implement these improvements in your projects.

High level diagram

The setup

I've been running a Node.js (v18.x) application with NestJS (v10.x) on Cloud Run for about 8 months now, handling around 50k requests per day. I chose @google-cloud/logging-winston (v6.0.0) for logging because of its promised seamless integration with Google Cloud's ecosystem. Initially, I was using the default configuration, which directly calls the Cloud Logging API. Everything seemed fine at first glance - the logs appeared in Cloud Logging, and the application was running. But then things got interesting.
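
For reference, the original setup looked roughly like this - a minimal sketch, assuming the same winston and @google-cloud/logging-winston versions; with no options set, the transport writes each entry to the Cloud Logging API directly from the container:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    // Default behavior: entries are batched in memory and written via
    // the logging.googleapis.com API rather than to stdout
    new LoggingWinston(),
  ],
});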

The mystery of the memory spikes

My application was configured with 1 vCPU and 4 GiB RAM, which should have been plenty. However, I noticed some concerning patterns in my metrics:

  • Memory usage would occasionally spike to 100%
  • CPU utilization would hit maximum capacity during these spikes
  • Container instance time was higher than expected

When these memory spikes occurred, I started seeing errors like:

Error: Quota exceeded for quota metric 'Write requests' and limit 'Write requests per minute' of service 'logging.googleapis.com'

And the dreaded Cloud Run timeout messages:

Error: The request has been terminated because it has reached the maximum request timeout.

While these metrics were concerning, the application seemed to be functioning normally most of the time, so I initially assigned it a lower priority for investigation. That was until I encountered something that made me sit up and take notice.

The three-hour log delay incident

One day, while monitoring my application, I noticed something bizarre. An alert was triggered based on a log condition, but when I investigated, I discovered the log was from more than three hours ago!

Here's what the delayed log looked like:

{
  "severity": "ERROR",
  "message": "Database connection failed",
  "timestamp": "2024-10-19T10:15:00.000Z",
  "receiveTimestamp": "2024-10-19T13:15:03.421Z",
  // Over 3 hours difference!
}
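To quantify the lag, you can simply compare the two timestamp fields; a quick illustrative snippet (the entry object just mirrors the example above):

// A log entry's timestamp is when the event happened; receiveTimestamp is
// when Cloud Logging actually received it
const entry = {
  timestamp: "2024-10-19T10:15:00.000Z",
  receiveTimestamp: "2024-10-19T13:15:03.421Z",
};

const lagMs = Date.parse(entry.receiveTimestamp) - Date.parse(entry.timestamp);
console.log(`Ingestion lag: ${(lagMs / 60000).toFixed(1)} minutes`); // ~180.1 minutes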

This wasn't just a minor inconvenience - it had serious implications:

  • Log-based alerts were delayed, making them less useful for real-time monitoring
  • Backed-up logs were consuming memory
  • Some instances were failing to accept requests with timeout errors

During investigation, these commands were invaluable:

# Check container logs for OOM issues
gcloud logging read "resource.type=cloud_run_revision AND textPayload:\"OOMKilled\""

# Check the service's current status conditions
gcloud run services describe my-service --format='get(status.conditions)'

The root cause analysis

After investigation, I pieced together what was happening:

  1. Logs weren't being sent immediately to Cloud Logging
  2. They were accumulating in memory
  3. Eventually, there wasn't enough memory to even handle new requests
  4. The application would start throwing timeout errors

The worst part? If an instance terminated while logs were stuck in memory, those logs would be lost forever. It's not ideal for observability!

Root cause analysis diagram

The Solution: stdout to the rescue

The fix turned out to be surprisingly simple. Instead of using the Cloud Logging API directly, I modified the configuration to write logs to standard output. Cloud Run automatically forwards stdout to Cloud Logging, making this a more efficient approach.

stdout solution diagram for cloud run logs

Here's how to implement it:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    new LoggingWinston({
      // This redirects logs to Cloud Run's stdout
      redirectToStdout: true,
      // Default is true; false keeps the payload flat instead of
      // wrapping it in an extra "message" field
      useMessageField: false,
    }),
  ],
});
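Since the app runs on NestJS, I also route the framework's own logs through the same transport. A minimal sketch, assuming the nest-winston package is used as the bridge (it isn't part of the original setup, just one common way to wire winston into NestJS):

import { NestFactory } from "@nestjs/core";
import { WinstonModule } from "nest-winston";
import { LoggingWinston } from "@google-cloud/logging-winston";
import { AppModule } from "./app.module"; // your existing root module

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    // Reuse the stdout-based winston config from above so that
    // Nest's internal logs also go to stdout
    logger: WinstonModule.createLogger({
      transports: [
        new LoggingWinston({
          redirectToStdout: true,
          useMessageField: false,
        }),
      ],
    }),
  });
  await app.listen(process.env.PORT ?? 8080); // Cloud Run injects PORT
}

bootstrap();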

The results were crazy!!!

After implementing this change, the improvements were immediate and significant:

Memory metrics:

  • Before:

    • Peak memory usage: 3.8GB of 4GB (95%)
    • Average memory usage: 3.2GB (80%)
    • Frequent OOM risks
  • After:

    • Peak memory usage: 1.6GB (40%)
    • Average memory usage: 1.2GB (30%)
    • Zero OOM incidents in 30 days

Performance metrics:

  • p95 latency decreased from 800ms to 300ms
  • Error rate dropped from 0.5% to 0.02%
  • Average CPU utilization dropped from 85% to 40%

Cost impact:

  • 40% reduction in Cloud Run instance hours
  • 35% decrease in logging costs
  • Estimated annual savings: $2,400

I validated the improvement with a load test:

  • Tool: Artillery
  • Duration: 30 minutes
  • Load: 0 to 500 RPS ramp-up
  • Payload: 1KB JSON messages

The stdout version handled 2x the load with half the memory usage.

The blue line in the center of the following images indicates when the change was applied.

Memory usage

CPU usage

Billed container instance hours

Important considerations for structured logging

When using stdout logging with Cloud Run, there are some crucial points to keep in mind:

  1. Handle null values in labels carefully: If any label value is null, it gets stored in a jsonPayload field under logging.googleapis.com/labels. I solve this by converting all label values to strings before logging (see the sketch after this list).

  2. Timestamp precision: Always pass timestamps as Date objects. Otherwise, the logger might override your timestamp with its processing time.

  3. Trace configuration: For request tracing, derive the trace value for your logs from the X-Cloud-Trace-Context request header. This lets you filter logs by request, which is invaluable for debugging. Learn more in the official docs.

  4. Instance identification: Cloud Run automatically includes labels.instanceId, which is extremely useful for debugging instance-specific issues.
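
Putting those points together, here's a rough sketch of how I apply them when logging inside a request handler - the helper names (toStringLabels, traceFromRequest, logWithRequestContext) and the Express-style request object are purely illustrative:

import { LoggingWinston } from "@google-cloud/logging-winston";
import type { Request } from "express";
import type { Logger } from "winston";

// 1. Convert every label value to a string so a null never leaks into
//    a jsonPayload field under logging.googleapis.com/labels
const toStringLabels = (labels: Record<string, unknown>): Record<string, string> =>
  Object.fromEntries(
    Object.entries(labels).map(([key, value]) => [key, value == null ? "" : String(value)])
  );

// 3. Build the trace resource name from the X-Cloud-Trace-Context header
//    (format: "TRACE_ID/SPAN_ID;o=1") so logs can be filtered per request
const traceFromRequest = (req: Request, projectId: string): string | undefined => {
  const traceId = req.header("X-Cloud-Trace-Context")?.split("/")[0];
  return traceId ? `projects/${projectId}/traces/${traceId}` : undefined;
};

function logWithRequestContext(
  logger: Logger,
  req: Request,
  message: string,
  labels: Record<string, unknown>,
) {
  logger.info(message, {
    labels: toStringLabels(labels),
    // 2. Pass the timestamp as a Date object so the logger doesn't
    //    replace it with its own processing time
    timestamp: new Date(),
    // LOGGING_TRACE_KEY resolves to "logging.googleapis.com/trace"
    [LoggingWinston.LOGGING_TRACE_KEY]: traceFromRequest(req, "my-gcp-project"),
  });
}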

Alternative approaches I considered

Before settling on stdout logging, I evaluated several other options:

  1. Optimizing API calls:

    • Adjusting API call frequency
    • Batch processing logs
    • Implementing retry logic
  2. Using logging agents:

    • Exploring the Logging agent (though it's no longer in active development)
    • Investigating the Ops agent (more suited for Compute Engine)
  3. Custom buffering solutions:

    • Implementing an in-memory buffer with size limits
    • Creating a separate logging microservice

The stdout approach won out due to its simplicity, effectiveness, and built-in support in Cloud Run.

Have you encountered similar logging challenges in your Cloud Run applications? I'd love to hear about your experiences and solutions in the comments below!


For more tips and insights, follow me on Twitter @Siddhant_K_code to stay updated with more detailed tech content like this.
