2x Faster, 40% less RAM: The Cloud Run stdout logging hack

Published: 11/24/2024
Categories: cloud, googlecloud, performance, sre
Author: siddhantkcode

Sometimes, the simplest solutions yield the most dramatic improvements. In a recent private project, I discovered that changing how I handled logging in Cloud Run led to significant performance gains and memory savings. Here's the full story of what happened, what I learned, and how you can implement these improvements in your projects.

High level diagram

The setup

I've been running a Node.js (v18.x) application with NestJS (v10.x) on Cloud Run for about 8 months now, handling around 50k requests per day. I chose @google-cloud/logging-winston (v6.0.0) for logging because of its promised seamless integration with Google Cloud's ecosystem. Initially, I was using the default configuration, which directly calls the Cloud Logging API. Everything seemed fine at first glance - the logs appeared in Cloud Logging, and the application was running. But then things got interesting.
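
For reference, the original setup looked roughly like this - a minimal sketch, assuming the same winston and @google-cloud/logging-winston versions; with no options set, the transport writes each entry to the Cloud Logging API directly from the container:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    // Default behavior: entries are batched in memory and written via
    // the logging.googleapis.com API rather than to stdout
    new LoggingWinston(),
  ],
});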

The mystery of the memory spikes

My application was configured with 1 vCPU and 4 GiB RAM, which should have been plenty. However, I noticed some concerning patterns in my metrics:

  • Memory usage would occasionally spike to 100%
  • CPU utilization would hit maximum capacity during these spikes
  • Container instance time was higher than expected

When these memory spikes occurred, I started seeing errors like:

Error: Quota exceeded for quota metric 'Write requests' and limit 'Write requests per minute' of service 'logging.googleapis.com'

And the dreaded Cloud Run timeout messages:

Error: The request has been terminated because it has reached the maximum request timeout.

While these metrics were concerning, the application seemed to be functioning normally most of the time, so I initially assigned it a lower priority for investigation. That was until I encountered something that made me sit up and take notice.

The three-hour log delay incident

One day, while monitoring my application, I noticed something bizarre. An alert was triggered based on a log condition, but when I investigated, I discovered the log was from more than three hours ago!

Here's what the delayed log looked like:

{
  "severity": "ERROR",
  "message": "Database connection failed",
  "timestamp": "2024-10-19T10:15:00.000Z",
  "receiveTimestamp": "2024-10-19T13:15:03.421Z",
  // Over 3 hours difference!
}
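To quantify the lag, you can simply compare the two timestamp fields; a quick illustrative snippet (the entry object just mirrors the example above):

// A log entry's timestamp is when the event happened; receiveTimestamp is
// when Cloud Logging actually received it
const entry = {
  timestamp: "2024-10-19T10:15:00.000Z",
  receiveTimestamp: "2024-10-19T13:15:03.421Z",
};

const lagMs = Date.parse(entry.receiveTimestamp) - Date.parse(entry.timestamp);
console.log(`Ingestion lag: ${(lagMs / 60000).toFixed(1)} minutes`); // ~180.1 minutes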

This wasn't just a minor inconvenience - it had serious implications:

  • Log-based alerts were delayed, making them less useful for real-time monitoring
  • Backed-up logs were consuming memory
  • Some instances were failing to accept requests with timeout errors

During investigation, these commands were invaluable:

# Check container logs for OOM issues
gcloud logging read "resource.type=cloud_run_revision AND textPayload:\"OOMKilled\""

# Check the service's current status conditions
gcloud run services describe my-service --format='get(status.conditions)'

The root cause analysis

After investigation, I pieced together what was happening:

  1. Logs weren't being sent immediately to Cloud Logging
  2. They were accumulating in memory
  3. Eventually, there wasn't enough memory to even handle new requests
  4. The application would start throwing timeout errors

The worst part? If an instance terminated while logs were stuck in memory, those logs would be lost forever. It's not ideal for observability!

Root cause analysis diagram

The Solution: stdout to the rescue

The fix turned out to be surprisingly simple. Instead of using the Cloud Logging API directly, I modified the configuration to write logs to standard output. Cloud Run automatically forwards stdout to Cloud Logging, making this a more efficient approach.

stdout solution diagram for cloud run logs

Here's how to implement it:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    new LoggingWinston({
      // This redirects logs to Cloud Run's stdout
      redirectToStdout: true,
      // Default is true; false keeps the payload flat instead of
      // wrapping it in an extra "message" field
      useMessageField: false,
    }),
  ],
});
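Since the app runs on NestJS, I also route the framework's own logs through the same transport. A minimal sketch, assuming the nest-winston package is used as the bridge (it isn't part of the original setup, just one common way to wire winston into NestJS):

import { NestFactory } from "@nestjs/core";
import { WinstonModule } from "nest-winston";
import { LoggingWinston } from "@google-cloud/logging-winston";
import { AppModule } from "./app.module"; // your existing root module

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    // Reuse the stdout-based winston config from above so that
    // Nest's internal logs also go to stdout
    logger: WinstonModule.createLogger({
      transports: [
        new LoggingWinston({
          redirectToStdout: true,
          useMessageField: false,
        }),
      ],
    }),
  });
  await app.listen(process.env.PORT ?? 8080); // Cloud Run injects PORT
}

bootstrap();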

The results were crazy!!!

After implementing this change, the improvements were immediate and significant:

Memory metrics:

  • Before:

    • Peak memory usage: 3.8GB of 4GB (95%)
    • Average memory usage: 3.2GB (80%)
    • Frequent OOM risks
  • After:

    • Peak memory usage: 1.6GB (40%)
    • Average memory usage: 1.2GB (30%)
    • Zero OOM incidents in 30 days

Performance metrics:

  • p95 latency decreased from 800ms to 300ms
  • Error rate dropped from 0.5% to 0.02%
  • Average CPU utilization dropped from 85% to 40%

Cost impact:

  • 40% reduction in Cloud Run instance hours
  • 35% decrease in logging costs
  • Estimated annual savings: $2,400

I validated the improvement with a load test:

  • Tool: Artillery
  • Duration: 30 minutes
  • Load: 0 to 500 RPS ramp-up
  • Payload: 1KB JSON messages

The stdout version handled 2x the load with half the memory usage.

The blue line in the center of the following images indicates when the change was applied.

Memory usage

CPU usage

Billed container instance hours

Important considerations for structured logging

When using stdout logging with Cloud Run, there are some crucial points to keep in mind:

  1. Handle null values in labels carefully: If any label value is null, it gets stored in a jsonPayload field under logging.googleapis.com/labels. I solve this by converting all label values to strings before logging (see the sketch after this list).

  2. Timestamp precision: Always pass timestamps as Date objects. Otherwise, the logger might override your timestamp with its processing time.

  3. Trace configuration: For request tracing, derive the trace value for your logs from the X-Cloud-Trace-Context request header. This lets you filter logs by request, which is invaluable for debugging. Learn more in the official docs.

  4. Instance identification: Cloud Run automatically includes labels.instanceId, which is extremely useful for debugging instance-specific issues.
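
Putting those points together, here's a rough sketch of how I apply them when logging inside a request handler - the helper names (toStringLabels, traceFromRequest, logWithRequestContext) and the Express-style request object are purely illustrative:

import { LoggingWinston } from "@google-cloud/logging-winston";
import type { Request } from "express";
import type { Logger } from "winston";

// 1. Convert every label value to a string so a null never leaks into
//    a jsonPayload field under logging.googleapis.com/labels
const toStringLabels = (labels: Record<string, unknown>): Record<string, string> =>
  Object.fromEntries(
    Object.entries(labels).map(([key, value]) => [key, value == null ? "" : String(value)])
  );

// 3. Build the trace resource name from the X-Cloud-Trace-Context header
//    (format: "TRACE_ID/SPAN_ID;o=1") so logs can be filtered per request
const traceFromRequest = (req: Request, projectId: string): string | undefined => {
  const traceId = req.header("X-Cloud-Trace-Context")?.split("/")[0];
  return traceId ? `projects/${projectId}/traces/${traceId}` : undefined;
};

function logWithRequestContext(
  logger: Logger,
  req: Request,
  message: string,
  labels: Record<string, unknown>,
) {
  logger.info(message, {
    labels: toStringLabels(labels),
    // 2. Pass the timestamp as a Date object so the logger doesn't
    //    replace it with its own processing time
    timestamp: new Date(),
    // LOGGING_TRACE_KEY resolves to "logging.googleapis.com/trace"
    [LoggingWinston.LOGGING_TRACE_KEY]: traceFromRequest(req, "my-gcp-project"),
  });
}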

Alternative approaches I considered

Before settling on stdout logging, I evaluated several other options:

  1. Optimizing API calls:

    • Adjusting API call frequency
    • Batch processing logs
    • Implementing retry logic
  2. Using logging agents:

    • Exploring the Logging agent (though it's no longer in active development)
    • Investigating the Ops agent (more suited for Compute Engine)
  3. Custom buffering solutions:

    • Implementing an in-memory buffer with size limits
    • Creating a separate logging microservice

The stdout approach won out due to its simplicity, effectiveness, and built-in support in Cloud Run.

Have you encountered similar logging challenges in your Cloud Run applications? I'd love to hear about your experiences and solutions in the comments below!


For more tips and insights, follow me on Twitter @Siddhant_K_code to stay updated with more detailed tech content like this.
