Two Lines of Code Eluded me for Several Months

Published at
1/16/2024
Categories
scala
async
threading
programming
Author
jonneufeld

Sigh

I’m opening with a sigh because several months of flipping switches, wiresharking, and scrolling through logs did not teach me much. I have coding experience reaching back to the days of the Nintendo 64, and yet most of it proved useless to me this time. Was I lost in the jungle of cryptic frameworks? Bushwhacking through the brush of spotty documentation? That’s what I keep asking myself. But as I watched the pirouette of bits and bytes scrolling past my terminal window, I mused on how this intractable exercise left my wheels spinning rather than my brow perspiring.

What I didn't know was that I had come face-to-face with a bug that would teach me a lot about myself.

Architecture Complexity

Our enterprise application runs a handful of microservices in the cloud, bound together with gRPC, an open-source remote procedure call framework originally developed at Google that lets services call one another over a network. One of these microservices consumes events produced by the other microservices.

Too many events spilled out into the void of cyberspace, and there was no clear indication why. All I had to work with was one mysterious message, repeated over and over again, ad nauseam. It was the same message that would torment me for the next several months no matter what I threw at it:

Upstream producer failed with exception, removing from MergeHub now

This proved to be only a tiny piece of a larger puzzle. I learned a great deal about the frameworks involved, but little of it proved useful in diagnosing and solving the problem.

Virtual Ghost Town

An isometric view of a city with circuits for highways and chips for buildings

There wasn’t a great deal of documentation available for the frameworks involved, and what was available was often wrong. Furthermore, community support forums were sparse ghost towns. I was fortunate to get any reply at all. In retrospect, I found that no documentation was more helpful than poor documentation, and there was no substitute for someone willing to listen, understand, and provide guidance.

I was on my own. Either that, or I was exhausted and disillusioned by googling. I can only think of a handful of times during my career when web searching was no help to me, and this was one of them. The problem took me several months to solve because working on it was so exhausting and demoralizing. I kept having to set it aside to work on something else so that I could experience a win every now and then.

Reproducing the Issue Locally

Running our services locally did not trigger the issue, but I knew that to make any significant progress diagnosing the problem, I would have to reproduce it locally. Eventually, I faced a choice: continue working with the existing code, tweaking it as necessary in an attempt to reproduce the problem, or invest some time in writing custom producers and consumers based on the original code. The latter offered a great deal more flexibility and control, but I was dissuaded for some time by the opportunity cost of writing two brand-new throw-away components.

Ultimately, I bit the bullet and wrote the custom sub-modules despite feeling hopeless about it and doubtful that it would change anything.

I was wrong. It changed everything.

Not only was I able to reproduce that stupid error message, but I was able to reproduce it reliably. I just needed sufficient volume to overwhelm the consumer like a burst dam.
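To make the idea of a throw-away reproduction harness concrete, here is a minimal sketch in plain Scala. It is not the harness from this story and does not use the real framework. `ReproHarness`, `Outcome`, and every parameter name are hypothetical, and the "events" are just integers. It drives a producer and a consumer in lock-step so that the volume at which a bounded buffer starts losing events can be found deterministically.

```scala
import java.util.concurrent.ArrayBlockingQueue

object ReproHarness {
  // Result of one run: how many events were consumed and how many were lost.
  final case class Outcome(consumed: Int, dropped: Int)

  /** Drives a hypothetical producer and consumer in lock-step "ticks".
    * On each tick the producer offers `producedPerTick` events, then the
    * consumer takes at most `consumedPerTick`. Events that do not fit in
    * the bounded buffer are counted as dropped.
    */
  def run(ticks: Int, producedPerTick: Int, consumedPerTick: Int, bufferSize: Int): Outcome = {
    val buffer = new ArrayBlockingQueue[Int](bufferSize)
    var consumed = 0
    var dropped = 0
    var nextId = 0

    for (_ <- 1 to ticks) {
      // Producer side: offer() returns false instead of blocking when the buffer is full.
      for (_ <- 1 to producedPerTick) {
        if (!buffer.offer(nextId)) dropped += 1
        nextId += 1
      }
      // Consumer side: limited throughput per tick.
      var taken = 0
      while (taken < consumedPerTick && !buffer.isEmpty()) {
        buffer.poll()
        consumed += 1
        taken += 1
      }
    }

    // Drain whatever is left once the producer has stopped.
    while (!buffer.isEmpty()) {
      buffer.poll()
      consumed += 1
    }
    Outcome(consumed, dropped)
  }
}
```

Calling `ReproHarness.run(ticks = 10, producedPerTick = 5, consumedPerTick = 5, bufferSize = 8)` loses nothing, while raising `producedPerTick` to 20 overwhelms the buffer and produces drops on every tick. Owning both ends makes the volume a knob you can turn until the failure appears on demand.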

Despite finally managing to reproduce the issue locally, I still lacked clues. There was nothing obvious from the code, but it was clear that the framework didn’t like something about it.

Alone in the Woods of Static Analysis

A man standing in-front of a room cluttered with objects depicting various software engineering icons

One of the most laborious and mentally exhausting exercises is the static analysis of code to understand how it works. This was the next step in attempting to diagnose the root cause of the error message. One of the quickest ways to trip yourself up with this approach is to make assumptions. In retrospect, I didn’t even realize many of the assumptions I was making while reverse-engineering the third-party framework that generated the error. I had to constantly remind myself that you don’t know what you don’t know, which became a personal mantra for inductive reasoning.

I spent weeks in careful static analysis, often reviewing the same code paths more than once or twice. I had a couple of doubts floating around in my head while I was on this quest—doubt that I would find any reliable answers, and doubt that I understood how the framework actually worked. With time, I enumerated a few potential outcomes (given certain inputs), and one led to a surprising discovery.

Eureka

After several months, I cracked the case, and it was a two-line fix. The consumer kept shutting down these short-lived streams before consuming all of the events contained within them. The cause was an incorrect assumption made by an engineer (no longer with us), and I had overlooked it because I trusted that he understood it accurately. This was another important lesson I learned: never skip over code written by other engineers and assume that it works as intended. Normally, code review would stop bugs like this in their tracks.
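The actual two-line fix is not shown here, so the following is only a hypothetical illustration of this class of bug in plain Scala, not the real code or the framework's API. `DrainBeforeClose`, `Signal`, `Event`, and `Complete` are invented names, and a queue with an explicit completion marker stands in for a short-lived stream. The flawed consumer assumes that one batch is the whole stream and closes early. The corrected one reads until the producer signals completion.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

object DrainBeforeClose {
  // Hypothetical stream protocol: zero or more events, then a completion marker.
  sealed trait Signal
  final case class Event(id: Int) extends Signal
  case object Complete extends Signal

  // Builds a short-lived stream holding n events followed by Complete.
  def newStream(n: Int): LinkedBlockingQueue[Signal] = {
    val q = new LinkedBlockingQueue[Signal]()
    (1 to n).foreach(i => q.put(Event(i)))
    q.put(Complete)
    q
  }

  // Flawed assumption: a single batch is the whole stream, so close right after it.
  def consumeOneBatch(stream: LinkedBlockingQueue[Signal], batchSize: Int): Vector[Int] = {
    val seen = ArrayBuffer.empty[Int]
    var done = false
    while (!done && seen.size < batchSize) {
      stream.take() match {
        case Event(id) => seen += id
        case Complete  => done = true
      }
    }
    stream.clear() // "closing" the stream discards everything still queued
    seen.toVector
  }

  // Corrected: keep reading until the producer says it is finished.
  def consumeUntilComplete(stream: LinkedBlockingQueue[Signal]): Vector[Int] = {
    val seen = ArrayBuffer.empty[Int]
    var done = false
    while (!done) {
      stream.take() match {
        case Event(id) => seen += id
        case Complete  => done = true
      }
    }
    seen.toVector
  }
}
```

With a ten-event stream and a batch size of four, `consumeOneBatch` returns only the first four events and silently loses the rest, while `consumeUntilComplete` returns all ten. The difference between the two is a loop condition, which is why a bug of this shape can hide in code that looks correct at a glance.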

Lessons Learned

To wrap up, here are the top three lessons I learned from this epic saga.

  1. No documentation is more helpful than poor documentation.
  2. Sometimes, it’s worth biting the bullet and absorbing the opportunity cost of writing some throw-away reproducible test case components in order to troubleshoot a complex problem.
  3. The saying “you don’t know what you don’t know” is a great reminder to be more careful about making assumptions.

These three lessons have one thing in common: incorrect assumptions lead to incorrect conclusions.

Conclusion

I hope my story helps other developers who are stuck with a difficult, intractable, and flaky problem that isn’t easy to reproduce. If I were to do it all over again, I would periodically and thoroughly examine the assumptions I was making, because if there’s anything I’ve learned from this journey, it’s that incorrect assumptions eventually lead to incorrect conclusions.
