Logo

dev-resources.site

for different kinds of informations.

Beware of the DashMap deadlock

Published at
3/29/2024
Categories
rust
deadlock
dashmap
tech
Author
Benjamin Kampmann
Categories
4 categories in total
rust
open
deadlock
open
dashmap
open
tech
open
Beware of the DashMap deadlock

Rust is famously build for the multi-threaded-processor world. From its core ownership-enforcement-model up to the type-based Sync + Send-types, all is around allowing the compiler to ensure memory safety and consistency across thread boundaries. And though the std also has collections (like HashMap and BTreeSet), Atomics and Locks, once you start building real programs with Rust, probably some tokio for async-support as well, these are not always sufficient.

DashMap for the Win

No wonder that you can pick from a handful of libraries helping you achieve this feat, and quite a feat that is, with DashMap being among the most popular with a whopping 52million downloads on crates.io at the time of this writing:

Screenshot of the crates.io entry for dashmap

52million downloads, years since publication, updated just a few months ago. That looks like a reasonably sound library. So, you start playing with it and find that its API is convenient, seems to work across awaits and async and all the things you've ever dreamed of. So you implement it as the main caching layer for the transient state machine of models within the core business logic of your application.

Deadlocks? Really?

I is only a long while later, until the first reports come in. Sparse at first and unclear in its origin, but sometimes, it seems, your state machine processing doesn't process the events coming in. Or, better - as you learn when digging into it - their futures never resolve.

You see, already known since at least December 2022 is that you can use DashMap in a way that can cause deadlocks and without the compiler detecting them.

A primer on dead locks

If you have no further knowledge about locks and have never heard of deadlocks, let me give you a minimal rough cut of the problem (overly simplified): Sometimes you have memory that is accessed by multiple threads, but clearly if both write at the same time this can cause problems. Thus, the concept of "locks" was created: small pieces around the memory that you need to have hold of first before you can write to that memory. While it is locked no other thread can write to it and thus have to wait for their turn. Ensuring they all write-one-at-a-time and not in between one another.

Now, how ever long you hold that lock is your prerogative and there are several problems with holding a lock very long. For example: what if your thread panics while you hold the lock? This in rust is usually referred to as a "poisoned lock", you might have seen that Error in the std, and how to deal with that depends on the specific code.

In this case, we are looking into a so called dead-lock situation. This can even be cause by a single thread easily: when you hold the lock and your code, running on the same thread, for whatever reason, tries to acquire the same lock while still holding the lock. This stops the execution as the thread is waiting on the lock it itself is holding and thus preventing from being released.

This type of scenario can be and in the std-cases is detected by the rust compiler (yay), but not in the case of DashMap. As DashMap actively allows for locks to be held over await-points (that is kinda its jam ... that it allows the user to do that), it isn't possible for the compiler to figure out that this might lead to a dead lock.

How to avoid that problem

The best advice is the one given from [Alice in her post from January 2022] already:

It is especially important to follow the advice about never locking it in async code when using dashmap for this reason.

While this is good advice, this isn't really mentioned on the docs of DashMap and considering there is nothing detectable wrong with the examples showing the problem when looking at the code this is quite the foot gun.

However, the code in question in our case doesn't even hold any locks over await-points, yet it seems to deadlock in some race condition scenarios.

Then you only find out about it after some long debugging and researching the github issues of that dependency. Taking all that into account, and then that there is no real way for you to create tests or otherwise automatically ensure it isn't reintroduced by any further update the code ... I consider this pretty harmful.

What to do about it?

Well, supposedly, this is fixed the in next big iteration of DashMap, which is said to have async support by getting rid of locks entirely, but with the issue open since 2021 and most of the ideas of how to avoid the locks being discounted for now, there is no telling when this come - if ever. What I have seen most people do referencing that issue, and what we also ended up doing is: replace or at least remove DashMap from the code base.

In our case we replaced it with the up and coming scc, which uses a different locking concept and has the additional benefit of being faster. Others have opted for cachemap2 or replaced it with the std lock & hashmap: there at least the compiler will tell you if you accidentally created a dead-lock-scenario.

No disrespect

I am not writing this post to shit on the authors of DashMap, nor its contributors or maintainers. Building a async-safe lock-free-ish collection is a hard task. One that I wouldn't even really want to attempt myself. I still personally don't even understand why this deadlocks internally myself, nor would I consider trying to patch it either - considering that they haven't done it yet makes leads me to believe this isn't an easy thing to do. As such I don't think anyone should be mad about them either, call them names or do any of the other nasty things the internet can do to people that lost its favor.

I am raising this issue because this is a pretty widely spread library, probably the most popular for the concurrent hashmaps and this is a severe problem that you should know about when using it. That's why I spent a significant amount of this post explaining the core problem and how to avoid it. So, if you use DashMap and want to continue using it, you know what to look out for now.

Edit 2024-04-01: Edited for clarity, based on corresponding feedback and removed a misleading quote.

Featured ones: