dev-resources.site
for different kinds of informations.
Chain - a Goofy, Functional, Tree-backed List
What?
Java has array-based List
s for efficient random access; there are LinkedList
for efficient appending. Who needs a tree for List?
Well, hear me out, just for the fun alright?
Once upon a time
Agent Aragorn (Son of Arathorn), Jonny English and Smith collaborated on a bunch of missions.
The Mission
class's signature is like:
abstract class Mission {
abstract MissionId id();
abstract Range<LocalDate> timeWindow();
abstract ImmutableSet<Agent> agents();
}
The goal is to create an ImmutableRangeMap<LocalDate, Set<Agent>>
(RangeMap
is a Guava collection that maps disjoint ranges to values) to account for all the agents during each time window. Note that missions can have overlapping time windows, and agents could work on multiple missions at the same time.
So for missions like:
missions = [{
timeWindow: [10/01..10/30]
agents: [Aragorn, English]
}, {
timeWindow: [10/15..11/15]
agents: [Aragorn, Smith]
}]
I want the result to be:
[10/01..10/15): [Aragorn, English]
[10/15..10/30]: [Aragorn, English, Smith]
(10/30..11/15]: [Aragorn, Smith]
At first I thought to use the toImmutableRangeMap() collector, as in:
missions.stream()
.collect(toImmutableRangeMap(Mission::timeWindow, Mission::agents));
Voila, done, right?
Not quite. My colleague pointed out that toImmutableRangeMap()
does not allow overlapping ranges. It wants all input time windows to be disjoint.
RangeMap
can merge()
The TreeRangeMap
class has a merge() method that already does the heavylifting: finds overlappings and splits the ranges, and then merges the values mapped to the overlapping subrange.
With some effort, I created a toImmutableRangeMap(merger) BiCollector
on top of the merge()
function.
So if what I needed is just to count the number of agents, I could have done:
import static com.google.mu.util.stream.BiStream.biStream;
ImmutableRangeMap<LocalDate, Integer> agentCounts =
biStream(missions)
.mapKeys(Mission::timeWindow)
.mapValues(mission -> mission.agents().size())
.collect(toImmutableRangeMap(Integer::sum));
(It'll double count the duplicate agents though)
Anyhoo, here goes the interesting part: how do I merge the Set<Agent>
?
Quadratic runtime
I could use Guava's Sets.union()
:
import com.google.common.collect.Sets;
ImmutableRangeMap<LocalDate, ImmutableSet<Agent>> agentsTimeline =
biStream(missions)
.mapKeys(Mission::timeWindow)
.mapValues(mission -> mission.agents())
.collect(toImmutableRangeMap((set1, set2) -> Sets.union(set1, set2).immutableCopy()));
The gotcha is that each time merging happens, merging two original sets into one is O(n)
where n is the number of agents from the two overlapping ranges. If we are unlucky, we can get into the situation where a time window is repetitively discovered to overlap with another time window, and we keep copying and copying over again. The time complexity is quadratic.
Stack overflow
Could I remove the .immutableCopy()
? Sets.union()
returns a view that takes constant time so we should be good?
Not really. We don't know how many times merging will happen, a Set
can be unioned, then unioned again for unknown times. In the worst case, we'd create a union-of-union-of-union N levels deep. If N is a large number, we'll stack overflow when we try to access the final SetView
!
The same will happen if for example I use Iterables.concat()
or Stream.concat()
(javadoc discusses this problem).
And in case it wasn't obvious, the merging cannot modify either of the two lists or sets, because they are still associated with the sub-range that doesn't overlap. So we need it to be immutable.
If I had one of the persistent collections in the dependencies, I might just use it (they offer O(logn) performance for concatenation but usually are close to constant time). But I don't. And it doesn't feel like worth it to pull in one of such libraries for a single use case.
Put it in a tree
I slept on this problem for about two days for an idea to come to me: can we use something like Haskell's List
?
Tl;dr, Haskell's List is like LinkedList
except it's immutable. So given a list of [2, 3]
, you can cons
the number 1 onto the list to get a new instance of [1, 2, 3]
. Under the hood it's as simple as creating a new object with the internal tail
pointer pointing to the old [2, 3]
list.
If I can do this, each time merging happens, I only need to pay O(1) cost. The resulting object is probably less efficient for random access than ArrayList
or Guava's ImmutableList
because of all the pointers and indirections. But that's okay. When the whole split-merge process is done, I can perform a final copy into ImmutableList
, which is O(n).
The only problem? Haskell's cons
only allows to add one element, while I have two List<Agent>
s to concatenate (I can't cons
every element from one of the lists, because then I'm getting back to quadratic).
To support concat(list1, list2)
, I decided to use a binary tree to represent the List's state:
private static final class Tree<T> {
final T mid;
@Nullable final Tree<T> left; // null means empty
@Nullable final Tree<T> right; // null means empty
Tree(Tree<T> left, T value, Tree<T> right) {...}
}
In the list, the elements in left
show up first, followed by mid
, then followed by the elements in right
. In other words, an in-order traversal will give us back the list.
The key trick is to figure out how to concatenate two binary trees into one. Intuitively, I need to find the new "mid point" value, which can be either the left
tree's last element, or the right
tree's first element. Say, if I take the right
tree's first element, then the new tree's left
remains the old left
, while the new tree's right
would need to be the old right
after removing the first element.
Wrap it up
Since the Tree is immutable, how do I remove any element at all? And in a binary tree, finding the first element takes up to O(n) time (it's not a balanced tree).
It turns out there's a law in computer science:
All problems in computer science can be solved by another level of indirection
In human language: if a problem can't be solved with one layer of indirection, add two :)
Here goes my second layer of indirection that handles the remove first element from an immutable list task:
public final class Chain<T> {
private final T head;
@Nullable private final Tree<T> tail;
public static <T> Chain<T> of(T value) {
return new Chain<>(value, null);
}
public static <T> Chain<T> concat(Chain<T> left, Chain<T> right) {
T newHead = left.head;
Tree<T> newTail = new Tree<>(left.tail, right.head, right.tail);
return new Chain<T>(newHead, newTail);
}
}
This is quite like Haskell's cons
list, except the tail
is a binary tree instead of another cons
list.
Now because both left and right Chain
already have the first element readily accessible, I can just take the right head
as the new mid point to build the new tree, with the tail from left and the tail from right. This new Tree maintains the order invariant as in left.tail -> right.head -> right.tail
. And of course the left's head becomes the new Chain
head.
It takes a bit of brain gymnastics. But if you sit down and think for a minute, it's actually pretty straight forward.
This solves the O(1) concatenation. And the good thing is that, no matter how deep concat()
is nested, the result is always one layer of Chain
with a heap-allocated Tree
object.
Now we just need to make sure when we iterate through the Chain
, we take no more than O(n) time, and constant stack space.
Flattening tree back to List
My secret weapon is Walker.inBinaryTree() (but you can create your own - it's standard tree traversal stuff). It already does everything I needed:
- O(n) time lazy in-order traversal.
- Constant stack space.
Using it is pretty simple. First we add a stream()
method to the Tree
class:
private static final class Tree<T> {
...
Stream<T> stream() {
return Walker.<Tree<T>>inBinaryTree(t -> t.left, t -> t.right)
.inOrderFrom(this)
.map(t -> t.mid);
}
}
The inOrderFrom()
method returns a lazy stream, which will take at the worst case O(n) heap space and constant stack space.
Then we wrap and polish it up in our wrapper Chain
class:
public final class Chain<T> {
...
/**
* Returns a <em>lazy</em> stream of the elements in this list.
* The returned stream is lazy in that concatenated chains aren't consumed until the stream
* reaches their elements.
*/
public Stream<T> stream() {
return tail == null
? Stream.of(head)
: Stream.concat(Stream.of(head), tail.stream());
}
}
With that, it gives me O(n) time read access to the tree and I can easily collect()
it into an ImmutableList
.
In the actual implementation, I also made Chain implements List
to make it nicer to use, and used lazy initialization to pay the cost only once. But that's just some extra API makeup. The meat is all here.
In Conclusion
Chain
is a simple immutable List
implementation that you can append, concatenate millions of times.
A bit of googling shows that people have run into similar needs but I didn't find a similar implementation that handles both the O(1) concatenation time and stack overflow concern.
Featured ones: