Fix vector merge algorithm to prevent the tree depth from growing #35
I have been unhappy with the RRB documentation since I started implementing it nearly 2 years ago. I only really understood the intention of the concatenation balancing algorithm relatively recently. The concatenation algorithm presented there is much more approachable than the one from L'orange. Unfortunately, I suspect that this algorithm bounds the height to O(log(N*C)), where C is the number of concatenations that have occurred. Currently, my implementation also produces these problematic trees. I'm not sure what this means for imbl's vector; it would be possible to implement L'orange's librrb concatenation algorithm with a lot of effort. His actual implementation is not pleasant to look at, but at least it works.
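To make the concern concrete, here is a deliberately naive toy model (not imbl's node representation or its actual algorithm) of a concatenation that only ever joins trees at the root. It shows why the resulting height can end up depending on how many concatenations happened rather than only on the element count; real implementations merge more aggressively, but the worry above is that the bound may still degrade with C.

```rust
// Toy model only -- NOT imbl's node type or its concat algorithm.
// Branching factor kept tiny so the effect is easy to see.
const M: usize = 4;

enum Node {
    Leaf(Vec<u32>),
    Branch(Vec<Node>),
}

impl Node {
    fn height(&self) -> usize {
        match self {
            Node::Leaf(_) => 0,
            Node::Branch(children) => 1 + children.iter().map(Node::height).max().unwrap_or(0),
        }
    }

    // Naive concat: merge at the root if everything fits, otherwise put a new
    // root above both trees. Nothing rebalances under-full nodes lower down.
    fn concat(self, other: Node) -> Node {
        match (self, other) {
            (Node::Branch(mut a), Node::Branch(b)) if a.len() + b.len() <= M => {
                a.extend(b);
                Node::Branch(a)
            }
            (a, b) => Node::Branch(vec![a, b]),
        }
    }
}

fn main() {
    let mut t = Node::Leaf(vec![0]);
    for i in 1..100u32 {
        t = t.concat(Node::Leaf(vec![i]));
    }
    // 100 elements, but the height tracks the number of concatenations,
    // far above the few levels a dense tree with M = 4 would need.
    println!("elements: 100, height: {}", t.height());
}
```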
Thanks, that's super helpful! I hadn't read Stucki's thesis before. The algorithm there is clear, at least, although I'm not sure what properties it's supposed to have. I guess the inputs to
Sorry, I've been pretty busy recently. I meant to add, after your comment, that what L'orange (and the original paper) calls left-packed is a regular M-1 -- M B-tree. I think it's this invariant that is broken, either by a bug in the merge implementation or by the algorithm itself. I'm happy to help fix the implementation if you want? The first thing to do there is ensure those invariants hold.
Sure, I'd be happy for any help you have time for! I am a bit confused about the M-1 -- M trees thing, though. Let's say M=32 and you have a tree that's almost completely dense except that its right-most node has only 1 leaf, and you append a tree that's completely dense. If you want to end up with a 31 -- 32 tree, then you either end up shifting/copying all the leaves or you have to distribute the "missing" 31 entries over 31 different nodes. None of the implementations (or papers) that I saw said anything about distributing the gaps in this way, so for that reason I figured they were going for the other invariant mentioned in the paper: the one where you just have an upper bound on the number of linear search steps you need to take. AFAICT that invariant allows you to have leaves that are way underfull, as long as there aren't too many of them.
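A tiny worked version of that trade-off, scaled down to M = 4 instead of 32 (illustrative only, not imbl code): the leaf sizes after appending a dense tree behind an almost-empty right-most leaf, under the two options.

```rust
fn main() {
    const M: usize = 4;
    // LHS is dense except its right-most leaf holds a single element;
    // RHS is completely dense.
    let lhs: Vec<usize> = vec![4, 4, 4, 1];
    let rhs: Vec<usize> = vec![4, 4, 4, 4];

    // Option A: restore a strict (M-1)..=M layout by shifting everything
    // left -- every element to the right of the gap gets copied.
    let total: usize = lhs.iter().chain(&rhs).sum();
    let mut packed = vec![M; total / M];
    if total % M != 0 {
        packed.push(total % M);
    }
    println!("packed leaves:  {packed:?}"); // [4, 4, 4, 4, 4, 4, 4, 1]

    // Option B: keep the under-full leaf in the middle -- cheap to build,
    // but pure radix search breaks and you rely on the relaxed invariant
    // that only allows a bounded number of such nodes.
    let relaxed: Vec<usize> = lhs.iter().chain(&rhs).copied().collect();
    println!("relaxed leaves: {relaxed:?}"); // [4, 4, 4, 1, 4, 4, 4, 4]
}
```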
I believe the copying and shifting is handled by the rebalancing step. I figured they were using a combination of the two invariants: one to guarantee the tree does not become too lop-sided, and another to facilitate radix searching. You could be right, though, and they are just using the linear search step invariant only. I'll do some reading to try and clarify that.
I've had another read through of imbl's concatenation and the algorithm in Nicolas Stucki's thesis. I'm fairly certain the two correspond pretty much exactly and there are no major differences. The only thing I can think of is that Stucki's algorithm might amount to a shifting of nodes rather than a real rebalancing: the RHS of a concatenation step will have almost M^2 subtrees while the LHS is sparse, and after the step the LHS ends up with M^2 subtrees and the RHS becomes sparse. The sparse part just moves to another part of the tree and can still cause problems. This is an educated guess; I don't have much evidence beyond what I've seen. Also, you were right: they do not specifically guarantee M-1 -- M trees, but they do have a process where they compact nodes with fewer than M - e/2 subtrees. The default value for e seems to be 2 in librrb. I think the next step might be to implement L'orange's concatenation algorithm in imbl. We can grab implementation details from L'orange's librrb, but it is complicated a bit by the library design itself.
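For reference, one reading of that invariant (a sketch with made-up names, not librrb's code): a node produced by concatenation is left alone if its child count is within e of the optimum ⌈total/M⌉, and children with fewer than M - e/2 slots are the ones considered for compaction when a rebalance does happen.

```rust
const M: usize = 32; // branching factor
const E: usize = 2;  // the "extras" parameter; librrb appears to default to 2

/// `child_counts[i]` = number of slots used by child `i` after a concat step.
fn needs_rebalancing(child_counts: &[usize]) -> bool {
    let total: usize = child_counts.iter().sum();
    let optimal = (total + M - 1) / M; // a perfectly packed node needs this many children
    child_counts.len() > optimal + E
}

/// Children smaller than M - e/2 are the candidates that get compacted
/// into their neighbours during a rebalance.
fn is_compaction_candidate(count: usize) -> bool {
    count < M - E / 2
}

fn main() {
    assert!(!needs_rebalancing(&vec![31; 34])); // dense enough: 34 <= 33 + 2
    assert!(needs_rebalancing(&vec![16; 40]));  // lots of half-full children
    assert!(is_compaction_candidate(20) && !is_compaction_candidate(31));
    println!("ok");
}
```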
Sure, that sounds like the logical thing to do. I'll also push a branch I have lying around that has a bunch of debugging printfs and reduces the node size to 4. It makes stepping through examples much easier... By the way, one thing I'd really like to produce is some document (maybe a blog post or something) really explaining the algorithm. I think that's something sorely missing right now...
Ok, I've read
It's kind of silly, but there is another parameter. I think it would be great to write some sort of blog post on this when we get it working, since existing documentation on the RRB tree is just so lacking.
Right, so the way I was thinking about it was that
Okay, so I went through some of the literature again (I found that L'orange's website also links to a YouTube video by Phil Bagwell on RRB trees). They don't seem to actually have a height guarantee (if you look at L'orange's thesis 3.4.3, they have a lg N guarantee in some situations). I think I could implement that concatenation algorithm now and understand what I would need to do for it. I'm more familiar with my own RRB implementation (which I am also rewriting anyway), but I think I could implement it in im/imbl with some exploration.
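For what it's worth, a sketch of one way to read the redistribution step of that concatenation algorithm (the `concat_plan` name and the exact details are a reconstruction, not librrb's code): given the child sizes a concat produced, keep folding the left-most under-full child into its right neighbours until the child count is within e of the optimum.

```rust
const M: usize = 32;
const E: usize = 2;
const MIN_SIZE: usize = M - E / 2; // children smaller than this get compacted

fn concat_plan(sizes: &[usize]) -> Vec<usize> {
    let total: usize = sizes.iter().sum();
    let optimal = (total + M - 1) / M;
    let mut plan = sizes.to_vec();

    while plan.len() > optimal + E {
        // Skip children that are already full enough.
        let mut i = 0;
        while plan[i] >= MIN_SIZE {
            i += 1;
        }
        // Pour the short child's slots into the children that follow it.
        let mut remaining = plan[i];
        while remaining > 0 {
            let merged = (remaining + plan[i + 1]).min(M);
            plan[i] = merged;
            remaining = remaining + plan[i + 1] - merged;
            i += 1;
        }
        // One child's worth of slots has been absorbed; drop the hole.
        plan.remove(i);
    }
    plan
}

fn main() {
    let sizes = vec![32, 3, 32, 5, 32, 2, 32, 4, 32];
    let plan = concat_plan(&sizes);
    // Total slot count is preserved; only the distribution changes.
    assert_eq!(plan.iter().sum::<usize>(), sizes.iter().sum::<usize>());
    println!("{sizes:?} -> {plan:?}");
}
```

The real implementation then has to materialise new nodes (and their size tables) from a plan like this, which is presumably where librrb's code gets unpleasant.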
That would be awesome! So my understanding is that they do guarantee logarithmic height if all you do is concatenation, but as soon as you start mixing that with other operations then the guarantees go away?
I believe what you said makes sense. In particular, I think the only operation you need to worry about with concatenations is splits. Sorry, work picked up and I've had more or less no time to look at this over the last few weeks. I will be able to get a fresh look at it closer to the end of the year. Maybe I should go over my plan with you. I have a private repo where I'm implementing my own RRB tree external to imbl. (It's also external to my other RRB implementation that I'm using in other projects.) I will invite you to the repo when I'm able. It follows a different structure and has different invariants and properties to imbl's, but I find it much clearer to implement it this way. Also, the invariants aren't so dissimilar as to make it not an RRB tree. After I've done testing on my implementation I will report back on how that goes and hopefully implement it in imbl.
Great! Coincidentally, I have also just managed to find time to do some work here. I'm trying to get some better tests, fuzzing, and benchmarking going, in the hope that it will make it easier to change the algorithm with confidence...
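A rough sketch of the kind of split/append stress test described above (this assumes imbl's `Vector` API with `split_off` and `append`; the depth check is left as a comment because the tree height isn't part of the public API, so `debug_height` here is hypothetical):

```rust
use imbl::Vector;

fn main() {
    let mut v: Vector<u32> = (0..10_000u32).collect();
    let mut seed: u64 = 0x9E37_79B9_7F4A_7C15;

    for _ in 0..1_000 {
        // Cheap deterministic LCG to pick a split point.
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let idx = (seed >> 33) as usize % v.len();

        // Split and immediately re-append: contents must be unchanged...
        let tail = v.split_off(idx);
        v.append(tail);

        // ...and ideally the tree should stay shallow, e.g.
        // assert!(v.debug_height() <= expected_max_height(v.len())); // hypothetical accessor
    }

    assert!(v.iter().copied().eq(0..10_000u32));
    println!("contents preserved after 1000 split/append rounds");
}
```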
gonna go with a less inflammatory title... |
Prompted by #33, I dug into `Vector` some more this weekend and I'm not super encouraged by what I found. I made some tweaks to support smaller branching factors (because when the branching factor is large then it's really hard to hit corner cases with proptesting and fuzzing), and I immediately ran into some failed tests. In particular, `test_inserts` caused a panic, and when debugging the panic I found that it was producing very large skinny trees. Now some issues may have been caused by my amateurish attempts to support smaller branching factors, but the skinny tree thing seems to be an issue in unmodified `im` also: if I modify `test_inserts` to make larger vectors, I see that vectors of length 40000-ish already have tree height 7. But with a branching factor of 64, height 3 should already get us to 200k!

Part of my difficulty is that I don't really understand the RRB merge algorithm. The description in the original paper is not really complete, in my opinion, and neither is the one in L'orange's Masters thesis. It appears I'm not the first person with this complaint, see this, and also this and this for more background. @nomad010 you said you had an independent re-implementation of the RRB algorithm -- do you have any thoughts on this?
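For reference, the back-of-the-envelope version of that height claim (a dense tree with branching factor B and height h holds up to B^h elements):

```rust
// Smallest height h such that b^h >= n, i.e. the depth a dense tree needs.
fn min_height(n: u64, b: u64) -> u32 {
    let mut h = 0;
    let mut cap = 1u64;
    while cap < n {
        cap *= b;
        h += 1;
    }
    h
}

fn main() {
    assert_eq!(min_height(40_000, 64), 3);  // 64^3 = 262,144 covers 40,000 easily
    assert_eq!(min_height(200_000, 64), 3); // and still covers 200k
    println!("height 3 suffices up to 262,144 elements at branching factor 64");
}
```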