
My secret developer's journal, where I scribble down all my deepest, darkest Strata secrets.

Thu Sep 24 19:56:10 CDT 2020

It might be nice to have destruction of a Strata not prevent additional searches, only stop appending and housekeeping. This may simplify the shutdown of Amalgamate-based databases, where we'd like to object to the user continuing to write to staging trees, but we ourselves would like to continue to read from staging trees to perform a final amalgamation at shutdown.

That is the purpose of this entry, but I'll note that I do want an internal means of testing data races, and that I'd like to defer the housekeeping queue at shutdown, creating a log of the keys of pages in the housekeeping queue so it can be resumed at the next startup. Now that we're getting into ever more mature applications (yet still alpha applications), we seek the ability to have varying degrees of shutdown, from hard to soft.
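A minimal sketch of what that might look like, assuming a hypothetical `Tree` class; none of these names are the actual Strata API. Destruction flips a flag that rejects writes but leaves searches available for the final amalgamation.

```javascript
// Hypothetical sketch, not the Strata API: destruction stops writes
// and housekeeping, but searches remain available.
class Tree {
    constructor () {
        this.destroyed = false
        this.housekeeping = []
    }

    insert (key, record) {
        // object to the user continuing to write to a staging tree
        if (this.destroyed) {
            throw new Error('destroyed')
        }
        // ... append the record, maybe queue housekeeping ...
    }

    search (key) {
        // still permitted after destruction, so shutdown can read the
        // staging tree to perform a final amalgamation
        // ... descend to the appropriate leaf ...
    }

    destroy () {
        this.destroyed = true
        // defer housekeeping: log the keys of the queued pages so the
        // queue can be resumed at the next startup
        return this.housekeeping.splice(0)
    }
}
```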

Fri Jan 20 00:19:16 UTC 2012

Mutate and Balance in a Single Pass

In traditional b-tree literature, one of the efficiencies of the b-tree is that you can split the nodes of the b-tree by traversing back up the tree, splitting full nodes, after an insert.

In a concurrent implementation, however, we need to lock the parent of the first page to split. This makes that efficient property inefficient.

As in most b-tree literature, I'm going to use split as an example in describing exclusive lock strategies, and then wave my hands and say that merge is pretty much the same thing. That is, I'm going to lie to you. (Waves hands.)

In reality, you don't know that a page needs to be merged simply by visiting it. You will only know that it needs to be merged by looking at the page and both its left and right sibling pages.

Exclusive Write Locks on the Way Down

Balancing on Insert or Delete

If we wanted to take advantage of the fact that we can determine the pages that need to be split while we descend the tree, splitting the pages on the way back up the tree, we would need to lock those pages as we see them.

To prevent deadlock, we've established an ordering of lock acquisition. A descent can obtain a lock on the child or right sibling of a page on which it holds a lock. We are always moving top down, or left to right.

We may make a list of pages we need to split on the way down, but we won't be able to go back to obtain exclusive locks on them. We can only obtain a lock on a page if we hold a lock on its left sibling or parent. We release the parent lock as part of traversal to keep the b-tree lively. The only way to implement insert and split in a single pass is to hold the locks on the way down.

We'd hold them in a queue. When we completed our insert, we'd have to upgrade our read locks to exclusive locks, moving forward through the queue performing the upgrades. Then we'd move backwards through the queue performing splits.
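A minimal sketch of that ordering, assuming a `descent` array of pages taken top down, each holding a read lock with hypothetical `upgrade` and `release` methods, plus a hypothetical `split` helper. The upgrade race described next still applies.

```javascript
// Hypothetical sketch: queue the full pages top down, upgrade their
// read locks moving forward in lock order, then split moving backward.
async function upgradeAndSplit (descent) {
    const queue = descent.filter(page => page.items.length >= page.capacity)
    // upgrade moving forward through the queue, in lock order
    for (const page of queue) {
        await page.lock.upgrade()
    }
    // split moving backward, so the leaf split that overflows the
    // parent happens before the parent split
    for (const page of queue.reverse()) {
        split(page)
    }
    for (const page of queue) {
        page.lock.release()
    }
}
```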

When you upgrade from read to write, you're not guaranteed to be the first descent to get the upgrade. Another descent might get the upgrade, alter the pages of interest, and render your plan for balancing the tree invalid.

To prevent this, we could upgrade from shared to exclusive the first time we encounter a full page. After we've obtained our exclusive lock, we know that we've blocked any other descents from descending into the sub-tree, so our plan for balance will remain valid as we descend the tree, creating a queue of exclusively locked pages to split.

However, if any descendant page of the exclusively locked page is not full, the exclusively locked page will not be split, so we have to empty our queue of exclusively locked pages, releasing the exclusive locks obtained along the way. We have wasted our time waiting on those exclusive locks, while making every other descent that follows our path wait for us to waste our time.

If it were the case that the full page with the potential to split were the root branch page, then our entire b-tree would be locked exclusively for each insert, until the b-tree grew to the point where the root page split. We can hardly call that an efficiency.

With our queue of read locks, we'd have to re-test the conditions after we obtained exclusive locks. We could then proceed to split, if none of the pages involved had changed.
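The re-test itself is cheap; a sketch, continuing with the hypothetical `queue` of upgraded pages from the sketch above.

```javascript
// Hypothetical sketch: after the upgrades, confirm the plan is still
// valid, since another descent may have won the race and split first.
function planStillValid (queue) {
    return queue.every(page => page.items.length >= page.capacity)
}
```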

At this point, however, we've created quite an apparatus to take advantage of a supposedly convenient side-effect of descending the tree for insert. We've added a lot of complexity to the act of insertion.

The best part is, it is only for insertion and split that this efficiency applies. You can't determine if a page needs to be merged without also looking at its siblings. You can't merge a page with its left sibling without first visiting that sibling through a descent of the tree.

It might be possible to do delete and merge in a single pass with a different b-tree structure, one with cached page sizes, or page sizes kept in the parent, but not with the b-tree structure we've chosen here.

Instead, we balance our tree as a separate step from changing its size. We first insert or delete a record. We then split or merge the pages if we determine that it is necessary.

Balancing in a Single Pass

We will know that split is necessary when we have filled a leaf page beyond its capacity. If the parent of the leaf page is at capacity, splitting the leaf page will put the parent of the leaf page beyond its capacity, and it too will need to be split.

After insert, when we know that a split is necessary, we descend the tree again, this time with an intent to split the leaf page. Because we are a concurrent data structure, we might have two inserts complete at the same time, initiating two splits. The split operation will check that the split is still necessary.
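A minimal sketch of the two passes, with hypothetical `descend`, `insert`, and `splitLeaf` helpers.

```javascript
// Hypothetical sketch: insert under a read lock, then descend again
// with intent to split. Two inserts may complete at the same time and
// initiate two splits, so the split re-checks that it is necessary.
async function add (tree, record) {
    const leaf = await descend(tree, record.key, 'read')
    insert(leaf, record)
    const overflowed = leaf.items.length > leaf.capacity
    leaf.release()
    if (overflowed) {
        const again = await descend(tree, record.key, 'write')
        if (again.items.length > again.capacity) { // still necessary?
            splitLeaf(tree, again)
        }
        again.release()
    }
}
```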

Now we are faced with the same temptation, to employ the efficiencies we've weighed. Should we attempt to perform all the necessary splits in a single pass?

We know there will be at least one split, so if we lock exclusively on the way down, we know that at least one of those locks is necessary.

We still face the condition where the leaf is full beyond capacity, the root is at capacity, but the branch page between them is below capacity. We will detect that the root has the potential to split, lock it exclusively, but then find that the lock was unnecessary because it will not actually split.

This unnecessary lock won't occur at every insert, however, only at every leaf split.

In order to determine if splitting in a single pass after insert is worth our while, let's look at how we would split in multiple passes.

Thu Jan 19 22:29:47 UTC 2012

Traversal Rules

Oddly, some of the rules here may not be necessary. I've often been unable to express my designs, because the rules that I'm following may be too many. It may be the case that you could remove one of the rules, and still deadlock will never occur. I'm not trying to define the minimum requirements to prevent deadlock. Any set of requirements that prevents deadlock will do. A handful of requirements too many doesn't matter, so long as I get the desired behavior out of the implementation.

If one of them prevented a necessary b-tree operation, I'd think hard about how to do without. It could be the case that the requirement is obviously unnecessary to prevent deadlock, but I won't see its uselessness until there is a reason to see its uselessness.

I'm not adding rules to be on the safe side. An unnecessary rule is unnecessary. Claiming that an unnecessary rule exists to err on the side of caution is to imply that you're not open to hearing about how it is not necessary. It is defensive.

It is an interesting exercise in admonishment oriented programming to consider how I'd fret about being told that something that has no ill effect is unnecessary.

I've learned that it is good to question your assumptions, because you may discover new utility behind those assumptions. It is easy to make yourself a useless, undirected fop, if you take this too far, but being resolute doesn't do you much good in software. It truly limits your potential.

Java has cubicles built right into the language. It is admonishment oriented programming at its finest. C++ is full of admonishments, but those admonishments are backed by a standard, and by real world experience in developing C++ code designed to run across architectures. Java has admonishments for the sake of having admonishments to deliver.

One of the things that strikes me about my implementation is how much time I spend in Java worrying about encapsulation. As if any component within the b-tree could be reused. That some day, I'd be creating a linked list, and say, oh, hey, a node is a node. I can repackage the b-tree page to be an in memory node, for any linked data structure.

In fact, I did break the b-tree up into separate projects. File backed data storage was a separate project, as was a file structure that acted as a malloc for a file, because it wouldn't do to create a directory on the file system, the b-tree would have to implement journaling on its own. Otherwise, how could it be write once, run everywhere?

Not sure what I was thinking, really. Because I was rational, except when I thought about all the admonishments I'd hear when I produced it.

Sat Dec 10 06:10:25 UTC 2011

Now that I'm not spending all of my energy thinking about storage, it gets easy to imagine different ways to organize the tree. Merging leaves that are in separate trees, for example, would be noticing a pivot located at the beginning or the end of an inner tier on the way down, then going in both directions.

But, I don't have anything working now, and I've not been able to finish this in eight years, so I'm going to try to design this with my thoughts from the Java world.

Oh, all the things that I was thinking, making things swappable, pluggable, which is all nouns, and not verbs.

Okay. The only thing that we need to figure out for the subsequent pass is the bottleneck tier, the one that needs to be locked to squeeze out the rest. When that is locked, we can get down to the leaf that needs to split, figure out if we want to merge to the left or right, or if it is at the end of the tier...

Actually, we need to know where we might swap, that is, it is not the first one that needs to be broken, it is the pivot, so we are going to need to know if we're going to need to swap.

When we are merging to the left, it is easy, we know how to find the pivot on the way down.

Okay. If an item is a pivot, then it is the first item in a list. This matters if we are deleting because we need to swap the pivot. If we are looking to expand a leaf tier so it splits, then it needs to only be in the first tier, it doesn't have to be the first item.

If we go to the left most item in a tier, or the right most, we need to lock the parent. Changing the structure of the tree is only a matter of changing the pivot. Thus, if we have a full leaf, we can split it, or we could give some of it to a neighbor. It is merge, where we want to merge with a neighbor, where we can reach a case where we've created leaves that won't merge.

Also, how do we merge inner tiers?

We can merge everything at once. Or we can merge on each pass. Each merge is a new descent, in order to find the best candidate for merge.

Or you have a merging timer. Oh, how easy is that? Have a worker that does the merge. Oh, how easy is that?
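A sketch of that worker, all hypothetical; `candidates` would be fed the keys of leaves that looked under-full, and `mergeIfPossible` would perform a fresh descent for each.

```javascript
// Hypothetical sketch: a lone worker drains a queue of merge
// candidates on a timer, so merges never block the descent that
// noticed them.
const candidates = []

setInterval(function () {
    while (candidates.length != 0) {
        // each merge is a new descent, in order to find the best
        // candidate sibling for the merge at the time of the merge
        mergeIfPossible(candidates.shift())
    }
}, 1000)
```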

Fri Dec 9 12:43:31 UTC 2011

Append mode, for ordered data. Test leaf reading by calling directly, and by having a hidden option to pass variables for tuning. Use this in production somewhere, like behind prolific.

Tue Apr 19 12:14:38 CDT 2011

The tiers are going to be arrays. To create a special sort of tier, you augment it. Strata will read arrays using an index. Ah, no.

Better still to have an array-like interface, but use a get method. Then, if we ever do want to do memory mapped b-trees, with binary records, that must remain sorted in memory, we have a way to get there, but we can use arrays while developing the tree using the in memory I/O strategy.
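A sketch of that interface, with a hypothetical `Tier` class; the in memory I/O strategy is just an array behind a `get` method.

```javascript
// Hypothetical sketch: an array-like interface with a get method, so
// a memory mapped strategy with binary records could be swapped in
// later without changing the tree logic.
class Tier {
    constructor (items = []) {
        this.items = items
    }

    get (index) {
        return this.items[index] // in memory, just an array index
    }

    get length () {
        return this.items.length
    }
}
```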

Also, when writing memory mapped, you are writing in place. You're going to need a write ahead log somewhere. I believe there's a way to preserve the index. It is a matter of making the final writes change state, but keeping the pointers somewhere, say when draining the root tier, you overwrite the first two branches, but leave the rest of the tier in place. If you fail you can see if the tree is copacetic. Can't we also keep counts in the tiers?

Oh, or you could keep back pointers to parents in children, so that, so long as you can get to the left most tier, you can iterate through all of the tiers, perhaps quickly because you're only reading a little header of each leaf.

Anyway, syncing two records ought to be atomic enough. If not, then we must be able to detect a bad tree and rebuild it.


It occurs to me that things could be simpler if, instead of splitting on the way down, one notes the path one took on the way down and splits only if the leaf tier is ready to split on a second pass. There will come a time when the root tier is populated. At that time, the data structure is no longer concurrent. Every time we go down the tree for insertion, we lock the root, even if we find out we have to unlock it before we're through.

This also means that all inserts and deletes could be done under shared locks, since the updates themselves are going to happen in a single thread, and do not depend on the state above.

Thus, we can keep the sizes of the tiers above us and create a split plan. The most common split plan will be: no split needed. When a split is needed, we descend the tree locking tiers according to our split plan. When we reach a tier that has a different number of records than when we created the split plan, we know our split plan is junk, and we ignore it. But, all along we were building a new split plan, just in case, so we hold our locks (rather than unwinding them) and continue down the tree. We check to see if we have a new split plan that requires splits. If so, we undo all our locks and go down the tree again. We repeat until we actually perform a split plan or no split plan is necessary.
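A sketch of the retry loop, with hypothetical `descendWithPlan`, `performSplits`, and `release` helpers.

```javascript
// Hypothetical sketch: descend with the plan, building a replacement
// plan along the way. If a tier changed, release the locks and retry
// with the new plan, until a plan is performed or no split is needed.
async function balance (tree, key, plan) {
    for (;;) {
        const { valid, locks, nextPlan } = await descendWithPlan(tree, key, plan)
        if (valid) {
            performSplits(locks)
            release(locks)
            break
        }
        release(locks)
        if (nextPlan.splits.length == 0) {
            break // the most common split plan: no split needed
        }
        plan = nextPlan
    }
}
```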

This is going to be simpler because it will not require unwinding locks. We're already doing the retried descent trick to reap lonely tiers, so at some point I already decided that this was okay. And it is okay, since the path needing a split is going to be hot, in memory.

Funny, but the trolls in my head are screaming, "DRY, DRY, DRY!" Except that I'm not repeating myself, the algorithm is. The trolls in my head, the inner troll. That is like the animal, Resistance, in The War of Art. They've been able to guide my development when developing in Java.


Methodologies devolve into superstition.


When I look at what Strata was, every time I have to dip into the Java to remind myself of something, I cringe. All those artifacts. All those many little classes that exist solely to preserve type safety. How are you to read a program written in this way? You can't print it. It's a hundred pages, most of which is interface definition and re-definition. Dozens of little classes that serve as type shims and type transforms. Patterns, patterns, patterns. The algorithm is impossible to see.


Now it makes a lot of sense, the secondary descent or the split plan, since I can now see how a primary index can act as its own write-ahead log, with our punt to the file system strategy. The records get appended to the last leaf, which means that it is essentially appending to a file. We can sync this frequently to get our log.

We don't have any locks, really, until the last page is full, and then it's time to split, but at that point we can really stop the show, since splits are infrequent compared to inserts, and merges in this use case almost never happen.

When we lock to stop the show, our leaves are already flushed. We only need to sort out the inner tiers, then flush those.

We can keep a reference to the first leaf, which is never changed. In the current algorithm, it will always exist, being the default if the child is empty, so our tree is built on top of a linked list that is our write ahead log.

It is not going to cost anything to keep a reference to the parent in storage, so we can use that to find our way back up the tree during a recovery. We could even keep a log of tree operations prior to manipulating the tree. In fact, that's what we do: write out our operations to a temporary file, rename the file, perform the operations, unlink the file. Expensive, but infrequent and very easy to comprehend.
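A sketch of that dance, using only the `fs` module; the file names are hypothetical.

```javascript
// Hypothetical sketch: write the planned operations to a temporary
// file, rename it into place to commit, perform the operations, then
// unlink. At recovery, a log that is present must be replayed.
const fs = require('fs/promises')

async function logged (directory, operations, perform) {
    const temporary = `${directory}/commit.tmp`
    const commit = `${directory}/commit.log`
    await fs.writeFile(temporary, JSON.stringify(operations))
    await fs.rename(temporary, commit) // the commit point
    await perform(operations)
    await fs.unlink(commit)
}
```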

It makes it difficult to continue with the current implementation, which is so complicated.

If we are storing records sorted in the file, using memory mapped files, writes are going to be much more fragile. There are levels of indirection: the tail of the tier is the positions of the ordered records, while the records themselves are appended, at times stomping on the ordered part, causing it to be rewritten. A bad page will only mean sorting out that page.

I can't imagine a world in which you wouldn't hand over the minutae to the file system. Attempting to manage blocks yourself within a larger file is a losing proposition, or an area of study for someone who was bitten by the problem.

Memory mapped files, then, would have two files for each page, records and order. Order is getting shuffled around constantly, while the records are not. The records can be non-binary, actually, or they could be binary.

Oh, my. With an order table, we can use JSON for record storage and use memory mapped files to get to our records and punt memory management to the kernel.

Write the record first to the file. Sync it so we know it's there. Insert the position into the order table. Now, when we create a new Strata, and it opens up the page, it works its way back from the end of the page, checking that the records are in the correct spot in the order table, maybe checking a dozen or so to be sure.
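A sketch of that write ordering, using the synchronous `fs` calls, with a hypothetical in memory `order` array of record positions.

```javascript
// Hypothetical sketch: the record reaches disk before its position is
// published to the order table, so a crash can orphan a record but
// never index garbage.
const fs = require('fs')

function append (fd, order, record) {
    const json = JSON.stringify(record) + '\n'
    const position = fs.fstatSync(fd).size
    fs.writeSync(fd, json, position)
    fs.fsyncSync(fd) // sync it so we know it's there
    order.push(position) // only now insert the position
}
```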

If locality of reference is important, then we can copy over the contents of the page into a new page, in the order given by the order table.

In fact, when the tiers are logs, we can always find the last operation on the tier, and from there we can check that the operation completed.

For example, an inner tier split will first create the right side tier, which is unlinked, so no one sees it. Fail now and it is a garbage file. At startup, we can walk back through the tier files. How do we collect it? We can write the initial size in the header and if it is that size, then we know it is very young, but it might also be an archive. We can write an id in the header, but we don't know that this is going to be incremental. We can write a success record at the very end when the operation is over? That is easier to find.

Memory mapping only really makes sense, though, when we have really big files, so there may be some desire to put everything in one crazy big file. We add to the end of the file and remove from the front for garbage collection. This is a strategy for files that tend to only grow. We can make the leaves one big indexed log and memory map it, for appending trees. When the primary key is auto incremented or a timestamp, this is possible.

But, by managing inner tiers in some fashion, we're getting into creating our own cleanup, when we could rely on the filesystem.


Okay. I don't know where I'm getting the notion that memory mapping is only good for big files. Probably some Javadoc somewhere.


The bugs I'm finding are telling me that the Java code never worked. I don't believe I ever got it working again after my last "refactoring".


Write locking on the way down during an operation is the wrong way to go. We don't know if we need those locks until we complete our operation. The retry method is going to be just as fast. The retry, if it is moot, is as fast as a read. It will probably fail fast, especially if we read lock, check the size, then upgrade at the first level we need to check for split or merge. Thereafter we can be exclusive, but if that first one has changed, we leave it read and create our subsequent plan, which is probably a do-nothing plan.

Locking for mutation on the way down is the wrong thing. You know it, I know it, and the American people know it. Don't preserve it when you move in a new direction, since it will be convoluted.

Sun Apr 17 02:33:50 CDT 2011

I really need to get this away from the Java code. Looking at it, my mind keeps boggling at the complexity. All those many lines of code to preserve the type information. It really hurts to think of that complexity because it consumed so much of my time and effort.

Sat Apr 16 15:19:40 CDT 2011

Plan to punt to the filesystem on storage, under the assumption that the filesystem is going to do all things I/O better than I can do in Node.js.

That is, no attempt to split a file into partitions. I will create a file for each tier of the tree.

With this, it gets much easier to think about safely writing data. If I have a dozen dirty pages, I can write them out to temp files, then unlink the existing file and rename the temp file.

Or, I could decide that each tier is really a log, and the contents are not sorted in the file. Then I can append records to each file, and delete records by appending a delete operation. The contents are sorted as the tier is read into memory.

That strategy presumes that a failed append will not corrupt the entire file.

That strategy also requires an occasional cleanup, but it also permits tiers with keys and contents of arbitrary length, making for clustered indexes and the like.
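A sketch of reading such a tier, with a hypothetical one-JSON-operation-per-line format.

```javascript
// Hypothetical sketch: inserts and deletes are both appends; the tier
// is only sorted as it is read into memory.
function play (lines) {
    const records = new Map()
    for (const line of lines) {
        const operation = JSON.parse(line)
        if (operation.type == 'insert') {
            records.set(operation.key, operation.record)
        } else {
            records.delete(operation.key) // a delete is just an append
        }
    }
    return Array.from(records.keys()).sort().map(key => records.get(key))
}
```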

The implementation comes close to being a write ahead log in itself. Each append can have a nanosecond timestamp and the last record can have a flag saying it is the last record for the batch. Then we only have to write the tiers from the bottom up, so that this okay flag is actually in the top most record written. To determine if a record wrote correctly, we keep a lookup of the okay timestamps at each level. We need to then keep a path to give to the I/O strategy on the way down. That I/O strategy would have to keep track of a next mutation number, and why not just log it when you log everything else?

There was no way for me to think this way when I was implementing in Java.

To vacuum a tier, write out the compressed contents into a new tier with the name tier.tmp. If there is a hard shutdown at this point, on startup discard the temporary tier. When the temp file is written, unlink the original. If there is a hard shutdown now, the missing original tells us that the temp is ready to be put in place. Rename the temp file to the primary file.
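A sketch of the vacuum and its recovery, using only `fs`; every crash leaves a state the startup scan can classify.

```javascript
// Hypothetical sketch: the presence or absence of the original tier
// file tells recovery whether the temp file is garbage or complete.
const fs = require('fs/promises')

async function vacuum (tier, compressed) {
    await fs.writeFile(`${tier}.tmp`, compressed)
    await fs.unlink(tier) // once gone, the temp is the tier
    await fs.rename(`${tier}.tmp`, tier)
}

async function recover (tier) {
    try {
        await fs.access(tier)
        // original present: any temporary tier is garbage, discard it
        await fs.unlink(`${tier}.tmp`).catch(() => {})
    } catch (error) {
        // original missing: the temp is ready, put it in place
        await fs.rename(`${tier}.tmp`, tier)
    }
}
```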

Wed Apr 13 03:22:04 CDT 2011

Dear diary,

The threaded version used a read write lock and locked on the way down. Readers can pass through other readers' locks, but wait on writers. Writers wait on any lock. The evented version can do the same thing, really, but you'd have to be clever about callbacks. A reader would find a tier that has something going on and would hang a callback in a queue associated with that tier and give up the main thread.
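A sketch of that evented lock, all hypothetical; readers pass through readers, and everyone queues a callback behind a writer.

```javascript
// Hypothetical sketch of an evented read/write lock. Readers share,
// writers are exclusive, and waiters hang a callback on the tier.
class Lock {
    constructor () {
        this.readers = 0
        this.writer = false
        this.waiting = []
    }

    read (callback) {
        if (this.writer || this.waiting.length) {
            this.waiting.push({ write: false, callback })
        } else {
            this.readers++
            callback()
        }
    }

    write (callback) {
        if (this.writer || this.readers) {
            this.waiting.push({ write: true, callback })
        } else {
            this.writer = true
            callback()
        }
    }

    release () {
        if (this.writer) {
            this.writer = false
        } else {
            this.readers--
        }
        if (this.waiting.length == 0) {
            return
        }
        if (this.waiting[0].write) {
            // admit the waiting writer once the last reader leaves
            if (this.readers == 0) {
                this.writer = true
                this.waiting.shift().callback()
            }
        } else {
            // admit the run of readers at the head of the queue
            while (this.waiting.length && !this.waiting[0].write) {
                this.readers++
                this.waiting.shift().callback()
            }
        }
    }
}
```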

All this callbackery is going to be fragile. If there is some problem, the callbacks will all get flushed with an error.