
add a function to rebuild from all known intids... all 4.7MM of them. #2

Closed
wants to merge 2 commits

Conversation

cutz

@cutz cutz commented Aug 14, 2019

No description provided.

@jamadden
Member

This is probably a good place for savepoints: https://transaction.readthedocs.io/en/latest/savepoint.html
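
Something like the following, as a minimal sketch (index_one() here is a hypothetical stand-in for the real per-object indexing call): a savepoint before each item means a failure rolls back only that item instead of the whole transaction.

import transaction

def index_all(objects, index_one):
    for obj in objects:
        sp = transaction.savepoint()
        try:
            index_one(obj)
        except Exception:
            # Discard only this object's changes; everything indexed so far survives.
            sp.rollback()
    transaction.commit()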

@jamadden
Member

It suddenly occurs to me that RelStorage OIDs are in fact intids, with essentially no overhead and no risk of conflict, so long as you only need to collect persistent objects. Map implementations for most BTree families could be built directly on RelStorage primitives. But that’s for another time.
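
A rough sketch of that idea, assuming you only ever track persistent objects that already have OIDs (LOBTree and ZODB.utils.u64 are just one way to spell it, not an existing API in this project):

from BTrees.LOBTree import LOBTree
from ZODB.utils import u64

def oid_intid(obj):
    # OIDs are assigned sequentially, so the 8-byte OID fits comfortably
    # in LOBTree's signed 64-bit integer keys.
    return u64(obj._p_oid)

index_map = LOBTree()  # 64-bit int key -> persistent object

def register(obj):
    # Only works once the object has a jar and an assigned _p_oid.
    docid = oid_intid(obj)
    index_map[docid] = obj
    return docid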

@cutz
Author

cutz commented Aug 14, 2019

@jamadden maybe something like this? How much overhead does that add?

@jamadden
Member

I hadn't considered a savepoint for each individual indexed item; I'd only ever seen it done in bulk. That is an elegant way to handle sub-failures! Mayhap that's what nti.metadata needs to do.

As for overhead: until the transaction is finally committed, each time you push another savepoint, the state of every object changed since the previous savepoint is serialized and saved to a tempfile. Now, those are all the same objects that would ultimately have to be serialized to be stored in the database (and guess where RelStorage keeps them temporarily? A tempfile!). However, to the extent that you're modifying many of the same objects repeatedly (internal BTree nodes), that's extra serialization and eventually IO. Worst case, we go from O(modified_objects) to O(modified_objects ^ 2). It shouldn't be anywhere near that bad --- the whole point of BTrees is to avoid writing to the same object too many times --- but there will be some impact.

You could be optimistic and take savepoints per batch instead. If you did that, you'd want to save each batch that failed and re-run those individually at the end. If your optimism is warranted, that could reduce the overhead by a huge factor.

Batching is a good idea for another reason: using the storage prefetch API to bulk load the objects. That's so much faster than a series of individual calls. Unfortunately, because of zopefoundation/ZODB#277 you'll have to poke at internals a bit.

from itertools import islice

import transaction
from ZODB.POSException import POSKeyError

BATCH_SIZE = 1000  # tune to taste

intids = ...  # the IIntIds utility
objects_from_broken_batches = []

object_iter = intids.refs.itervalues()

def take():
    # itertools recipe: return the next batch, or None when exhausted.
    return list(islice(object_iter, BATCH_SIZE)) or None

def index_batch(batch, enter=lambda: None, ex_handle=None):
    # Bulk-load the batch by OID; we reach into the private storage
    # attribute because of zopefoundation/ZODB#277.
    intids._p_jar._normal_storage.prefetch([obj._p_oid for obj in batch])
    for obj in list(batch):  # iterate over a copy so removal below is safe
        entered = enter()
        try:
            docid = intids.getId(obj)  # or directly access the attr for speed
            index_one(obj)
        except POSKeyError:
            # The object is gone, no point in indexing it.
            # TODO: remove it from intids as well.
            batch.remove(obj)
        except Exception:
            if ex_handle is not None:
                ex_handle(entered)
            else:
                raise

# First pass: optimistic, one savepoint per batch.
for batch in iter(take, None):
    savepoint = transaction.savepoint()
    try:
        index_batch(batch)
    except Exception:
        savepoint.rollback()
        objects_from_broken_batches.extend(batch)

# Second pass: retry the failed batches one object at a time,
# with a savepoint around each object.
index_batch(objects_from_broken_batches,
            transaction.savepoint,
            lambda sp: sp.rollback())

@cutz
Author

cutz commented Aug 14, 2019

That makes sense. It turns out savepointing each object added a tremendous amount of overhead. A batching approach as you detailed above looks like the best option to me.

@cutz cutz closed this Oct 8, 2021