
add a function to rebuild from all known intids... all 4.7MM of them. #2

Closed
wants to merge 2 commits

Conversation

cutz

@cutz cutz commented Aug 14, 2019

No description provided.

@jamadden
Member

This is probably a good place for savepoints: https://transaction.readthedocs.io/en/latest/savepoint.html
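
Something like the following, as a minimal sketch (index_one() here is a hypothetical stand-in for the real per-object indexing call): a savepoint before each item means a failure rolls back only that item instead of the whole transaction.

import transaction

def index_all(objects, index_one):
    for obj in objects:
        sp = transaction.savepoint()
        try:
            index_one(obj)
        except Exception:
            # Discard only this object's changes; everything indexed so far survives.
            sp.rollback()
    transaction.commit()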

@jamadden
Member

It suddenly occurs to me that RelStorage OIDs are in fact intids, with essentially no overhead and no risk of conflict, so long as you only need to collect persistent objects. Map implementations for most BTree families could be built directly on RelStorage primitives. But that’s for another time.
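
A rough sketch of that idea, assuming you only ever track persistent objects that already have OIDs (LOBTree and ZODB.utils.u64 are just one way to spell it, not an existing API in this project):

from BTrees.LOBTree import LOBTree
from ZODB.utils import u64

def oid_intid(obj):
    # OIDs are assigned sequentially, so the 8-byte OID fits comfortably
    # in LOBTree's signed 64-bit integer keys.
    return u64(obj._p_oid)

index_map = LOBTree()  # 64-bit int key -> persistent object

def register(obj):
    # Only works once the object has a jar and an assigned _p_oid.
    docid = oid_intid(obj)
    index_map[docid] = obj
    return docid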

@cutz
Author

cutz commented Aug 14, 2019

@jamadden maybe something like this? How much overhead does that add?

@jamadden
Member

I hadn't considered a savepoint for each individual indexed item; I'd only ever seen it done in bulk. That is an elegant way to handle sub-failures! Mayhap that's what nti.metadata needs to do.

As for overhead: until the transaction is finally committed, each time you push another savepoint, the state of every object changed since the previous savepoint is serialized and saved to a tempfile. Now, those are all the same objects that would ultimately have to be serialized to be stored in the database (and guess where RelStorage keeps them temporarily? A tempfile!). However, to the extent that you're modifying many of the same objects repeatedly (internal BTree nodes), that's extra serialization and eventually IO. Worst case, we go from O(modified_objects) to O(modified_objects ^ 2). It shouldn't be anywhere near that bad --- the whole point of BTrees is to avoid writing to the same object too many times --- but there will be some impact.

You could be optimistic and take savepoints per batch instead. If you did that, you'd want to save each batch that failed and re-run those individually at the end. If your optimism is warranted, that could reduce the overhead by a huge factor.

Batching is a good idea for another reason: using the storage prefetch API to bulk load the objects. That's so much faster than a series of individual calls. Unfortunately, because of zopefoundation/ZODB#277 you'll have to poke at internals a bit.

from itertools import islice

import transaction
from ZODB.POSException import POSKeyError

BATCH_SIZE = 1000  # tune to taste

intids = ...  # the IIntIds utility
objects_from_broken_batches = []

object_iter = intids.refs.itervalues()

def take():
    # itertools recipe: return the next batch, or None when exhausted.
    return list(islice(object_iter, BATCH_SIZE)) or None

def index_batch(batch, enter=lambda: None, ex_handle=None):
    # Bulk-load the batch by OID; we reach into the private storage
    # attribute because of zopefoundation/ZODB#277.
    intids._p_jar._normal_storage.prefetch([obj._p_oid for obj in batch])
    for obj in list(batch):  # iterate over a copy so removal below is safe
        entered = enter()
        try:
            docid = intids.getId(obj)  # or directly access the attr for speed
            index_one(obj)
        except POSKeyError:
            # The object is gone, no point in indexing it.
            # TODO: remove it from intids as well.
            batch.remove(obj)
        except Exception:
            if ex_handle is not None:
                ex_handle(entered)
            else:
                raise

# First pass: optimistic, one savepoint per batch.
for batch in iter(take, None):
    savepoint = transaction.savepoint()
    try:
        index_batch(batch)
    except Exception:
        savepoint.rollback()
        objects_from_broken_batches.extend(batch)

# Second pass: retry the failed batches one object at a time,
# with a savepoint around each object.
index_batch(objects_from_broken_batches,
            transaction.savepoint,
            lambda sp: sp.rollback())

@cutz
Author

cutz commented Aug 14, 2019

That makes sense. It turns out savepointing each object added a tremendous amount of overhead. A batching approach as you detailed above looks like the best option to me.

@cutz cutz closed this Oct 8, 2021