
resolving a surt-timestamp collision should not replace a non-revisit record with a revisit record #77

galgeek opened this issue Jan 16, 2020 · 9 comments



galgeek commented Jan 16, 2020

I observed this with fuzzy match handling; I'm not sure whether it's an issue for other record types.

The original records:

com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=1&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=1&rpc_init=1&conviva_environment=test text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 3163 294009561 ...
com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=2&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704 ...

Only one record is returned from OutbackCDX after ingest:

fuzzy:com,twitter)/i/videos/tweet/731859894710718465? 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704  ...

Note that I was able to update this record properly by re-posting only the status 200 record.


ato commented Jan 16, 2020

Oh, that's unfortunate. It's a pity we can't use RocksDB merge operators from Java. Doing a get before each put seems likely to have a significant effect on bulk data load performance, though I guess we'd at least only need to do the lookup for revisits.
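For illustration only, a minimal sketch (not OutbackCDX's actual code) of what a revisit-only get-before-put could look like against the RocksDB Java API. The `isRevisit` check is a placeholder: the real record encoding is binary, so a string match like this is just for the sketch.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class RevisitAwarePut {
    // Only pay for the extra lookup when the incoming record is a revisit;
    // non-revisit records are still put unconditionally.
    static void putCapture(RocksDB db, byte[] key, byte[] value, boolean incomingIsRevisit)
            throws RocksDBException {
        if (incomingIsRevisit) {
            byte[] existing = db.get(key);
            if (existing != null && !isRevisit(existing)) {
                return; // a non-revisit record already holds this key; keep it
            }
        }
        db.put(key, value);
    }

    // Placeholder: the real check would decode the stored record and compare
    // its MIME field against "warc/revisit" rather than string-matching bytes.
    static boolean isRevisit(byte[] encodedValue) {
        return new String(encodedValue).contains("warc/revisit");
    }
}
```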

@anjackson

We've also observed this with redirects, where the 30x overwrites a 200 record for the same key (mostly http->https redirects) because the indexer happened to process the WARC records in that order.

This is breaking things for us, and because we do large crawls spread over many WARCs, I'm not sure how to fix it. I guess maybe a second pass on the indexer that only posts 200s?
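As a stopgap, a rough sketch of that second pass might look like the following, assuming a plain 11-field CDX file and a local collection endpoint at http://localhost:8080/myindex (both assumptions, adjust to taste): re-POST only the status-200 lines so they are the last write for each colliding key.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

public class RepostTwoHundreds {
    public static void main(String[] args) throws Exception {
        // Keep only lines whose status column (fifth field, index 4) is 200.
        String onlyOkRecords = Files.lines(Path.of("index.cdx"))
                .filter(line -> {
                    String[] fields = line.split(" ");
                    return fields.length > 4 && fields[4].equals("200");
                })
                .collect(Collectors.joining("\n"));

        // Re-POST them so the 200 record wins for each colliding key.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/myindex")) // hypothetical collection URL
                .POST(HttpRequest.BodyPublishers.ofString(onlyOkRecords))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("OutbackCDX responded: " + response.statusCode());
    }
}
```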


ato commented Jan 20, 2020

I guess another solution would be to move the original URL into the database key to break the tie. That'd mean all of the colliding records would be returned in search results, and it'd then be up to wayback to do something sensible in the face of the collision.


anjackson commented Jan 20, 2020

FWIW, based on the not-wholly-unrelated ukwa/ukwa-pywb#53, I think pywb expects to resolve these collisions.

EDIT: but I guess we'd also need to pull the MIME type into the key to avoid the original warc/revisit collision.


ato commented Jan 20, 2020

Well, galgeek's original example had two different URLs (player_id=1 vs player_id=2) that were only the same under the fuzzy rules. But yes, an exact revisit within the same second does sound plausible, particularly in a browser-based crawling scenario. If we add the content type, the question simply becomes: is it a problem if revisits clobber each other?

I guess the real unique identifier for a record is the combination of WARC filename and byte offset. If those are the same then it's the same record and we should clobber. If they differ then it's a different record and we should list it separately, even if everything else is the same.

So...

keyv3 = (urlkey, timestamp)
keyv4 = (urlkey, timestamp, filename, offset)

How to handle the upgrade will need some thought.

Edit: Changed v1/v2 to v3/v4 so we have one record version scheme rather than separate version numbers for keys and values. Versions 0, 1, 2 and 3 all use the same key format.
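To make the idea concrete, here's a hypothetical encoding of a v4 key. The field order matches the tuple above, but the byte layout is only a sketch and may well differ from what #78 actually does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class KeyV4 {
    // Hypothetical layout only: urlkey NUL-terminated, then big-endian timestamp,
    // then WARC filename NUL-terminated, then big-endian byte offset.
    static byte[] encode(String urlkey, long timestamp, String filename, long offset) {
        byte[] url = urlkey.getBytes(StandardCharsets.US_ASCII);
        byte[] file = filename.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(url.length + 1 + 8 + file.length + 1 + 8);
        buf.put(url).put((byte) 0);
        buf.putLong(timestamp);   // e.g. 20160805135628L
        buf.put(file).put((byte) 0);
        buf.putLong(offset);      // byte offset within the WARC
        return buf.array();
    }
}
```

With a layout along those lines, two captures that differ only in filename/offset sort as adjacent but distinct keys, so a prefix scan on (urlkey, timestamp) still finds all of them.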


nlevitt commented Jan 21, 2020

I like this idea.

keyv2 = (urlkey, timestamp, filename, offset)

@anjackson

Making the system handle upgrades nicely is going to be fairly complicated. Would it help to stage the process? For example, start by implementing this as a start-up/CLI option where you can choose the longer key (but need to use separate index files), and look into migration later on once we've got the thing working.


ato commented Feb 17, 2020

@anjackson see #78 for roughly what I had in mind (without migration logic). I haven't really tested it much yet.


ato commented Feb 17, 2020

I think we could implement an online upgrade by roughly this process:

  1. Temporarily until the index has a 'v4 migration complete' flag:
    • PUT should issue a delete for the v3 key and then put the v4 record.
    • DELETE should issue deletes for both the v3 and v4 key.
  2. A background thread should walk the entire index rewriting every v3 record to v4. After all records have been visited, flag the index as 'v4 migration complete'. Might be a good idea to periodically save the key the migration is up to so that it can resume if the process is restarted.

I'm not sure what the performance or disk space implications of doing this are. Mass deletes in RocksDB are a little odd due to tombstones. If an excess of tombstones becomes a problem during migration, it might be necessary to periodically call CompactRange too.
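A very rough sketch of what step 2's background walk could look like with the RocksDB Java API. isV3Key and toV4Key are placeholders for the real key/value decoding, and the batch size and checkpointing are just illustrative.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class V4Migration {
    static void migrate(RocksDB db) throws RocksDBException {
        try (RocksIterator it = db.newIterator();
             WriteOptions opts = new WriteOptions();
             WriteBatch batch = new WriteBatch()) {
            for (it.seekToFirst(); it.isValid(); it.next()) {
                byte[] key = it.key();
                if (!isV3Key(key)) continue;              // already migrated
                batch.put(toV4Key(key, it.value()), it.value());
                batch.delete(key);                        // leaves a tombstone for the v3 key
                if (batch.count() >= 10_000) {            // flush periodically
                    db.write(opts, batch);
                    batch.clear();
                    // TODO: persist the last migrated key so a restart can resume here
                }
            }
            db.write(opts, batch);
        }
        // After the walk: set the 'v4 migration complete' flag and, if tombstones
        // have piled up, consider db.compactRange() to clean them out.
    }

    static boolean isV3Key(byte[] key) { return true; }              // placeholder
    static byte[] toV4Key(byte[] key, byte[] value) { return key; }  // placeholder
}
```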
