
resolving a surt-timestamp collision should not replace a non-revisit record with a revisit record #77

galgeek opened this issue Jan 16, 2020 · 9 comments



galgeek commented Jan 16, 2020

I observed this with fuzzy match handling; I'm not sure whether it's an issue for other record types.

The original records:

com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=1&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=1&rpc_init=1&conviva_environment=test text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 3163 294009561 ...
com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=2&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704 ...

Only one record is returned from OutbackCDX after ingest:

fuzzy:com,twitter)/i/videos/tweet/731859894710718465? 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704  ...

Note that I was able to update this record properly by re-posting only the status 200 record.


ato commented Jan 16, 2020

Oh, that's unfortunate. It's a pity we can't use RocksDB merge operators from Java. Doing a get before each put seems likely to have a significant effect on bulk data load performance, though I guess we'd at least only need to do the lookup for revisits.
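For illustration only, a minimal sketch (not OutbackCDX's actual code) of what a revisit-only get-before-put could look like against the RocksDB Java API. The `isRevisit` check is a placeholder: the real record encoding is binary, so a string match like this is just for the sketch.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class RevisitAwarePut {
    // Only pay for the extra lookup when the incoming record is a revisit;
    // non-revisit records are still put unconditionally.
    static void putCapture(RocksDB db, byte[] key, byte[] value, boolean incomingIsRevisit)
            throws RocksDBException {
        if (incomingIsRevisit) {
            byte[] existing = db.get(key);
            if (existing != null && !isRevisit(existing)) {
                return; // a non-revisit record already holds this key; keep it
            }
        }
        db.put(key, value);
    }

    // Placeholder: the real check would decode the stored record and compare
    // its MIME field against "warc/revisit" rather than string-matching bytes.
    static boolean isRevisit(byte[] encodedValue) {
        return new String(encodedValue).contains("warc/revisit");
    }
}
```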

@anjackson

We've also observed this with redirects, where the 30x overwrites a 200 record for the same key (mostly http->https redirects) because the indexer happened to process the WARC records in that order.

This is breaking things for us, and because we do large crawls spread over many WARCs, I'm not sure how to fix it. I guess maybe a second pass on the indexer that only posts 200s?
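As a stopgap, a rough sketch of that second pass might look like the following, assuming a plain 11-field CDX file and a local collection endpoint at http://localhost:8080/myindex (both assumptions, adjust to taste): re-POST only the status-200 lines so they are the last write for each colliding key.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

public class RepostTwoHundreds {
    public static void main(String[] args) throws Exception {
        // Keep only lines whose status column (fifth field, index 4) is 200.
        String onlyOkRecords = Files.lines(Path.of("index.cdx"))
                .filter(line -> {
                    String[] fields = line.split(" ");
                    return fields.length > 4 && fields[4].equals("200");
                })
                .collect(Collectors.joining("\n"));

        // Re-POST them so the 200 record wins for each colliding key.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/myindex")) // hypothetical collection URL
                .POST(HttpRequest.BodyPublishers.ofString(onlyOkRecords))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("OutbackCDX responded: " + response.statusCode());
    }
}
```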


ato commented Jan 20, 2020

I guess another solution would be to move the original URL into the database key to break the tie. That'd mean all of the colliding records would be returned in search results, and it'd then be up to wayback to do something sensible in the face of the collision.


anjackson commented Jan 20, 2020

FWIW, based on the not-wholly-unrelated ukwa/ukwa-pywb#53, I think pywb expects to resolve these collisions.

EDIT: but I guess we'd also need to pull the MIME type into the key to avoid the original warc/revisit collision.


ato commented Jan 20, 2020

Well, galgeek's original example had two different URLs (player_id=1 vs player_id=2) that were only the same under the fuzzy rules. But yes, an exact revisit within the same second does sound plausible, particularly in a browser-based crawling scenario. If we add the content type, the question simply becomes: is it a problem if revisits clobber each other?

I guess the real unique identifier for a record is the combination of WARC filename and byte offset. If those are the same then it's the same record and we should clobber. If they differ then it's a different record and we should list it separately, even if everything else is the same.

So...

keyv3 = (urlkey, timestamp)
keyv4 = (urlkey, timestamp, filename, offset)

How to handle the upgrade will need some thought.

Edit: Changed v1/v2 to v3/v4 so we have one record version scheme rather than separate version numbers for keys and values. Versions 0, 1, 2 and 3 all use the same key format.
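To make the idea concrete, here's a hypothetical encoding of a v4 key. The field order matches the tuple above, but the byte layout is only a sketch and may well differ from what #78 actually does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class KeyV4 {
    // Hypothetical layout only: urlkey NUL-terminated, then big-endian timestamp,
    // then WARC filename NUL-terminated, then big-endian byte offset.
    static byte[] encode(String urlkey, long timestamp, String filename, long offset) {
        byte[] url = urlkey.getBytes(StandardCharsets.US_ASCII);
        byte[] file = filename.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(url.length + 1 + 8 + file.length + 1 + 8);
        buf.put(url).put((byte) 0);
        buf.putLong(timestamp);   // e.g. 20160805135628L
        buf.put(file).put((byte) 0);
        buf.putLong(offset);      // byte offset within the WARC
        return buf.array();
    }
}
```

With a layout along those lines, two captures that differ only in filename/offset sort as adjacent but distinct keys, so a prefix scan on (urlkey, timestamp) still finds all of them.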


nlevitt commented Jan 21, 2020

I like this idea.

keyv2 = (urlkey, timestamp, filename, offset)

@anjackson

Making the system handle upgrades nicely is going to be fairly complicated. Would it help to stage the process? For example, start by implementing this as a start-up/CLI option where you can choose the longer key (but need to use separate index files), and look into migration later on once we've got the thing working.


ato commented Feb 17, 2020

@anjackson see #78 for roughly what I had in mind (without migration logic). I haven't really tested it much yet.


ato commented Feb 17, 2020

I think we could implement an online upgrade by roughly this process:

  1. Temporarily until the index has a 'v4 migration complete' flag:
    • PUT should issue a delete for the v3 key and then put the v4 record.
    • DELETE should issue deletes for both the v3 and v4 key.
  2. A background thread should walk the entire index rewriting every v3 record to v4. After all records have been visited, flag the index as 'v4 migration complete'. Might be a good idea to periodically save the key the migration is up to so that it can resume if the process is restarted.

I'm not sure what the performance or disk space implications of doing this are. Mass deletes in RocksDB are a little odd due to tombstones. If an excess of tombstones becomes a problem during migration, it might be necessary to periodically call CompactRange too.
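A very rough sketch of what step 2's background walk could look like with the RocksDB Java API. isV3Key and toV4Key are placeholders for the real key/value decoding, and the batch size and checkpointing are just illustrative.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class V4Migration {
    static void migrate(RocksDB db) throws RocksDBException {
        try (RocksIterator it = db.newIterator();
             WriteOptions opts = new WriteOptions();
             WriteBatch batch = new WriteBatch()) {
            for (it.seekToFirst(); it.isValid(); it.next()) {
                byte[] key = it.key();
                if (!isV3Key(key)) continue;              // already migrated
                batch.put(toV4Key(key, it.value()), it.value());
                batch.delete(key);                        // leaves a tombstone for the v3 key
                if (batch.count() >= 10_000) {            // flush periodically
                    db.write(opts, batch);
                    batch.clear();
                    // TODO: persist the last migrated key so a restart can resume here
                }
            }
            db.write(opts, batch);
        }
        // After the walk: set the 'v4 migration complete' flag and, if tombstones
        // have piled up, consider db.compactRange() to clean them out.
    }

    static boolean isV3Key(byte[] key) { return true; }              // placeholder
    static byte[] toV4Key(byte[] key, byte[] value) { return key; }  // placeholder
}
```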
