resolving a surt-timestamp collision should not replace a non-revisit record with a revisit record #77
Oh, that's unfortunate. Pity we can't use merge operators from Java - doing a get before each put seems likely to have a significant effect on bulk data load performance. At least I guess we'd only need to do the lookup for revisits.
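The get-before-put guard described above could look roughly like this. This is a minimal sketch of the logic only: a `HashMap` stands in for the real RocksDB handle, and the class name, key layout, and "value starts with `revisit`" convention are all hypothetical, not OutbackCDX's actual format.

```java
import java.util.HashMap;
import java.util.Map;

class RevisitGuardIndex {
    // In-memory stand-in for the RocksDB column; key is "surt timestamp"
    private final Map<String, String> db = new HashMap<>();

    void put(String key, String value) {
        if (value.startsWith("revisit")) {
            // Only revisit puts need the extra get(), as suggested above
            String existing = db.get(key);
            if (existing != null && !existing.startsWith("revisit")) {
                return; // keep the concrete (e.g. status 200) record
            }
        }
        db.put(key, value);
    }

    String get(String key) {
        return db.get(key);
    }
}
```

With this guard, posting a revisit after a 200 record at the same key leaves the 200 record in place, while revisits with no concrete predecessor are still stored.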
We've also observed this with redirects. This is breaking things for us, and because we do large crawls spread over many WARCs, I'm not sure how to fix it. I guess maybe a second pass on the indexer that only posts 200s?
I guess another solution would be to move the original url into the database key to break the tie. That'd mean all of them would be returned in search results, and it'd then be up to wayback to do something sensible in the face of the collision.
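The tie-breaking key described above might be encoded along these lines. The space-delimited layout and field order here are assumptions for illustration, not OutbackCDX's actual on-disk key format; the point is just that distinct original URLs yield distinct keys while keys still sort by (SURT, timestamp).

```java
import java.nio.charset.StandardCharsets;

class CollisionKey {
    // Append the original URL to the key so two captures that collide on
    // (surt, timestamp) remain separate entries instead of clobbering.
    static byte[] encode(String surt, String timestamp, String originalUrl) {
        // Prefix stays (surt, timestamp), so range scans over a SURT and
        // timestamp prefix still work; the URL suffix only breaks ties.
        String key = surt + " " + timestamp + " " + originalUrl;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```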
FWIW, based on the not-wholly-unrelated ukwa/ukwa-pywb#53, I think... EDIT: but I guess we'd also need to pull in the MIME type to avoid the original
Well, galgeek's original example had two different URLs (player_id=1 vs player_id=2) that were only the same under fuzzy rules. But yes, an exact revisit within the same second does sound plausible, particularly in a browser-based crawling scenario. If we add the content-type, the question simply becomes: is it a problem if revisits clobber each other? I guess the real unique identifier for a record is the combination of warc filename and byte offset. If those are the same then it's the same record and we should clobber. If they differ then it's not, and we should list it separately, even if everything else is the same. So...
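The identity rule proposed above can be stated as a tiny predicate. This is a sketch only; the class and field names are hypothetical, but the logic matches the comment: same (WARC filename, byte offset) means the same underlying record.

```java
class RecordIdentity {
    // Two index entries refer to the same capture iff they point at the
    // same WARC file and byte offset.
    record CaptureLocation(String warcFilename, long byteOffset) {}

    static boolean shouldClobber(CaptureLocation existing, CaptureLocation incoming) {
        // Same location: same record, safe to overwrite.
        // Different location: a distinct capture, list it separately even
        // if the SURT, timestamp, and everything else match.
        return existing.equals(incoming);
    }
}
```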
How to handle the upgrade will need some thought. Edit: Changed v1/v2 to v3/v4 so we have one record version scheme rather than separate version numbers for keys and values. v0, v1, v2, and v3 all use the same key format.
I like this idea.
Making the system handle upgrades nicely is going to be fairly complicated. Would it help to stage the process, e.g. starting by implementing this as a start-up/CLI option where you can choose the longer key but you need to use separate index files, and looking into migration later on once we've got the thing working?
@anjackson see #78 for roughly what I had in mind (without migration logic). I haven't really tested it much yet.
I think we could implement an online upgrade by roughly this process:
I'm not sure what the performance or disk space implications are for doing this. Mass deletes in RocksDB are a little odd due to tombstones. If an excess of tombstones becomes a problem during migration it might be necessary to periodically call CompactRange too. |
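The periodic-compaction idea mentioned above could be structured roughly like this. The `Store` interface stands in for the real RocksDB handle (where `delete` would be `RocksDB#delete` and `compactRange` would be `RocksDB#compactRange`), and the batch size is an arbitrary assumption for illustration.

```java
import java.util.List;

class MigrationCompactor {
    // Stand-in for the RocksDB handle so the sketch is self-contained.
    interface Store {
        void delete(String key);
        void compactRange();
    }

    // Delete old-format keys during migration, compacting every
    // `compactEvery` deletes so tombstones don't accumulate.
    // Returns the number of compactions triggered.
    static int migrate(Store store, List<String> oldKeys, int compactEvery) {
        int compactions = 0;
        int deleted = 0;
        for (String key : oldKeys) {
            store.delete(key); // each delete leaves a tombstone until compaction
            deleted++;
            if (deleted % compactEvery == 0) {
                store.compactRange();
                compactions++;
            }
        }
        return compactions;
    }
}
```

Whether the compaction interval should be count-based like this, or driven by observed read slowdown, would need measuring against a real index.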
I observed this with fuzzy match handling; I'm not sure whether this is an issue for other records.
The original records:
Only one record returned from outbackcdx after ingest:
Note that I was able to update this record properly by re-posting only the status 200 record.