-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle invalid dates #83
Comments
@kris-sigur reports that truncated timestamps still cause problems with sorting: |
I suspect you have avoided the sorting problems so far as generally shorter timestamps = older timestamps. Unfortunately, the data I have from IA comes from a mix of crawls over time. I suggest padding out shorter timestamps on ingest. As for longer timestamps ... you can get all sorts of weirdness if you start truncating them. The order you feed them in will determine how collisions are handled, for example. So I suggest rejecting those exceeding 14 characters for now. It is also worth noting that longer timestamps are poorly defined at the moment. The ISO standard has milliseconds as a decimal fraction of the second, and the number of digits being undefined. So probably best to treat timestamps over 14 chars as invalid for now. I've written up something along those lines and I'll submit a PR later today, or tomorrow, once I've tested it out a bit more. As for existing indexes. I'm deleting all the records from mine before updating. It is easy in my case. I know which set of CDXs they come from. A full rebuild wouldn't have been overly onerous either. I'm sure this varies. I expect padding on ingest wont break the older indexes, but it will likely become impossible to delete those older items if the delete function also runs through the same padding procedure. One possible option (no idea how practical this is), would be to have a housekeeping task that looks for timestamps that are "under" 14 digits long and then multiply them with 10, 100 etc. depending on the range they fall in and updates the index. I guess this is only really viable if you can do a query based solely on the timestamp. |
@galgeek Any thoughts on this? Kris' PR seems reasonable to me but I don't have any indexes with truncated dates so I'm not really affected by this change. |
apologies for my delay in responding, and Kris' PR looks fine to me! I've extracted most, if not all, of AIT's shorter timestamps, padded them with 0, to 14 chars, and reingested. |
Currently we allow out of range dates to be inserted. This doesn't normally cause problems as the dates aren't parsed. However the access points system does need interpret the dates which causes queries to return no results.
One option is to pad/truncate them to something parsable in #82. Another is to just outright reject them upfront.
The text was updated successfully, but these errors were encountered: