
Handle invalid dates #83

Open · ato opened this issue May 18, 2020 · 4 comments

Comments

@ato (Member) commented May 18, 2020

Currently we allow out-of-range dates to be inserted. This doesn't normally cause problems, as the dates aren't parsed. However, the access points system does need to interpret the dates, which causes queries to return no results.

One option is to pad/truncate them to something parsable in #82. Another is to simply reject them upfront.
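For illustration, here's a rough sketch of the two options, assuming the usual 14-digit YYYYMMDDHHMMSS timestamps (the helper names are hypothetical, not actual OutbackCDX code):

```java
// Illustrative sketch only; names and behavior are hypothetical, not OutbackCDX's API.

// Option 1: coerce the value to something parsable by truncating or
// right-padding with zeros until it is exactly 14 digits.
static String coerce(String ts) {
    if (ts.length() > 14) return ts.substring(0, 14);
    StringBuilder sb = new StringBuilder(ts);
    while (sb.length() < 14) sb.append('0');
    return sb.toString();
}

// Option 2: reject invalid timestamps upfront at insert time.
static void validate(String ts) {
    if (ts.length() != 14 || !ts.chars().allMatch(Character::isDigit)) {
        throw new IllegalArgumentException("invalid timestamp: " + ts);
    }
}
```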

@ato (Member, Author) commented Nov 18, 2020

@kris-sigur reports that truncated timestamps still cause problems with sorting:
[screenshot: records with short timestamps sorting incorrectly]
We might need to pad them at indexing time rather than just at query time. That leaves the question of what to do about existing indexes, though.

@kris-sigur (Contributor) commented

I suspect you've avoided the sorting problems so far because, generally, shorter timestamps are older timestamps. Unfortunately, the data I have from IA comes from a mix of crawls over time.

I suggest padding out shorter timestamps on ingest. As for longer timestamps... you can get all sorts of weirdness if you start truncating them; the order you feed them in determines how collisions are handled, for example. So I suggest rejecting those exceeding 14 characters for now.

It is also worth noting that longer timestamps are poorly defined at the moment. The ISO standard expresses milliseconds as a decimal fraction of the second, with the number of digits undefined. So it's probably best to treat timestamps over 14 characters as invalid for now.
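In sketch form, the rule would look something like this (illustrative only; the actual PR may differ):

```java
// Sketch of the ingest-time rule above (illustrative only, not the PR code):
// right-pad short timestamps with zeros so keys sort consistently, and
// treat anything over 14 digits as invalid.
static String normalizeOnIngest(String timestamp) {
    if (timestamp.length() > 14) {
        throw new IllegalArgumentException("timestamp exceeds 14 digits: " + timestamp);
    }
    StringBuilder sb = new StringBuilder(timestamp);
    while (sb.length() < 14) sb.append('0');
    return sb.toString();
}
```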

I've written up something along those lines and I'll submit a PR later today or tomorrow, once I've tested it out a bit more.

As for existing indexes: I'm deleting all the affected records from mine before updating. That's easy in my case, since I know which set of CDXs they come from. A full rebuild wouldn't have been overly onerous either, but I'm sure this varies.

I expect padding on ingest won't break the older indexes, but it will likely become impossible to delete those older items if the delete function also runs through the same padding procedure.

One possible option (no idea how practical this is) would be a housekeeping task that looks for timestamps under 14 digits long, multiplies them by 10, 100, etc. depending on the range they fall in, and updates the index. I guess this is only really viable if you can do a query based solely on the timestamp.
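The arithmetic itself is simple, assuming the timestamp is stored numerically (a sketch; `padTimestamp` is a hypothetical name, and I don't know offhand how the keys are actually stored):

```java
// Sketch of the power-of-ten fix-up (hypothetical; assumes the timestamp is
// stored as a long). A d-digit value (d < 14) is multiplied by 10^(14 - d),
// which is equivalent to right-padding it with zeros.
static long padTimestamp(long ts) {
    final long MIN_14_DIGIT = 10_000_000_000_000L; // smallest 14-digit number
    while (ts > 0 && ts < MIN_14_DIGIT) {
        ts *= 10;
    }
    return ts;
}
// e.g. padTimestamp(20200518L) == 20200518000000L
```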

@ato (Member, Author) commented Nov 18, 2020

@galgeek Any thoughts on this? Kris' PR seems reasonable to me, but I don't have any indexes with truncated dates, so I'm not really affected by this change.

@galgeek (Contributor) commented Dec 8, 2020

Apologies for my delay in responding; Kris' PR looks fine to me!

I've extracted most, if not all, of AIT's shorter timestamps, padded them with zeros to 14 characters, and reingested them.
