Instead of the chart on the Overview page, we can have a dataset-level check for release date repetition.
Should the granularity for repetition be day, week, or month?
A common issue is to use the request time as the release date. If a spider is slow, it might take several days. In that case, grouping by week might be better (of course, it's still possible that the spider will straddle two weeks depending on when it started, but that problem will also occur if we group by month, though less frequently).
If we have a lot of spiders that take more than a week, we might want to group by month. I don't know if there is a real risk of causing false positives if we use a monthly granularity.
Another common issue is to use the creation time as the release date: for example, when creating historical releases. If the publisher's export process is slow, this could also take several days.
For the first case, we have two real examples: Argentina Vialidad (they have only one bulk file, so the same date appears on all the releases) and Paraguay Hacienda (slow API, but still only two different days: https://data.open-contracting.org/es/publication/62).
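As a starting point for discussion, here is a minimal sketch of what the check could look like: truncate each release's `date` (a standard OCDS field) to the chosen period, then flag the dataset if too many releases land in a single period. The `REPETITION_THRESHOLD` value and the return shape are hypothetical, just to make the idea concrete.

```python
from collections import Counter
from datetime import datetime

# Hypothetical threshold: flag the dataset if more than half of all
# releases fall into a single period. The actual value is open for discussion.
REPETITION_THRESHOLD = 0.5


def truncate(date_string, granularity):
    """Truncate an ISO 8601 release date to the chosen period."""
    # Python < 3.11 fromisoformat() doesn't accept a trailing "Z".
    date = datetime.fromisoformat(date_string.replace("Z", "+00:00"))
    if granularity == "day":
        return date.strftime("%Y-%m-%d")
    if granularity == "week":
        return date.strftime("%G-W%V")  # ISO year and ISO week number
    if granularity == "month":
        return date.strftime("%Y-%m")
    raise ValueError(f"unknown granularity: {granularity}")


def release_date_repetition(releases, granularity="week"):
    """Return the most repeated period and its share of all dated releases."""
    counts = Counter(
        truncate(release["date"], granularity)
        for release in releases
        if release.get("date")
    )
    if not counts:
        return None
    period, count = counts.most_common(1)[0]
    share = count / sum(counts.values())
    return {"period": period, "share": share, "failed": share > REPETITION_THRESHOLD}
```

Note that using ISO weeks (`%G-W%V`) keeps the weekly grouping stable across year boundaries, so a slow spider running over New Year's Eve wouldn't be split into two periods by the calendar year alone.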
Should we implement this check in isolation, or combine it with repetition checks of the dates from: the contracting process timeline, milestone dates, amendment dates and document dates?
cc @yolile for input on the methodology to use.