Only check a sample of releases #346

Open
jpmckinney opened this issue Nov 9, 2021 · 0 comments
Labels
feature (Relating to loading data from the web API or CLI command), performance

jpmckinney commented Nov 9, 2021

lib-cove can only go so fast because Python's JSON Schema validators are all slow. (There are fast ones in Java and JavaScript.)

Given the size of OCDS datasets, it does not make sense to validate 100% of data. We should instead validate a sample.

I suppose the process would be for the analyst to set a sample rate. (If no sample rate is explicitly set, then the check step should be skipped.) Then, on receiving each message, the worker would "roll the dice" and process the message only if the roll succeeds according to the sample rate. With a sample rate of 0.1, for example, about 1 in 10 messages would be checked.
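A minimal sketch of the "roll the dice" step, assuming the sample rate is a probability between 0 and 1 and that `None` means no rate was set. The name `should_check` and its signature are hypothetical, not existing code in this project:

```python
import random
from typing import Optional


def should_check(sample_rate: Optional[float], rng: random.Random) -> bool:
    """Decide whether the worker should check the message it just received.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # No explicit sample rate: skip the check step entirely.
    if sample_rate is None:
        return False
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # rng.random() is uniform on [0.0, 1.0), so this is True with
    # probability sample_rate (never for 0.0, always for 1.0).
    return rng.random() < sample_rate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    messages = range(100_000)
    checked = sum(should_check(0.05, rng) for _ in messages)
    print(f"checked {checked} of {len(messages)} messages")
```

Because each message is decided independently, the worker needs no shared state, but the number of checked releases is only approximately `sample_rate` times the dataset size.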

We can provide guidance on an appropriate sample rate, based on the size of the dataset. For some datasets, it's possible to determine the size (because the API offers a count, for example). For others, we might need to rely on prior knowledge or other means.
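One possible form for that guidance, sketched under the assumption that we aim for a roughly fixed number of checked releases regardless of dataset size. The function name and the default target of 10,000 are placeholders, not recommendations:

```python
def suggested_sample_rate(dataset_size: int, target_sample_size: int = 10_000) -> float:
    """Suggest a sample rate that checks about target_sample_size releases.

    Hypothetical helper: dataset_size may come from an API count or from
    an estimate based on prior knowledge; the default target is a placeholder.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if target_sample_size <= 0:
        raise ValueError("target_sample_size must be positive")
    # Small datasets are checked in full; larger ones are sampled so that
    # the expected number of checked releases is target_sample_size.
    return min(1.0, target_sample_size / dataset_size)
```

For instance, a dataset of 1,000,000 releases with the placeholder target would get a rate of 0.01, while a dataset of 5,000 releases would be checked in full.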

jpmckinney added the "steps" label (Relating to specific steps (transforms)) Nov 9, 2021
jpmckinney added this to the V3 milestone Nov 9, 2021
jpmckinney removed this from the V3 milestone Jun 8, 2022
jpmckinney added the "feature" label (Relating to loading data from the web API or CLI command) and removed the "steps" label (Relating to specific steps (transforms)) Jun 8, 2022