Ignore cache if Nextclade or dataset version is different #466
Tested that use-nextclade-cache works as expected:
- Setting up the trial S3 URL: uploaded a nextclade.tsv to the trial S3 URL that includes the current Nextclade version and Nextclade dataset version.
- Testing the workflow would use cache as expected: ran workflow up to the
- Testing the renew flag still works as expected
- Testing the Nextclade version check works as expected
- Testing the Nextclade dataset version check works as expected
Doing this in preparation for adding version checks to the decision tree of whether we should use the Nextclade cache. Replaces download of the empty .renew file with just a check that the S3 object exists to limit shuffling of files.
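A minimal sketch of such an existence check, assuming the AWS CLI is installed and configured. The function name `s3_object_exists` is hypothetical and not taken from the workflow. `aws s3api head-object` requests only the object's metadata, so no file is downloaded.

```shell
# Hypothetical helper: succeed (exit 0) if the object at an s3:// URL exists.
# Assumes the AWS CLI is installed and credentials are configured.
s3_object_exists() {
    url="${1#s3://}"       # strip the scheme, leaving bucket/key
    bucket="${url%%/*}"    # everything before the first slash
    key="${url#*/}"        # everything after the first slash
    # head-object fetches metadata only and exits non-zero when the lookup fails.
    aws s3api head-object --bucket "$bucket" --key "$key" >/dev/null 2>&1
}
```

A caller could then branch on `if s3_object_exists "$renew_url"; then ...; fi`, where `renew_url` is a placeholder for wherever the renew flag lives. Note that the check also fails on permission errors, so a non-zero exit means "missing or inaccessible" rather than strictly "missing".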
Currently checks Nextclade and dataset versions of the first row of the nextclade.tsv file and formats them as the proposed JSON. Once the version JSON file is in place, it should be easy to swap out the check for the new file.
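The first-row check could be sketched roughly as follows. This is an illustration, not the workflow's actual script: the function name `versions_from_tsv` is hypothetical, the column names `nextclade_version` and `dataset_version` are the ones queried elsewhere in this thread, and the JSON keys are an assumption because the proposed format is not reproduced here.

```shell
# Hypothetical helper: print the versions found in the first data row of an
# uncompressed Nextclade TSV as a one-line JSON object.
# The JSON key names are assumptions, not the format proposed for the workflow.
versions_from_tsv() {
    awk -F'\t' '
        NR == 1 {
            # Map header names to column positions.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("nextclade_version" in col) || !("dataset_version" in col)) {
                status = 1
                exit
            }
            next
        }
        NR == 2 {
            # Only the first data row is inspected; values are not JSON-escaped.
            printf "{\"nextclade_version\": \"%s\", \"dataset_version\": \"%s\"}\n", \
                $(col["nextclade_version"]), $(col["dataset_version"])
            found = 1
            exit
        }
        END { exit (status || !found) }
    ' "$@"
}
```

The function exits non-zero when either column is missing or the file has no data rows, so a caller can treat that case as "no usable cache". Reading only the first row is cheap, but it relies on every row in the file having been produced by the same Nextclade and dataset versions.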
Document why we are not using `set -euo pipefail`
Co-authored-by: John SJ Anderson
Avoids clash of downloaded Nextclade executable with the Nextclade command available in the environment. Includes the side-effect of the downloaded executable being removed as part of `bin/clean` when running the workflow without the `keep_all_files=True` config param. This ensures that the workflow will start from a clean slate.
I plan to merge this on Monday so I can monitor the workflows during the week.
Confirmed that yesterday's runs completed successfully after the updates (GenBank and GISAID). Confirmed that the Nextclade TSVs all contain a single version of Nextclade and the dataset:

```
$ aws s3 cp s3://nextstrain-ncov-private/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version dataset_version
nextclade 3.8.2 2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-ncov-private/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version dataset_version
nextclade 3.8.2 2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version dataset_version
nextclade 3.8.2 2024-07-17--12-57-03Z
$ aws s3 cp s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst - | zstd -T0 -dcq | tsv-select -H -f nextclade_version,dataset_version | tsv-uniq
nextclade_version dataset_version
nextclade 3.8.2 2024-07-17--12-57-03Z
```
There hadn't been a release of Nextclade or the SARS-CoV-2 Nextclade dataset since this was merged, so I wasn't able to fully confirm this was working in production... There was a release of the SARS-CoV-2 dataset on 2024-09-25, and on 2024-09-26 the automated workflows for GISAID and GenBank both ignored the cache and did a full Nextclade run as expected 🎉
Updating instructions as a follow up to #466.
Description of proposed changes
Update workflow to ignore the Nextclade cache if the current Nextclade version or the Nextclade dataset version is different than the version in the cache.
Currently checks Nextclade and dataset versions of the first row of the nextclade.tsv file and formats them as the proposed JSON in #458. Once the version JSON file is in place, it should be easy to swap out the check for the new file.
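The resulting decision can be sketched as a small shell predicate. This is a simplified illustration under stated assumptions, not the workflow's actual logic: the function name `should_use_cache` and its argument order are hypothetical, and how the current and cached versions are obtained is left to the caller.

```shell
# Hypothetical helper: succeed (exit 0) only when the cache is safe to reuse.
#   $1  "true" if the renew flag is present, anything else otherwise
#   $2  cached Nextclade version     $3  current Nextclade version
#   $4  cached dataset version       $5  current dataset version
should_use_cache() {
    # An explicit renew request always wins.
    if [ "$1" = "true" ]; then
        return 1
    fi
    # A missing or different Nextclade version invalidates the cache.
    if [ -z "$2" ] || [ "$2" != "$3" ]; then
        return 1
    fi
    # A missing or different dataset version invalidates the cache.
    if [ -z "$4" ] || [ "$4" != "$5" ]; then
        return 1
    fi
    return 0
}
```

Treating an empty cached version as a mismatch means an unreadable or old-format cache triggers a full Nextclade run instead of being silently reused.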
Related issue(s)
Resolves #457
Checklist