Update links for GISAID and GenBank accessions #485
Adds a rule to fetch GISAID and GenBank accession links from UCSC and uses the output in the `transform_genbank` and `transform_gisaid` rules. The UCSC file is modified to keep the headers and format exactly the same as the current `source-data/accessions.tsv.gz` file so that it can be directly replaced.
Brief look at the trial run output metadata
GISAID
GenBank
The number of removals ( |
Concatenates the downloaded accession links file to the existing `source-data/accessions.tsv.gz` since the downloaded file seems to be missing some of the existing links.¹ The transform scripts apply the accession links in order, so the last matching accession link is used in the final metadata. This allows us to default to the downloaded file which is the latest data. ¹ <#485 (comment)>
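The "last matching accession link wins" behaviour described in this commit can be sketched as below. This is only an illustration of the ordering idea, not the actual transform scripts: the column names `genbank_accession` and `gisaid_epi_isl`, the helper names, and the accession values are all made-up placeholders.

```python
import csv
import io


def read_links(text):
    """Parse tab-separated accession links into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def apply_accession_links(link_rows):
    """Build a lookup in which later rows override earlier ones."""
    lookup = {}
    for row in link_rows:
        # Hypothetical column names; a later row with the same key
        # replaces the earlier value, so the last match is kept.
        lookup[row["genbank_accession"]] = row["gisaid_epi_isl"]
    return lookup


# Placeholder data standing in for the existing file and the UCSC download.
existing = (
    "genbank_accession\tgisaid_epi_isl\n"
    "MT000001\tEPI_ISL_100\n"
    "MT000002\tEPI_ISL_200\n"
)
downloaded = (
    "genbank_accession\tgisaid_epi_isl\n"
    "MT000001\tEPI_ISL_999\n"
)

# Existing links first, downloaded links last, so the download takes priority
# while links missing from the download are still retained.
combined = read_links(existing) + read_links(downloaded)
links = apply_accession_links(combined)
```

Under these assumptions, `MT000001` resolves to the downloaded link, while `MT000002`, which is absent from the download, keeps its existing link.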
4d5a867 to fd01604
Looking at the new trial metadata GISAID
GenBank
I think it's reasonable to default to the latest accession links in the downloaded UCSC file. If there are accession links that should be manually corrected, we can do so with annotations.
Glad you're using the latest now and sorry I fell behind with the manual updates! Feel free to send me examples of any changes/omissions that look concerning. When the same sequence has been submitted multiple times to GISAID or GenBank (and especially when it has been submitted multiple times to both), there can be some instability from one run to the next regarding which of multiple possible IDs get paired up, which would explain some changes. Also, over time, some sequences have updates submitted, or are withdrawn from one repo or the other (or both).
Allows overrides of accessions if there is something that we want to manually correct in the accession links.
fd01604 to fcb407c
Thank you @AngieHinrichs, both the manual and automatic accession links are extremely helpful ❤️
This makes sense to me, but @trvrb please let me know if you'd like the order to be switched here so that we can have stable accession links, i.e. append the |
I think it makes sense, as you say, for the linkage to be the best current attempt, and it's okay if this is not fully stable.
If there's no other feedback, I'll merge this tomorrow morning before the automated run.
Description of proposed changes
Adds a rule to fetch GISAID and GenBank accession links from UCSC and uses the output in the `transform_genbank` and `transform_gisaid` rules. The UCSC file is modified to keep the headers and format exactly the same as the current `source-data/accessions.tsv.gz` file so that it can be directly replaced.
Related issue(s)
Resolves #484