Update links for GISAID and GenBank accessions #485

Merged
joverlee521 merged 3 commits into master from update-accession-links
Dec 20, 2024

@joverlee521 (Contributor) commented Dec 17, 2024

Description of proposed changes

Adds a rule to fetch GISAID and GenBank accession links from UCSC
and uses the output in the `transform_genbank` and `transform_gisaid`
rules.

The UCSC file is modified to keep the headers and format exactly the
same as the current `source-data/accessions.tsv.gz` file so that it can
serve as a drop-in replacement for that file.

Related issue(s)

Resolves #484

@joverlee521 (Contributor, Author)

Brief look at the trial run output metadata

  1. I subset to the gisaid_epi_isl and genbank_accession columns, only including rows where both are not "?".

    zstdcat open-accession-metadata.tsv.zst \
      | csvtk -t cut -f gisaid_epi_isl,genbank_accession \
      | csvtk -t filter2 -f '$genbank_accession != "?"' \
      | csvtk -t sort -k genbank_accession > open-accessions.tsv

  2. I used daff to compare the trial data with the prod data to get a summary of changes:

GISAID

@@ count
+++ 2141069
--- 81421
-> 755

GenBank

@@ count
+++ 1248778
--- 7964
-> 2424

The number of removals (---) and changes (->) makes me think that we might want to keep the original source-data/accessions.tsv.gz and just append the UCSC file.
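
To make the three counts in the summaries above concrete, here is a minimal Python sketch of the same bookkeeping. It assumes each file has already been reduced to a mapping from one accession to its linked accession. The function name and the sample rows are hypothetical, and daff's own row matching is more involved than this.

```python
def summarize_changes(prod: dict[str, str], trial: dict[str, str]) -> dict[str, int]:
    """Count links added, removed, and changed between two key -> value maps."""
    # Keys present only in the trial data are additions (+++).
    added = sum(1 for key in trial if key not in prod)
    # Keys present only in the prod data are removals (---).
    removed = sum(1 for key in prod if key not in trial)
    # Keys present in both, but linked to a different value, are changes (->).
    changed = sum(1 for key in trial if key in prod and prod[key] != trial[key])
    return {"+++": added, "---": removed, "->": changed}


# Made-up example rows, not real accession links.
prod = {"EPI_ISL_1": "MW000001", "EPI_ISL_2": "MW000002", "EPI_ISL_3": "MW000003"}
trial = {"EPI_ISL_1": "MW000001", "EPI_ISL_2": "MW999999", "EPI_ISL_4": "MW000004"}

summary = summarize_changes(prod, trial)
print(summary)
```

Under this reading, a removal means a link that exists in prod has no counterpart in the trial output, which is the case that motivates keeping the original file.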

joverlee521 added a commit that referenced this pull request Dec 17, 2024
Concatenates the downloaded accession links file to the existing
`source-data/accessions.tsv.gz` since the downloaded file seems to be
missing some of the existing links.¹

The transform scripts apply the accession links in order, so the
last matching accession link is used in the final metadata. This allows
us to default to the downloaded file which is the latest data.

¹ <#485 (comment)>
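
The ordering behavior described above can be sketched in Python as follows. This is an illustration only, not the transform scripts themselves: the function name, the two-column layout, and the sample rows are assumptions, and the real `source-data/accessions.tsv.gz` headers may differ.

```python
import csv
import io


def load_accession_links(*sources):
    """Read TSV sources in order; a later row for the same key overwrites an earlier one."""
    links = {}
    for source in sources:
        reader = csv.DictReader(source, delimiter="\t")
        for row in reader:
            # Plain dict assignment gives "last matching link wins".
            links[row["genbank_accession"]] = row["gisaid_epi_isl"]
    return links


# Made-up rows standing in for the existing file and the downloaded file.
existing = "genbank_accession\tgisaid_epi_isl\nMW000001\tEPI_ISL_1\nMW000002\tEPI_ISL_2\n"
downloaded = "genbank_accession\tgisaid_epi_isl\nMW000002\tEPI_ISL_20\nMW000003\tEPI_ISL_3\n"

# Existing links first, downloaded links last, so the downloaded link is kept on conflict.
links = load_accession_links(io.StringIO(existing), io.StringIO(downloaded))
print(links)
```

In this sketch, the link found only in the existing file is preserved, while the conflicting link takes its value from the downloaded file. Passing the sources in the opposite order would give the existing file precedence instead.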
@joverlee521 joverlee521 force-pushed the update-accession-links branch from 4d5a867 to fd01604 Compare December 17, 2024 23:54
@joverlee521 (Contributor, Author)

Looking at the new trial metadata

GISAID

@@ count
+++ 2141428
-> 613

GenBank

@@ count
+++ 1249453
-> 2204

I think it's reasonable to default to the latest accession links in the downloaded UCSC file. If there are accession links that should be manually corrected, we can do so with annotations.

@joverlee521 joverlee521 marked this pull request as ready for review December 18, 2024 21:27
@AngieHinrichs (Contributor)

Glad you're using the latest now and sorry I fell behind with the manual updates! Feel free to send me examples of any changes/omissions that look concerning. When the same sequence has been submitted multiple times to GISAID or GenBank (and especially when it has been submitted multiple times to both), there can be some instability from one run to the next regarding which of multiple possible IDs get paired up, which would explain some changes. Also, over time, some sequences have updates submitted, or are withdrawn from one repo or the other (or both).

Allows overrides of accessions if there is something that we want to
manually correct in the accession links.
@joverlee521 joverlee521 force-pushed the update-accession-links branch from fd01604 to fcb407c Compare December 19, 2024 19:56
@joverlee521 (Contributor, Author)

Thank you @AngieHinrichs, both the manual and automatic accession links are extremely helpful ❤️

> When the same sequence has been submitted multiple times to GISAID or GenBank (and especially when it has been submitted multiple times to both), there can be some instability from one run to the next regarding which of multiple possible IDs get paired up, which would explain some changes. Also, over time, some sequences have updates submitted, or are withdrawn from one repo or the other (or both).

This makes sense to me, but @trvrb please let me know if you'd like the order to be switched here so that we can have stable accession links, i.e. append the `source-data/accessions.tsv.gz` file to the downloaded file so that the hard-coded links in the repo take precedence over the automated ones.

@trvrb (Member) commented Dec 19, 2024

I think it makes sense, as you say, to have the linkage be the best current attempt at linkage, and it's okay if this is not fully stable.

@joverlee521 (Contributor, Author)

If there's no other feedback, I'll merge this tomorrow morning before the automated run.

@joverlee521 joverlee521 merged commit e359b7e into master Dec 20, 2024
1 check passed
@joverlee521 joverlee521 deleted the update-accession-links branch December 20, 2024 17:40