Add scripts for NCBI Virus #16

Merged
merged 13 commits into from
Aug 29, 2023
1 change: 1 addition & 0 deletions README.md
@@ -73,6 +73,7 @@ NCBI interaction scripts that are useful for fetching public metadata and sequen

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) or [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.
- [ncbi-virus-url](ncbi-virus-url) - Generates the URL to download metadata and sequences from NCBI Virus as a single CSV file.

Potential Nextstrain CLI scripts

95 changes: 95 additions & 0 deletions ncbi-virus-url
@@ -0,0 +1,95 @@
#!/usr/bin/env python3
"""
Generate URL to download all virus sequences and their curated metadata for a
specified NCBI Taxon ID from GenBank via NCBI Virus.

The URL this program builds is based on the URL for SARS-CoV-2 constructed with

https://github.com/nextstrain/ncov-ingest/blob/2a5f255329ee5bdf0cabc8b8827a700c92becbe4/bin/genbank-url

and observing the network activity at

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide
"""
from urllib.parse import urlencode
from typing import List, Optional
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--ncbi-taxon-id", required=True,
        help="NCBI Taxon ID. Visit NCBI virus at " +
             "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/find-data/virus " +
             "to search for supported taxon IDs.")
    parser.add_argument("--filters", required=False, nargs="*",
        help="Filter criteria to add as `fq` param values. " +
             "Apply filters via the NCBI Virus UI and observe the network " +
             "activity to find the desired filter string.")
Comment on lines +24 to +27
Contributor Author

In our latest meeting with WA DOH, people were concerned whether the clean data that Nextstrain hosts at data.nextstrain.org/files might exclude certain records from NCBI.

Allowing customizable filters here might add to that concern, but there are already many other points in ingest pipelines that might filter out records. I think we just need to be better about documenting the filters that are used to generate the clean data.
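
One possible way to make the filters behind a dataset easier to document, sketched here as an illustration rather than as part of this PR: read the `fq` values back out of a generated URL and record them alongside the output. The helper name `list_fq_filters`, the example taxon ID, and the host filter string are hypothetical; only the endpoint and the first two `fq` patterns come from the script.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit


def list_fq_filters(url: str) -> List[str]:
    """Return every `fq` value in the URL's query string, in order."""
    return parse_qs(urlsplit(url).query).get("fq", [])


# Build a URL the same way the script does, with one extra (made-up) filter.
endpoint = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"
params = {
    "fq": [
        '{!tag=SeqType_s}SeqType_s:("Nucleotide")',
        "VirusLineageId_ss:(64320)",  # example taxon ID, chosen for illustration
        'Host_s:("Homo sapiens")',    # hypothetical filter string, not verified against NCBI Virus
    ],
    "q": "*:*",
}
url = f"{endpoint}?{urlencode(params, doseq=True, encoding='utf-8')}"

# Print one filter per line, e.g. for pasting into a README or a log.
for fq in list_fq_filters(url):
    print(fq)
```

Logging this list whenever the URL is generated would give a record of exactly which criteria were applied, without changing how the URL is built.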

    return parser.parse_args()

def build_query_url(ncbi_taxon_id: str, filters: Optional[List[str]]=None):
    """
    Generate URL to download all viral sequences and their curated metadata
    from GenBank via NCBI Virus.
    """
    endpoint = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"
    params = {
        # Search criteria
        'fq': [
            '{!tag=SeqType_s}SeqType_s:("Nucleotide")', # Nucleotide sequences (as opposed to protein)
            f'VirusLineageId_ss:({ncbi_taxon_id})',
            *(filters or []),
        ],

        # Unclear, but seems necessary.
        'q': '*:*',

        # Response format
        'cmd': 'download',
        'dlfmt': 'csv',
        'fl': ','.join(
            ':'.join(names) for names in [
                # Pairs of (output column name, source data field).
                ('genbank_accession', 'id'),
                ('genbank_accession_rev', 'AccVer_s'),
                ('database', 'SourceDB_s'),
                ('strain', 'Isolate_s'),
                ('region', 'Region_s'),
                ('location', 'CountryFull_s'),
                ('collected', 'CollectionDate_s'),
                ('submitted', 'CreateDate_dt'),
Contributor Author

Consider adding `updated` as a central output column.

Member

Would `updated` be the date the most recent revision was released, as opposed to `submitted`, which is the date of the first revision?

Contributor Author

Yup, that's my interpretation of the field. I can't find any official documentation though...

Contributor Author

Added in d304c91

                ('length', 'SLen_i'),
                ('host', 'Host_s'),
                ('isolation_source', 'Isolation_csv'),
                ('bioproject_accession', 'BioProject_s'),
                ('biosample_accession', 'BioSample_s'),
                ('sra_accession', 'SRALink_csv'),
                ('title', 'Definition_s'),
                ('authors', 'Authors_csv'),
                ('submitting_organization', 'SubmitterAffilFull_s'),
                ('publications', 'PubMed_csv'),
                ('sequence', 'Nucleotide_seq'),
            ]
        ),

        # Stable sort with GenBank accessions.
        # Columns are source data fields, not our output columns.
        'sort': 'id asc',

        # This isn't Entrez, but include the same email parameter it requires just
        # to be nice.
        'email': '[email protected]',
    }
    query = urlencode(params, doseq = True, encoding = "utf-8")

    print(f"{endpoint}?{query}")

def main():
    args = parse_args()
    build_query_url(
        ncbi_taxon_id=args.ncbi_taxon_id,
        filters=args.filters,
    )

if __name__ == '__main__':
    main()
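
To check which columns a generated URL will request without calling NCBI, the `fl` parameter can be decoded back into (output column, source field) pairs. This is a minimal sketch that assumes the comma-separated `column:field` layout the script builds; `parse_fl` is a hypothetical helper, and the column list is a trimmed subset of the script's.

```python
from typing import List, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit


def parse_fl(url: str) -> List[Tuple[str, str]]:
    """Split the `fl` query parameter into (output column, source field) pairs."""
    fl_values = parse_qs(urlsplit(url).query).get("fl", [])
    if not fl_values:
        return []
    pairs = []
    for item in fl_values[0].split(","):
        # Each item looks like "output_column:SourceField".
        column, field = item.split(":", 1)
        pairs.append((column, field))
    return pairs


# Trimmed subset of the script's column pairs, encoded the same way.
columns = [
    ("genbank_accession", "id"),
    ("length", "SLen_i"),
    ("sequence", "Nucleotide_seq"),
]
query = urlencode(
    {"fl": ",".join(":".join(names) for names in columns), "sort": "id asc"},
    doseq=True,
    encoding="utf-8",
)
url = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/?" + query

for column, field in parse_fl(url):
    print(f"{column} <- {field}")
```

Because `urlencode` percent-encodes the `:` and `,` separators, decoding with `parse_qs` first and splitting afterwards keeps the round trip exact.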