Add scripts for NCBI Virus #16
Changes from 4 commits
@@ -0,0 +1,95 @@
#!/usr/bin/env python3
"""
Generate URL to download all virus sequences and their curated metadata for a
specified NCBI Taxon ID from GenBank via NCBI Virus.

The URL this program builds is based on the URL for SARS-CoV-2 constructed with

    https://github.com/nextstrain/ncov-ingest/blob/2a5f255329ee5bdf0cabc8b8827a700c92becbe4/bin/genbank-url

and observing the network activity at

    https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide
"""
from urllib.parse import urlencode
from typing import List, Optional
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--ncbi-taxon-id", required=True,
        help="NCBI Taxon ID. Visit NCBI virus at " +
            "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/find-data/virus " +
            "to search for supported taxon IDs.")
    parser.add_argument("--filters", required=False, nargs="*",
        help="Filter criteria to add as `fq` param values. " +
            "Apply filters via the NCBI Virus UI and observe the network " +
            "activity to find the desired filter string.")
    return parser.parse_args()

def build_query_url(ncbi_taxon_id: str, filters: Optional[List[str]]=None):
    """
    Generate URL to download all viral sequences and their curated metadata
    from GenBank via NCBI Virus.
    """
    endpoint = "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"
    params = {
        # Search criteria
        'fq': [
            '{!tag=SeqType_s}SeqType_s:("Nucleotide")', # Nucleotide sequences (as opposed to protein)
            f'VirusLineageId_ss:({ncbi_taxon_id})',
            *(filters or []),
        ],

        # Unclear, but seems necessary.
        'q': '*:*',

        # Response format
        'cmd': 'download',
        'dlfmt': 'csv',
        'fl': ','.join(
            ':'.join(names) for names in [
                # Pairs of (output column name, source data field).
                ('genbank_accession', 'id'),
                ('genbank_accession_rev', 'AccVer_s'),
                ('database', 'SourceDB_s'),
                ('strain', 'Isolate_s'),
                ('region', 'Region_s'),
                ('location', 'CountryFull_s'),
                ('collected', 'CollectionDate_s'),
                ('submitted', 'CreateDate_dt'),
Review comments:
Consider adding
Would
Yup, that's my interpretation of the field. I can't find any official documentation though...
Added in d304c91
                ('length', 'SLen_i'),
                ('host', 'Host_s'),
                ('isolation_source', 'Isolation_csv'),
                ('bioproject_accession', 'BioProject_s'),
                ('biosample_accession', 'BioSample_s'),
                ('sra_accession', 'SRALink_csv'),
                ('title', 'Definition_s'),
                ('authors', 'Authors_csv'),
                ('submitting_organization', 'SubmitterAffilFull_s'),
                ('publications', 'PubMed_csv'),
                ('sequence', 'Nucleotide_seq'),
            ]
        ),

        # Stable sort with GenBank accessions.
        # Columns are source data fields, not our output columns.
        'sort': 'id asc',

        # This isn't Entrez, but include the same email parameter it requires just
        # to be nice.
        'email': '[email protected]',
    }
    query = urlencode(params, doseq = True, encoding = "utf-8")

    print(f"{endpoint}?{query}")

def main():
    args = parse_args()
    build_query_url(
        ncbi_taxon_id=args.ncbi_taxon_id,
        filters=args.filters,
    )

if __name__ == '__main__':
    main()
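For readers following along, here is a minimal sketch of how the `fq` list and the `fl` column pairs end up in the query string. With `doseq=True`, `urlencode` emits one `fq=` parameter per list element rather than encoding the list as a single value. The taxon ID and the extra host filter below are illustrative assumptions, not values taken from this PR, and the filter syntax has not been checked against NCBI Virus.

```python
from urllib.parse import urlencode, parse_qs

# Illustrative values only: the taxon ID and the Host_s filter string are
# assumptions for this sketch, not values confirmed by the PR.
params = {
    "fq": [
        '{!tag=SeqType_s}SeqType_s:("Nucleotide")',
        "VirusLineageId_ss:(64320)",
        'Host_s:("Homo sapiens")',  # hypothetical extra filter
    ],
    "q": "*:*",
    # Same "output_name:source_field" pairing as in the script, shortened.
    "fl": ",".join(
        ":".join(pair)
        for pair in [("genbank_accession", "id"), ("length", "SLen_i")]
    ),
}

# doseq=True expands the list into repeated fq=... parameters.
query = urlencode(params, doseq=True, encoding="utf-8")

# Decoding the query string recovers the original values.
decoded = parse_qs(query)
print(query)
```

Each entry passed to `--filters` would therefore appear as its own `fq` parameter in the final URL.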
In our latest meeting with WA DOH, people were concerned that the clean data Nextstrain hosts at data.nextstrain.org/files might exclude certain records from NCBI.
Allowing customizable filters here might add to that concern, but there are already many other points in ingest pipelines that might filter out records. I think we just need to be better about documenting the filters used to generate the clean data.
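One possible way to make the applied filters easier to document is to read them back out of a generated URL. This is only a sketch: `extract_filters` is a hypothetical helper that is not part of this PR, and the example URL uses an illustrative taxon ID.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def extract_filters(url: str) -> List[str]:
    """Return the decoded `fq` values from a URL (hypothetical helper)."""
    query = urlsplit(url).query
    # parse_qs collects repeated parameters into a list per key.
    return parse_qs(query).get("fq", [])


# Illustrative URL; the taxon ID is an example value, not one from the PR.
example_url = (
    "https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/vvsearch2/"
    "?fq=VirusLineageId_ss%3A%2864320%29&q=%2A%3A%2A&cmd=download"
)

for fq in extract_filters(example_url):
    print(fq)
```

The printed list could be written next to the downloaded data so that users can see which `fq` criteria were in effect.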