Add scripts for NCBI Virus #16
Merged
13 commits (all by joverlee521)
9e32477 Copy ncbi-virus-url from monkeypox
ff44483 ncbi-virus-url: Remove references to mpox
2e91981 ncbi-virus-url: update help message for `--ncbi-taxon-id`
d98fcb7 ncbi-virus-url: Add `--filters` option
f952648 ncbi-virus-url: Add `--fields` option
a20c086 Add example of all fields for NCBI Virus
0065a38 Copy csv-to-ndjson from ncov-ingest
39a3364 Copy fetch-from-ncbi-virus from monkeypox
9d0e4e4 fetch-from-ncbi-virus: use ncbi-virus-url
a44f954 fetch-from-ncbi-virus: Add usage docs
ec2a692 csv-to-ndjson: Add usage docs
c5ad366 ncbi-virus-url: Add `updated` as an output column
aee5613 fetch-from-ncbi-virus: Add github_repo argument
@@ -0,0 +1,15 @@
#!/usr/bin/env python3
"""
Convert CSV on stdin to NDJSON on stdout.

usage: `cat dummy.csv | ./csv-to-ndjson > dummy.ndjson`
"""
import csv
import json
from sys import stdin, stdout

# 200 MiB; default is 128 KiB
csv.field_size_limit(200 * 1024 * 1024)

for row in csv.DictReader(stdin):
    json.dump(row, stdout, allow_nan = False, indent = None, separators = ',:')
    print()
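To make the script's behavior concrete, here is a minimal sketch of the same conversion applied to an in-memory string instead of stdin. The function name `csv_to_ndjson` and the two-column sample input are hypothetical and are not part of the PR.

```python
import csv
import io
import json


def csv_to_ndjson(csv_text: str) -> str:
    """Return one compact JSON object per CSV row, each followed by a newline."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Same options as the script: compact separators, NaN rejected.
        json.dump(row, out, allow_nan=False, indent=None, separators=(",", ":"))
        out.write("\n")
    return out.getvalue()


# Hypothetical sample input, not real NCBI Virus output.
sample = "accession,length\nNC_045512,29903\n"
result = csv_to_ndjson(sample)
print(result, end="")
```

Because `csv.DictReader` yields every value as a string, numeric columns such as `length` arrive in the NDJSON as strings; any type conversion would have to happen downstream.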
@@ -0,0 +1,292 @@
{
  "ExportDate_dt": "2023-08-08T21:02:01.475Z",
  "QualNum_i": 0,
  "QualPct_d": 0.0,
  "IncompleteCdsCnt_i": 0,
  "gi_l": 1798174254,
  "Host_s": "Homo sapiens",
  "HostSpecies_s": "Homo sapiens (human), taxid:9606|",
  "HostLineage_ss": [
    "cellular organisms, taxid:131567| biota",
    "Eukaryota (eucaryotes), taxid:2759| eukaryotes Eucarya Eucaryotae Eukarya Eukaryotae",
    "Opisthokonta, taxid:33154| Fungi/Metazoa group opisthokonts",
    "Metazoa (metazoans), taxid:33208| multicellular animals Animalia animals",
    "Eumetazoa, taxid:6072|",
    "Bilateria, taxid:33213|",
    "Deuterostomia (deuterostomes), taxid:33511|",
    "Chordata (chordates), taxid:7711|",
    "Craniata, taxid:89593|",
    "Vertebrata (vertebrates), taxid:7742|",
    "Gnathostomata (jawed vertebrates), taxid:7776|",
    "Teleostomi, taxid:117570|",
    "Euteleostomi (bony vertebrates), taxid:117571|",
    "Sarcopterygii, taxid:8287|",
    "Dipnotetrapodomorpha, taxid:1338369|",
    "Tetrapoda (tetrapods), taxid:32523|",
    "Amniota (amniotes), taxid:32524|",
    "Mammalia (mammals), taxid:40674|",
    "Theria, taxid:32525|",
    "Eutheria (placentals), taxid:9347| eutherian mammals placental mammals Placentalia",
    "Boreoeutheria, taxid:1437010| Boreotheria",
    "Euarchontoglires, taxid:314146|",
    "Primates, taxid:9443| Primata primates",
    "Haplorrhini, taxid:376913|",
    "Simiiformes, taxid:314293| Anthropoidea",
    "Catarrhini, taxid:9526|",
    "Hominoidea (apes), taxid:314295| ape",
    "Hominidae (great apes), taxid:9604| Pongidae",
    "Homininae, taxid:207598| Homo/Pan/Gorilla group",
    "Homo (humans), taxid:9605|",
    "Homo sapiens (human), taxid:9606|"
  ],
  "HostLineageId_ss": [
    "131567",
    "2759",
    "33154",
    "33208",
    "6072",
    "33213",
    "33511",
    "7711",
    "89593",
    "7742",
    "7776",
    "117570",
    "117571",
    "8287",
    "1338369",
    "32523",
    "32524",
    "40674",
    "32525",
    "9347",
    "1437010",
    "314146",
    "9443",
    "376913",
    "314293",
    "9526",
    "314295",
    "9604",
    "207598",
    "9605",
    "9606"
  ],
  "Locus_s": "NC_045512",
  "OrgId_i": 2697049,
  "VirusFamily_s": "Coronaviridae",
  "VirusGenus_s": "Betacoronavirus",
  "VirusSpecies_s": "Severe acute respiratory syndrome-related coronavirus",
  "VirusSpeciesId_i": 694009,
  "VirusLineage_ss": [
    "Viruses, taxid:10239| Vira Viridae viruses",
    "Riboviria (RNA viruses), taxid:2559587| RNA viruses and viroids",
    "Orthornavirae, taxid:2732396|",
    "Pisuviricota, taxid:2732408|",
    "Pisoniviricetes, taxid:2732506|",
    "Nidovirales, taxid:76804|",
    "Cornidovirineae, taxid:2499399|",
    "Coronaviridae, taxid:11118|",
    "Orthocoronavirinae, taxid:2501931|",
    "Betacoronavirus, taxid:694002| Coronavirus",
    "Sarbecovirus, taxid:2509511|",
    "Severe acute respiratory syndrome-related coronavirus, taxid:694009| HCoV-SARS SARS SARSr-CoV SARSrCoV",
    "Severe acute respiratory syndrome coronavirus 2, taxid:2697049| SARS-CoV-2",
    "RNA viruses"
  ],
  "VirusLineageId_ss": [
    "10239",
    "2559587",
    "2732396",
    "2732408",
    "2732506",
    "76804",
    "2499399",
    "11118",
    "2501931",
    "694002",
    "2509511",
    "694009",
    "2697049"
  ],
  "VirusL0_s": "RNA viruses",
  "VirusL1_s": "Orthornavirae, taxid:2732396",
  "VirusL2_s": "Pisuviricota, taxid:2732408",
  "VirusL3_s": "Pisoniviricetes, taxid:2732506",
  "VirusL4_s": "Nidovirales, taxid:76804",
  "VirusL5_s": "Cornidovirineae, taxid:2499399",
  "VirusL6_s": "Coronaviridae, taxid:11118",
  "VirusL7_s": "Orthocoronavirinae, taxid:2501931",
  "VirusL8_s": "Betacoronavirus, taxid:694002",
  "VirusL9_s": "Sarbecovirus, taxid:2509511",
  "VirusL10_s": "Severe acute respiratory syndrome-related coronavirus, taxid:694009",
  "ViralHost_ss": [
    "human",
    "vertebrates"
  ],
  "GenomicMoltype_s": "ssRNA(+)",
  "SLen_i": 29903,
  "Flags_ss": [
    "refseq",
    "complete"
  ],
  "Flags_csv": "refseq, complete",
  "FlagsCount_i": 2,
  "SetAcc_s": "GCF_009858895.2",
  "Authors_ss": [
    "Wu,F.",
    "Zhao,S.",
    "Yu,B.",
    "Chen,Y.M.",
    "Wang,W.",
    "Song,Z.G.",
    "Hu,Y.",
    "Tao,Z.W.",
    "Tian,J.H.",
    "Pei,Y.Y.",
    "Yuan,M.L.",
    "Zhang,Y.L.",
    "Dai,F.H.",
    "Liu,Y.",
    "Wang,Q.M.",
    "Zheng,J.J.",
    "Xu,L.",
    "Holmes,E.C.",
    "Zhang,Y.Z.",
    "Baranov,P.V.",
    "Henderson,C.M.",
    "Anderson,C.B.",
    "Gesteland,R.F.",
    "Atkins,J.F.",
    "Howard,M.T.",
    "Robertson,M.P.",
    "Igel,H.",
    "Baertsch,R.",
    "Haussler,D.",
    "Ares,M. Jr.",
    "Scott,W.G.",
    "Williams,G.D.",
    "Chang,R.Y.",
    "Brian,D.A.",
    "Chen,Y.-M.",
    "Song,Z.-G.",
    "Tao,Z.-W.",
    "Tian,J.-H.",
    "Pei,Y.-Y.",
    "Zhang,Y.-L.",
    "Dai,F.-H.",
    "Wang,Q.-M.",
    "Zheng,J.-J.",
    "Zhang,Y.-Z."
  ],
  "Authors_csv": "Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y., Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H., Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C., Zhang,Y.Z., Baranov,P.V., Henderson,C.M., Anderson,C.B., Gesteland,R.F., Atkins,J.F., Howard,M.T., Robertson,M.P., Igel,H., Baertsch,R., Haussler,D., Ares,M. Jr., Scott,W.G., Williams,G.D., Chang,R.Y., Brian,D.A., Chen,Y.-M., Song,Z.-G., Tao,Z.-W., Tian,J.-H., Pei,Y.-Y., Zhang,Y.-L., Dai,F.-H., Wang,Q.-M., Zheng,J.-J., Zhang,Y.-Z.",
  "AuthorsCount_i": 44,
  "Country_s": "China",
  "Isolate_s": "Wuhan-Hu-1",
  "Lineage_s": "B",
  "Division_s": "VRL",
  "Keywords_ss": [
    "RefSeq"
  ],
  "KeywordsCount_i": 1,
  "TaxName_s": "Severe acute respiratory syndrome coronavirus 2",
  "Region_s": "Asia",
  "ParentAcc_s": "set:NC_045512",
  "SetPosition_i": 0,
  "SourceDB_s": "RefSeq",
  "Definition_s": "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
  "HostId_i": 9606,
  "CreateDate_dt": "2020-01-13T00:00:00Z",
  "CreateYear_i": 2020,
  "Genome_js": "[{\"id\": \"NC_045512.2\", \"segment\": null, \"proteins\": [{\"id\": \"YP_009724389.1\", \"name\": \"ORF1ab polyprotein\", \"location\": \"join(266..13468,13468..21555)\"}, {\"id\": \"YP_009725295.1\", \"name\": \"ORF1a polyprotein\", \"location\": \"266..13483\"}, {\"id\": \"YP_009724390.1\", \"name\": \"surface glycoprotein\", \"location\": \"21563..25384\"}, {\"id\": \"YP_009724391.1\", \"name\": \"ORF3a protein\", \"location\": \"25393..26220\"}, {\"id\": \"YP_009724392.1\", \"name\": \"envelope protein\", \"location\": \"26245..26472\"}, {\"id\": \"YP_009724393.1\", \"name\": \"membrane glycoprotein\", \"location\": \"26523..27191\"}, {\"id\": \"YP_009724394.1\", \"name\": \"ORF6 protein\", \"location\": \"27202..27387\"}, {\"id\": \"YP_009724395.1\", \"name\": \"ORF7a protein\", \"location\": \"27394..27759\"}, {\"id\": \"YP_009725318.1\", \"name\": \"ORF7b\", \"location\": \"27756..27887\"}, {\"id\": \"YP_009724396.1\", \"name\": \"ORF8 protein\", \"location\": \"27894..28259\"}, {\"id\": \"YP_009724397.2\", \"name\": \"nucleocapsid phosphoprotein\", \"location\": \"28274..29533\"}, {\"id\": \"YP_009725255.1\", \"name\": \"ORF10 protein\", \"location\": \"29558..29674\"}]}]",
  "MolType_s": "RNA",
  "ProtAcc_ss": [
    "YP_009724389",
    "YP_009725295",
    "YP_009724390",
    "YP_009724391",
    "YP_009724392",
    "YP_009724393",
    "YP_009724394",
    "YP_009724395",
    "YP_009725318",
    "YP_009724396",
    "YP_009724397",
    "YP_009725255"
  ],
  "ProtAccCount_i": 12,
  "UpdateDate_dt": "2020-07-18T00:00:00Z",
  "UpdateYear_i": 2020,
  "PubMed_ss": [
    "32015508",
    "15680415",
    "15630477",
    "10482585"
  ],
  "PubMed_csv": "32015508, 15680415, 15630477, 10482585",
  "PubMedCount_i": 4,
  "Completeness_s": "complete",
  "CountryFull_s": "China",
  "ProtNames_ss": [
    "ORF1ab polyprotein",
    "ORF1a polyprotein",
    "surface glycoprotein",
    "ORF3a protein",
    "envelope protein",
    "membrane glycoprotein",
    "ORF6 protein",
    "ORF7a protein",
    "ORF7b protein",
    "ORF8 protein",
    "nucleocapsid phosphoprotein",
    "ORF10 protein"
  ],
  "ProtNamesCount_i": 12,
  "IsolateParsed_s": "Wuhan-Hu-1",
  "NuclAcc_ss": [
    "NC_045512"
  ],
  "NuclAccCount_i": 1,
  "CollectionDate_dr": "2019-12",
  "CollectionYear_i": 2019,
  "SubmitterAffil_s": "National Center for Biotechnology Information, NIH",
  "BioProject_ss": [
    "PRJNA485481"
  ],
  "BioProject_csv": "PRJNA485481",
  "BioProjectCount_i": 1,
  "AccVer_s": "NC_045512.2",
  "CollectionDate_s": "2019-12",
  "SubmitterCountry_s": "USA",
  "CollectionDate_dt": "2019-12-01T00:00:00Z",
  "GenomeCompleteness_s": "complete",
  "SubmitterAffilFull_s": "National Center for Biotechnology Information, NIH",
  "BioProject_s": "PRJNA485481",
  "AccNV_s": "NC_045512",
  "id": "NC_045512",
  "SeqType_s": "Nucleotide",
  "FastaMD5_s": "4928f859a1822d291e0225206a0068c8",
  "live_i": 1,
  "ids_ss": [
    "GCF_009858895",
    "GCF_009858895.2",
    "NC_045512",
    "NC_045512.2",
    "PRJNA485481",
    "YP_009724389",
    "YP_009724390",
    "YP_009724391",
    "YP_009724392",
    "YP_009724393",
    "YP_009724394",
    "YP_009724395",
    "YP_009724396",
    "YP_009724397",
    "YP_009725255",
    "YP_009725295",
    "YP_009725318",
    "set:NC_045512"
  ],
  "gi_i": 1798174254,
  "_version_": 1773711315042304000
}
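One detail of the example record that is easy to miss: `Genome_js` is a string that itself contains JSON, so it needs a second decoding step before its protein list can be used. The sketch below shows that step on a trimmed copy of the record (two of the twelve proteins); it is an illustration, not code from the PR.

```python
import json

# Trimmed copy of the example record: only two fields, and only two of the
# twelve proteins, to keep the sketch short.
record = {
    "AccVer_s": "NC_045512.2",
    "Genome_js": (
        '[{"id": "NC_045512.2", "segment": null, "proteins": ['
        '{"id": "YP_009724390.1", "name": "surface glycoprotein", "location": "21563..25384"}, '
        '{"id": "YP_009724392.1", "name": "envelope protein", "location": "26245..26472"}]}]'
    ),
}

# Second decoding step: the field value is a JSON document stored as a string.
genomes = json.loads(record["Genome_js"])

for genome in genomes:
    for protein in genome["proteins"]:
        print(genome["id"], protein["id"], protein["name"], protein["location"])
```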
@@ -0,0 +1,55 @@
#!/bin/bash
# usage: fetch-from-ncbi-virus [options] <ncbi_taxon_id> <github_repo>
#
# Fetch metadata and nucleotide sequences from [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/)
# and output NDJSON records to stdout.
#
# options:
#
#   --filter=<filter_query>  Filter criteria to add as `fq` param values for the NCBI Virus URL
#                            May be specified multiple times.
#
#   --field=<output_column_name>:<ncbi_virus_field_name>  Metadata fields to add as `fl` param values for the NCBI Virus URL
#                            May be specified multiple times.
#
# Originally copied from "bin/fetch-from-genbank" in nextstrain/ncov-ingest:
# https://github.com/nextstrain/ncov-ingest/blob/2a5f255329ee5bdf0cabc8b8827a700c92becbe4/bin/fetch-from-genbank
#
set -euo pipefail

bin="$(dirname "$0")"


main() {
    declare -a filters
    declare -a fields

    for arg; do
        case "$arg" in
            --filter=*)
                filters+=("${arg#*=}")
                shift;;
            --field=*)
                fields+=("${arg#*=}")
                shift;;
            *)
                break;;
        esac
    done

    local ncbi_taxon_id="${1:?NCBI taxon id is required.}"
    local github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"

    local ncbi_virus_url
    ncbi_virus_url="$("$bin"/ncbi-virus-url --ncbi-taxon-id "$ncbi_taxon_id" --filters "${filters[@]}" --fields "${fields[@]}")"

    fetch "$ncbi_virus_url" "$github_repo" | "$bin"/csv-to-ndjson
}

fetch() {
    curl "$1" \
        --fail --silent --show-error --http1.1 \
        --header "User-Agent: https://github.com/$2 ([email protected])"
}

main "$@"
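The usage comment says filters become `fq` values and fields become `fl` values in the NCBI Virus URL, but `ncbi-virus-url` itself is not shown in this diff. As a rough sketch of how such a query string could be assembled: the base URL below is a placeholder, the filter syntax is a guess, the `accession` and `updated` to field mappings are illustrative, and repeating `fq` while comma-joining `fl` is an assumption rather than confirmed behavior of the script.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the real endpoint is chosen by ncbi-virus-url,
# which is not part of this diff.
BASE_URL = "https://example.invalid/ncbi-virus"


def build_url(filters, fields):
    """Build a query URL with one `fq` per filter and a single `fl` value.

    Assumption: `fq` may be repeated and `fl` takes a comma-joined list of
    `<output_column_name>:<ncbi_virus_field_name>` pairs.
    """
    params = [("fq", f) for f in filters]
    params.append(("fl", ",".join(fields)))
    return f"{BASE_URL}?{urlencode(params)}"


url = build_url(
    filters=["VirusLineageId_ss:2697049"],  # assumed filter syntax
    fields=["accession:AccVer_s", "updated:UpdateDate_dt"],  # illustrative mapping
)
print(url)
```

`urlencode` percent-encodes the `:` and `,` characters, which keeps the URL safe to pass to `curl` as a single argument.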
Do you know of a data dictionary for all the CSV/NDJSON fields returned? The "all-fields" example is amazing, but it would be great to have a bit more information about what each field represents, if such documentation exists (maybe it doesn't).
There's not really any documentation since this is not an official API. The NCBI Virus help page is the closest I can find, but that uses the "pretty" field names that they display on the web UI.
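In the absence of a data dictionary, one rough way to get oriented is to group field names by their trailing suffix, which appears to hint at the value type in the example record. The suffix meanings below are guesses inferred only from that one record (for instance, `_dr` is assumed to be a date range because `CollectionDate_dr` holds "2019-12"); they are not documented by NCBI.

```python
from collections import defaultdict

# Assumed meanings, inferred only from values in the example record.
SUFFIX_GUESSES = {
    "s": "string",
    "ss": "list of strings",
    "i": "integer",
    "l": "long integer",
    "d": "float",
    "dt": "timestamp",
    "dr": "date range",
    "csv": "comma-joined string",
    "js": "JSON-encoded string",
}


def group_by_suffix(field_names):
    """Group field names by the text after the last underscore."""
    groups = defaultdict(list)
    for name in field_names:
        _, sep, suffix = name.rpartition("_")
        key = suffix if sep and suffix in SUFFIX_GUESSES else "(no known suffix)"
        groups[key].append(name)
    return dict(groups)


# A handful of field names taken from the example record.
fields = [
    "Host_s", "HostLineage_ss", "SLen_i", "gi_l", "QualPct_d",
    "UpdateDate_dt", "CollectionDate_dr", "Authors_csv", "Genome_js",
    "id", "_version_",
]
grouped = group_by_suffix(fields)
for suffix, names in grouped.items():
    print(suffix, SUFFIX_GUESSES.get(suffix, "unknown"), names)
```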