-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scripts for NCBI Virus #16
Conversation
Copied from https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/genbank-url The original script in ncov-ingest has already been removed because of the switch over to using NCBI Datasets CLI,¹ so I'm copying from the actively maintained version of the script in monkeypox. Subsequent copies of the script in other pathogen repos have all made changes to the URL params, which I will generalize in subsequent commits. ¹ nextstrain/ncov-ingest@60d603f
There are too many potential filters to add as individual options, so add a general filter option to allow custom filter criteria. Added based on the changes made in the dengue repo https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/genbank-url#L41
We have a set of core fields that are shared across viruses, but certain pathogens may require additional fields from NCBI Virus. Added based on the changes made in the dengue repo https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/genbank-url#L60
Copied from https://github.com/nextstrain/ncov-ingest/blob/6c7ae5d7d259d62b8a0f016f02be51a11090592a/bin/csv-to-ndjson Subsequent copies of the script that are functionally identical: - https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/csv-to-ndjson - https://github.com/nextstrain/dengue/blob/a6df20f0e8540a60a403eab81d80e517d1e31ee0/ingest/bin/csv-to-ndjson - https://github.com/nextstrain/zika/blob/7ec737b8d249302e9dc1e26f42e32696781e1488/ingest/bin/csv-to-ndjson Copy in the rsv repo¹ has been modified to use argparse to take in input and output option, which I do not think are necessary to port over to the centralized script which will just pipe data through stdin/stdout: Copy in the hepatitisB repo² has been modified to convert TSV to NDJSON by modifying the delimiter. I don't think we need to port over this changes since we can drop the `tsv-to-ndjson` script if we modify the workflow as described in nextstrain/hbv#9. ¹ https://github.com/nextstrain/rsv/blob/75d4c07c0b17693dd19b137a317cb24c434b83f3/ingest/bin/csv-to-ndjson.py ² https://github.com/nextstrain/hepatitisB/blob/1f38f623d493bbafc25a4ab60226a4bd59ef8f6d/ingest/scripts/tsv-to-ndjson.py
Copied from https://github.com/nextstrain/monkeypox/blob/12ccb428aabccd9271fe8c5e61533edd656474e4/ingest/bin/fetch-from-genbank The original script in ncov-ingest has already been removed because of the switch over to using NCBI Datasets CLI,¹ so I'm copying the actively maintained version of the script in monkeypox. I had considered going back to using a single Python script to fetch the data, but found that the requests library would still run into a ChunkedEncodingError as previously explained in nextstrain/ncov-ingest@6af81e7. Seems more reliable to just stick with curl. The following commit will update to use the new options for the ncbi-virus-url script. ¹ nextstrain/ncov-ingest@60d603f
('region', 'Region_s'), | ||
('location', 'CountryFull_s'), | ||
('collected', 'CollectionDate_s'), | ||
('submitted', 'CreateDate_dt'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding updated
as a central output column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would updated
be the date the most recent revision was released, as opposed to the first revision, which is submitted
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's my interpretation of the field. I can't find any official documentation though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in d304c91
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just a few non-blocking comments.
('region', 'Region_s'), | ||
('location', 'CountryFull_s'), | ||
('collected', 'CollectionDate_s'), | ||
('submitted', 'CreateDate_dt'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would updated
be the date the most recent revision was released, as opposed to the first revision, which is submitted
?
@@ -73,6 +73,9 @@ NCBI interaction scripts that are useful for fetching public metadata and sequen | |||
|
|||
- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file. | |||
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) or [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs. | |||
- [fetch-from-ncbi-virus](fetch-from-ncbi-virus) - Fetch metadata and nucleotide sequences from [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) and output NDJSON records to stdout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know of a data dictionary for all the csv/ndjson fields returned? The "all-fields" example is amazing, but it would be great if we had a bit more information about what they represent - in case this exists, maybe it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not really any documentation since this is not an official API. The NCBI Virus help page is the closest I can find, but that uses the "pretty" field names that they display on the web UI.
parser.add_argument("--filters", required=False, nargs="*", | ||
help="Filter criteria to add as `fq` param values. " + | ||
"Apply filters via the NCBI Virus UI and observe the network " + | ||
"activity to find the desired filter string.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our latest meeting with WA DOH, people were concerned whether the clean data that Nextstrain hosts at data.nextstrain.org/files might exclude certain records from NCBI.
Allowing customizable filters here might add to that concern, but there's already so many other points in ingest pipelines that might filter out records. I think we just need to be better about documenting these filters that are used to generate the clean data.
Switch over to the new ncbi-virus-url script that allows customizations of filters and fields.
Co-authored-by: Cornelius Roemer <[email protected]>
Useful for knowing when the record was last updated in NCBI, which @j23414 noted to be the "Modification date".¹ This will ultimately add a new field to the NDJSON output from fetch-from-ncbi-virus, but it should not affect the final metadata output of the ingest pipelines. Ingest pipelines usually define the final metadata output columns in the config and would need to be updated separately. ¹ nextstrain/mpox#104 (comment)
Add github_repo argument to allow customization of the User-Agent request header. Following the pattern of other scripts in this repo that requires the repo owner and name.
d304c91
to
aee5613
Compare
Merging to use as part of https://github.com/nextstrain/pathogen-repo-template. |
Description of proposed changes
Add scripts for downloading metadata and sequences from NCBI Virus.
See commits for details.
Related issue(s)
Subset of scripts listed in #1
Checklist