Add duplicated scripts from pathogen repos #1
Comments
This moves the script to the vendored directory, which is a git subtree of https://github.com/nextstrain/ingest (currently at branch apply-geolocation-rules). The file suffix is removed to match how it appears in the other repos which use it.

As per [Overview of duplicated scripts](#1) this script also appears in:

* [monkeypox](https://github.com/nextstrain/monkeypox/blob/a1f0d7b757d323d87edcbe61c6c5ccfbdf47722c/ingest/bin/apply-geolocation-rules)
* [rsv](https://github.com/nextstrain/rsv/blob/ba171f4a43110382c38b6154be3febd50408d7bf/ingest/bin/apply-geolocation-rules)
* [dengue, branch new_ingest](https://github.com/nextstrain/dengue/blob/247b2fd897361f2548627de1d97d45fae4115c5c/ingest/bin/apply-geolocation-rules)

All three of those scripts are identical to each other. The script vendored here contains two code changes (whitespace removed from diffs):

**Ignore comment lines in the location-rules TSV**

```diff
< if line.lstrip()[0] == '#':
---
> if line.strip()=="" or line.lstrip()[0] == '#':
```

**Allow fields to be missing from the input NDJSON**

The script previously mandated that the input NDJSON had all four fields (region/country/division/location). This is relaxed here, with an empty string used if the field is not present.

```diff
< annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
---
> annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
```
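To make the effect of the two quoted changes concrete, here is a minimal standalone sketch. It is not the vendored script: `should_skip`, `raw_location_values`, `LOCATION_FIELDS`, and the sample inputs are hypothetical names and data invented for illustration. Only the two expressions marked "from the diff" are taken from the description above.

```python
# Hypothetical stand-ins for illustration; not the actual apply-geolocation-rules code.
LOCATION_FIELDS = ["region", "country", "division", "location"]


def should_skip(line):
    # From the diff: the blank-line check runs first, so the [0] index
    # is never applied to an empty string (which would raise IndexError).
    return line.strip() == "" or line.lstrip()[0] == "#"


def raw_location_values(record):
    # From the diff: a field missing from the NDJSON record falls back to ''
    # instead of raising KeyError.
    return [record.get(field, "") for field in LOCATION_FIELDS]


if __name__ == "__main__":
    # Made-up sample lines: a comment, a blank line, and one non-comment line.
    sample_lines = ["# a comment\n", "\n", "left\tright\n"]
    kept = [line for line in sample_lines if not should_skip(line)]
    print(kept)

    # A record carrying only two of the four location fields.
    print(raw_location_values({"region": "Asia", "country": "China"}))
```

Under the older condition, a blank line in the rules file would hit `line.lstrip()[0]` on an empty string, and a record lacking one of the four fields would fail on `record[field]`; the sketch shows both cases passing through.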
@joverlee521 thanks for making the checklist in the comment above! It'll be useful to have it continually updated. To make that easier, I've moved it to the main issue text.
In talking through #20 with @j23414, we realized that the version of the script in ncov-ingest adds clock_deviation, but that can be done separately from the joining.
Closing issue as we have resolved all of the listed duplicate scripts.
The first step in making this repository useful is to populate it with scripts that are currently manually copied around pathogen repos.
See shared GDoc for additional context and details on scripts.
Progress
This was originally created by @joverlee521 in #1 (comment).
Identical scripts (added in #6)
Diverged scripts with different versions used across workflows (binned into related groups):
Simple notify scripts (added in #8)
S3 interaction + notify scripts that depend on S3 files (added in #12)
Genbank interactions
Nextclade joining
join-metadata-and-clades (TBD)
Dropping custom Python script in favor of csvtk/tsv-utils commands (Replace join metadata and clades script with csvtk and tsv append mpox#207)
Potential augur curate scripts
apply-geolocation-rules
from monkeypox repo #4
Summary of differences
This is the original issue text from @jameshadfield.
Here's a quick scan of duplicated ingest scripts, using monkeypox as the "base", against 4 other ingest script directories:
Directories of scripts considered: