Move apply-geolocation-rules to ingest subtree
This moves the script to the vendored directory, which is a git subtree
of https://github.com/nextstrain/ingest (currently at branch
apply-geolocation-rules). The file suffix is removed to match how it
appears in the other repos which use it.

As per [Overview of duplicated scripts](nextstrain/ingest#1)
this script also appears in:
* [monkeypox](https://github.com/nextstrain/monkeypox/blob/a1f0d7b757d323d87edcbe61c6c5ccfbdf47722c/ingest/bin/apply-geolocation-rules)
* [rsv](https://github.com/nextstrain/rsv/blob/ba171f4a43110382c38b6154be3febd50408d7bf/ingest/bin/apply-geolocation-rules)
* [dengue, branch new_ingest](https://github.com/nextstrain/dengue/blob/247b2fd897361f2548627de1d97d45fae4115c5c/ingest/bin/apply-geolocation-rules)

All three of those scripts are identical to each other. The script vendored
here contains two code changes (whitespace removed from diffs):

**Ignore blank lines (as well as comment lines) in the location-rules TSV**
```diff
< if line.lstrip()[0] == '#':
---
> if line.strip()=="" or line.lstrip()[0] == '#':
```
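
To illustrate why the extra check matters: indexing `[0]` into an empty string raises `IndexError`, so a blank line in the rules file would crash the original test. The sketch below is only an illustration of the two conditions; the function names `is_comment_old` and `should_skip` are hypothetical and do not appear in the script.

```python
def is_comment_old(line: str) -> bool:
    # Original condition: fails on blank lines because ''.lstrip()[0] does not exist.
    return line.lstrip()[0] == '#'


def should_skip(line: str) -> bool:
    # Updated condition: the blank-line test short-circuits before any indexing happens.
    return line.strip() == "" or line.lstrip()[0] == '#'


if __name__ == "__main__":
    for raw in ["# a comment\n", "   # indented comment\n", "\n", "a/b/c/d\tw/x/y/z\n"]:
        print(repr(raw), should_skip(raw))
```

The order of the two tests matters: swapping them would reintroduce the `IndexError` on blank lines.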

**Allow fields to be missing from the input NDJSON**

The script previously mandated that the input NDJSON had all four
fields (region/country/division/location). This is relaxed here, with
an empty string used if the field is not present.

```diff
< annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
---
> annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
```
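
A minimal sketch of the difference, assuming `location_fields` holds the four fields named above in that order (the helper names `strict_values` and `lenient_values` are hypothetical and used only for illustration):

```python
import json

# Assumed field order, taken from the description above.
location_fields = ["region", "country", "division", "location"]


def strict_values(record: dict) -> list:
    # Previous behaviour: raises KeyError if any field is absent.
    return [record[field] for field in location_fields]


def lenient_values(record: dict) -> list:
    # New behaviour: a missing field is treated as an empty string.
    return [record.get(field, '') for field in location_fields]


if __name__ == "__main__":
    # One NDJSON line that lacks "division" and "location".
    record = json.loads('{"region": "Europe", "country": "France"}')
    print(lenient_values(record))
```

Note that `dict.get` only substitutes the default when the key is absent; a field that is present with a `null` value in the NDJSON is still passed through as `None`.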
jameshadfield committed Jul 11, 2023
1 parent ee78e39 commit 7d507f7
Showing 2 changed files with 1 addition and 13 deletions.
2 changes: 1 addition & 1 deletion ingest/ingest.smk
@@ -91,7 +91,7 @@ rule transform_metadata:
"""
ingest/scripts/tsv-to-ndjson.py < {input.metadata} |
ingest/scripts/fix_country_field.py |
- ingest/scripts/apply-geolocation-rules.py --geolocation-rules ingest/config/geoLocationRules.tsv |
+ ingest/vendored/apply-geolocation-rules --geolocation-rules ingest/config/geoLocationRules.tsv |
ingest/scripts/add-year.py |
ingest/scripts/ndjson-to-tsv.py --metadata-columns {params.metadata_columns} --metadata {output.metadata}
"""
@@ -5,18 +5,6 @@
any additional transformations on top of the user curations.
"""

"""
Copied from https://github.com/nextstrain/monkeypox/blob/62fca491c6775573ad036eedf34b271b4952f2c2/ingest/bin/apply-geolocation-rules
with two changes:
First change allows missing fields in the input ndjson
- annotated_values = transform_geolocations(geolocation_rules, [record.[field] for field in location_fields])
+ annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
Second change allows blank lines in the location-rules TSV
- if line.lstrip()[0] == '#':
+ if line.strip()=="" or line.lstrip()[0] == '#':
"""

import argparse
import json