Copy transform-strain-names from monkeypox
j23414 committed Sep 15, 2023
1 parent 9cfed8b commit 2fa87e3
Showing 3 changed files with 68 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -96,6 +96,7 @@ Potential augur curate scripts
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.

## Software requirements

17 changes: 17 additions & 0 deletions tests/transform-strain-names/transform-strain-names.t
@@ -0,0 +1,17 @@
Look for strain name in "strain" or a list of backup fields.

If strain entry exists, do not do anything.

  $ echo '{"strain": "i/am/a/strain", "strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >       --strain-regex '^.+$' \
  >       --backup-fields strain_s accession
  {"strain":"i/am/a/strain","strain_s":"other"}

If strain entry does not exist, search the backup fields in the order given.

  $ echo '{"strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >       --strain-regex '^.+$' \
  >       --backup-fields accession strain_s
  {"strain_s":"other","strain":"other"}
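The two cases above, plus one the test file does not cover (a `strain` value that fails a stricter regex), can be sketched in plain Python. This is a minimal illustration, not code from the commit: `pick_strain` is a hypothetical helper that mirrors the script's lookup logic, and the `^\S+$` pattern and the `ABC123` accession are made-up inputs.

```python
import re


def pick_strain(record, strain_regex="^.+$", backup_fields=()):
    """Return a copy of `record` with 'strain' resolved the way the script does.

    Hypothetical helper for illustration; the commit inlines this logic
    in the script's main loop.
    """
    pattern = re.compile(strain_regex)
    result = dict(record)
    if pattern.match(result.get("strain", "")) is None:
        # A missing or non-matching strain is reset to an empty string,
        # then the backup fields are tried in the order they were given.
        result["strain"] = ""
        for field in backup_fields:
            if result.get(field):
                result["strain"] = str(result[field])
                break
    return result


# Case 1 from the test file: an existing, matching strain is left untouched.
print(pick_strain({"strain": "i/am/a/strain", "strain_s": "other"},
                  backup_fields=["strain_s", "accession"]))

# Case 2 from the test file: no strain, so the first non-empty backup wins.
print(pick_strain({"strain_s": "other"},
                  backup_fields=["accession", "strain_s"]))

# Not covered by the tests: a strain that fails a stricter regex is replaced.
print(pick_strain({"strain": "bad name", "accession": "ABC123"},
                  strain_regex=r"^\S+$", backup_fields=["accession"]))
```

In the third call the original `strain` value is overwritten, which matches the `--strain-regex` help text ("Strain names that do not match the pattern will be dropped").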
50 changes: 50 additions & 0 deletions transform-strain-names
@@ -0,0 +1,50 @@
#!/usr/bin/env python3
"""
Verifies strain name pattern in the 'strain' field of the NDJSON record from
stdin. Adds a 'strain' field to the record if it does not already exist.
Outputs the modified records to stdout.
"""
import argparse
import json
import re
from sys import stderr, stdin, stdout


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--strain-regex", default="^.+$",
        help="Regex pattern for strain names. " +
             "Strain names that do not match the pattern will be dropped.")
    parser.add_argument("--backup-fields", nargs="*",
        help="List of backup fields to use as strain name if the value in 'strain' " +
             "does not match the strain regex pattern. " +
             "If multiple fields are provided, will use the first field that has a non-empty string.")

    args = parser.parse_args()

    strain_name_pattern = re.compile(args.strain_regex)

    for index, record in enumerate(stdin):
        record = json.loads(record)

        # Verify strain name matches the strain regex pattern
        if strain_name_pattern.match(record.get('strain', '')) is None:
            # Default to empty string if not matching pattern
            record['strain'] = ''
            # Use non-empty value of backup fields if provided
            if args.backup_fields:
                for field in args.backup_fields:
                    if record.get(field):
                        record['strain'] = str(record[field])
                        break

        if record['strain'] == '':
            print(f"WARNING: Record number {index} has an empty string as the strain name.", file=stderr)


        json.dump(record, stdout, allow_nan=False, indent=None, separators=',:')
        print()
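One edge case in the script as committed: `record.get('strain', '')` falls back to `''` only when the key is absent. A record containing `"strain": null` (or a number) passes a non-string to `match()`, which raises `TypeError`. The guard below is a hedged sketch of one way to treat such values as non-matching; `matches_strain` is a hypothetical name and is not part of this commit.

```python
import re


def matches_strain(value, pattern):
    """Return True only when `value` is a string that matches `pattern`.

    Assumption for this sketch: non-string values (None, numbers) are
    treated as non-matching rather than raising TypeError.
    """
    return isinstance(value, str) and pattern.match(value) is not None


pattern = re.compile("^.+$")
for value in ["i/am/a/strain", "", None, 42]:
    print(repr(value), matches_strain(value, pattern))
```

With a check like this in place of the bare `match()` call, a null `strain` would fall through to the backup fields in the same way a missing one does.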
