Add transform strain names script #20

j23414 · 2023-09-14T20:22:02Z

Description of proposed changes

Since these are used for multiple pathogen ingest pipelines, copied the following scripts from monkeypox:

transform-strain-names
~~join-metadata-and-clades.py~~ -> moved to [wip]: Add the functionality of join metadata and clades #23

Added some tests and documentation.

Related issue(s)

Subset of scripts listed in #1

Checklist

Checks pass
If adding a script, add an entry for it in the README.

joverlee521

Thanks for working on these @j23414!

I've left a comment on modifications to the join-metadata-and-clades script that I think are needed for it to be used for other pathogens.

I would also like a little more detail in the commit messages. It would be helpful to point to the permalinks of the original scripts, similar to 3b69a10.

joverlee521 · 2023-09-15T17:19:21Z

join-metadata-and-clades.py

+
+column_map = {
+    "clade": "clade",
+    "outbreak": "outbreak",


I believe "outbreak" is a very specific column for mpox.

I expect that each pathogen can have unique columns in the Nextclade output (e.g. ncov-ingest includes SC2-specific columns). We should probably make the column map customizable to support these. It might be easiest to add an input for a TSV column map that can override the default defined in the script?

Added column map customization in 7df1636.
I moved the old columns into the default map, but I'm also open to dropping the defaults altogether.
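For readers following the thread, here is a minimal sketch of the TSV override idea discussed above. It is not the code from 7df1636. The two-column, headerless TSV layout, the `load_column_map` helper, the `keep_defaults` switch, and the `example_column` name are all assumptions made for illustration; only the two default entries come from the diff excerpt.

```python
import csv
import io

# Default entries copied from the diff excerpt in this thread.
DEFAULT_COLUMN_MAP = {
    "clade": "clade",
    "outbreak": "outbreak",
}


def load_column_map(tsv_file, defaults=DEFAULT_COLUMN_MAP, keep_defaults=True):
    """Build a {nextclade_column: metadata_column} map from a TSV file object.

    Assumed layout (hypothetical): two tab-separated columns, no header row.
    Entries in the TSV override defaults that share the same source column.
    """
    column_map = dict(defaults) if keep_defaults else {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"Expected 2 tab-separated columns, got {row!r}")
        source, target = row
        column_map[source] = target
    return column_map


if __name__ == "__main__":
    # "example_column" is a placeholder, not a real Nextclade column.
    custom = io.StringIO("clade\tclade_membership\nexample_column\texample\n")
    print(load_column_map(custom))
```

Passing `keep_defaults=False` corresponds to the "drop the defaults altogether" option mentioned in the reply: the map then contains only what the TSV provides.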

joverlee521 · 2023-09-15T17:24:44Z

tests/join-metadata-and-clades/join-metadata-and-clades.t

+Check whether join-metadata-clades script produces an output metadata file, but do not assess the accuracy or validity of that output file.
+
+  $ python $TESTDIR/../../join-metadata-and-clades.py \
+  >  --id-field strain \
+  >  --metadata metadata_raw.tsv \
+  >  --nextclade nextclade.tsv \
+  >  -o test_metadata.tsv


We should do at least some minor checking of the output file here. I think at minimum we want to ensure that all of the sequences in the metadata file are included in the output.

Added a minor check in a5fa32e
I'm happy to discuss a more detailed checking strategy; I was unsure about the level of detail.

The general strategy is just to ensure that the output has the expected format. If we wanted to cover all of our bases, we could include an expected output file and just diff the output with the expected output file.
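The minimum check suggested above (every sequence in the metadata file also appears in the output) could be sketched as follows. This is not the check added in a5fa32e. The `read_ids` and `missing_ids` helpers are hypothetical, and the sketch assumes both files are TSVs with a header row that share an id column such as `strain`, as in the test invocation.

```python
import csv
import io


def read_ids(tsv_file, id_field):
    """Collect the set of ids found in the id_field column of a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[id_field] for row in reader}


def missing_ids(metadata_file, output_file, id_field="strain"):
    """Return the sorted ids present in the metadata but absent from the output."""
    expected = read_ids(metadata_file, id_field)
    actual = read_ids(output_file, id_field)
    return sorted(expected - actual)


if __name__ == "__main__":
    # Toy inputs for illustration only.
    metadata = io.StringIO("strain\tdate\nA\t2023-01-01\nB\t2023-01-02\n")
    output = io.StringIO("strain\tdate\tclade\nA\t2023-01-01\tX\n")
    print(missing_ids(metadata, output))
```

An empty result means no sequences were dropped by the join. The stricter option, diffing against a stored expected output file, would also catch wrong column values, at the cost of updating that file whenever the output format changes.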

j23414 · 2023-09-15T22:03:24Z

@joverlee521, ready for review. Thanks for the comments and still open for discussion!

Additional edits include:

parameterizing the nextclade id so it's not hardcoded as "seqName" c2f5e0e
dropping an unused variable 6a64136
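As an illustration of the first edit, parameterizing the Nextclade id might look roughly like the sketch below. It is not the code from c2f5e0e. The `--nextclade-id-field` flag name and the `index_by_id` helper are assumptions; only the previously hardcoded value `"seqName"` comes from the comment above.

```python
import argparse


def build_parser():
    """Parser with a hypothetical flag for the Nextclade id column."""
    parser = argparse.ArgumentParser(
        description="Sketch: make the Nextclade id column configurable.")
    # Assumed flag name; the default keeps the previously hardcoded value.
    parser.add_argument(
        "--nextclade-id-field",
        default="seqName",
        help="Column in the Nextclade TSV that holds the sequence id.")
    return parser


def index_by_id(rows, id_field):
    """Key each row (a dict) by the value in its id column."""
    return {row[id_field]: row for row in rows}


if __name__ == "__main__":
    # Explicit argument list so the demo does not depend on sys.argv.
    args = build_parser().parse_args(["--nextclade-id-field", "name"])
    rows = [{"name": "A", "clade": "X"}]
    print(index_by_id(rows, args.nextclade_id_field))
```

Keeping `"seqName"` as the default means existing invocations behave as before, while other pathogens can point the script at a different id column.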

j23414 · 2023-09-15T23:10:28Z

Through discussion with @joverlee521, realized that join-metadata-and-clades.py requires more thought. Therefore the commits were moved to a draft PR #23.

This PR has been rebased to only involve the transform-strain-names script.

joverlee521 · 2023-09-15T23:16:12Z

Oh, one last thing to note: I had dropped the automated Cram tests in 6e955d7.

We should add them back since this PR adds new Cram tests.

j23414 changed the title ~~Transform strain names~~ Add transform strain names and join metadata clades scripts Sep 14, 2023

j23414 self-assigned this Sep 14, 2023

j23414 requested a review from joverlee521 September 14, 2023 23:37

joverlee521 requested changes Sep 15, 2023

joverlee521 mentioned this pull request Aug 21, 2023

Add duplicated scripts from pathogen repos #1

Closed

22 tasks

j23414 force-pushed the transform_strain_names branch 2 times, most recently from 08c108a to 6a64136 Compare September 15, 2023 21:48

j23414 force-pushed the transform_strain_names branch from 6a64136 to 2fa87e3 Compare September 15, 2023 22:56

j23414 changed the title ~~Add transform strain names and join metadata clades scripts~~ Add transform strain names script Sep 15, 2023

j23414 mentioned this pull request Sep 15, 2023

[wip]: Add the functionality of join metadata and clades #23

Closed

2 tasks

j23414 added 2 commits September 15, 2023 16:44

Copy transform-strain-names from monkeypox

a14a350

Copied from: https://github.com/nextstrain/monkeypox/blob/5969604dfe426745b789746427b580c69d484790/ingest/bin/transform-strain-names

Add Cram tests

6f196f7

j23414 force-pushed the transform_strain_names branch from 53b8d49 to 6f196f7 Compare September 15, 2023 23:45

joverlee521 approved these changes Sep 18, 2023

j23414 merged commit c02fa81 into main Sep 18, 2023

j23414 deleted the transform_strain_names branch September 18, 2023 19:50

j23414 mentioned this pull request Sep 20, 2023

Use centralized script for transform strain names nextstrain/mpox#182

Merged

2 tasks
