Add transform strain names script #20
Thanks for working on these @j23414!
I've left a comment on modifications to the join-metadata-and-clades script that I think are needed for it to be used for other pathogens.
I would also like a little more detail in the commit messages. It would be helpful to point to the permalinks of the original scripts, similar to 3b69a10.
join-metadata-and-clades.py
column_map = {
    "clade": "clade",
    "outbreak": "outbreak",
I believe "outbreak" is a very specific column for mpox.
I expect that each pathogen can have unique columns in the Nextclade output (e.g. ncov-ingest includes SC2 specific columns). We should probably make the column map customizable to support these. It might be easiest to add an input for a TSV column map that can override the default defined in the script?
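One way the suggested override could look, as a minimal sketch only. The function name `resolve_column_map`, the headerless two-column TSV layout, and the example column names are assumptions for illustration, not what the script actually does:

```python
import csv
import io

# Defaults taken from the snippet under review; other pathogens would differ.
DEFAULT_COLUMN_MAP = {"clade": "clade", "outbreak": "outbreak"}


def resolve_column_map(handle=None):
    """Return the Nextclade-column -> output-column map.

    With no handle, fall back to a copy of the defaults. Otherwise read a
    headerless, two-column TSV (assumed layout) and use it instead.
    """
    if handle is None:
        return dict(DEFAULT_COLUMN_MAP)
    column_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        source, target = row
        column_map[source] = target
    return column_map


# Hypothetical override for another pathogen; column names are illustrative.
example_tsv = io.StringIO(
    "clade\tNextstrain_clade\nNextclade_pango\tNextclade_pango\n"
)
custom_map = resolve_column_map(example_tsv)
```

In this sketch a supplied TSV replaces the defaults entirely rather than merging with them, so a pathogen without an "outbreak" column does not inherit it.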
Added column map customization in 7df1636
I moved the old columns to the defaults, but I'm also open to dropping the defaults altogether.
Check whether join-metadata-clades script produces an output metadata file, but do not assess the accuracy or validity of that output file.

  $ python $TESTDIR/../../join-metadata-and-clades.py \
  >   --id-field strain \
  >   --metadata metadata_raw.tsv \
  >   --nextclade nextclade.tsv \
  >   -o test_metadata.tsv
We should do at least some minor checking of the output file here. I think at minimum we want to ensure that all of the sequences in the metadata file are included in the output.
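A minimal sketch of such a check, assuming both files are TSVs with a header row and share the ID column passed as `--id-field`. The helper name `missing_ids` and the toy rows are made up for illustration:

```python
import csv
import io


def missing_ids(metadata_handle, output_handle, id_field="strain"):
    """Return metadata IDs that do not appear in the joined output."""
    metadata_ids = {
        row[id_field] for row in csv.DictReader(metadata_handle, delimiter="\t")
    }
    output_ids = {
        row[id_field] for row in csv.DictReader(output_handle, delimiter="\t")
    }
    return sorted(metadata_ids - output_ids)


# Toy data: record "B" is in the metadata but was dropped from the output.
metadata = io.StringIO("strain\tdate\nA\t2022-01-01\nB\t2022-01-02\n")
output = io.StringIO("strain\tdate\tclade\nA\t2022-01-01\tclade-1\n")
dropped = missing_ids(metadata, output)
```

An empty result would mean every record in the metadata file survived the join; the same comparison could also be written as a short shell check inside the Cram test.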
Added a minor check in a5fa32e
I'm happy to discuss a more detailed checking strategy; I was unsure about the level of detail.
The general strategy is just to ensure that the output has the expected format. If we wanted to cover all of our bases, we could include an expected output file and just diff the output with the expected output file.
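The expected-output comparison could be sketched as follows, using the standard-library `difflib`. The helper name and the `expected_metadata.tsv` label are hypothetical; `test_metadata.tsv` is the output name used in the test above:

```python
import difflib


def diff_against_expected(actual_text, expected_text):
    """Return unified-diff lines; an empty list means the texts match."""
    return list(
        difflib.unified_diff(
            expected_text.splitlines(keepends=True),
            actual_text.splitlines(keepends=True),
            fromfile="expected_metadata.tsv",  # hypothetical fixture name
            tofile="test_metadata.tsv",
        )
    )
```

In a Cram test the same idea is usually a plain `diff` of the two files, which prints nothing when they match.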
08c108a to 6a64136
@joverlee521, ready for review. Thanks for the comments and still open for discussion! Additional edits include:
6a64136 to 2fa87e3
Through discussion with @joverlee521, realized that This PR has been rebased to only involve the
Oh, one last thing to note: I had dropped the automated Cram tests in 6e955d7. We should add them back since this PR adds new Cram tests.
53b8d49 to 6f196f7
Description of proposed changes
Since these are used for multiple pathogen ingest pipelines, copied the following scripts from monkeypox:
join-metadata-and-clades.py -> moved to "[wip]: Add the functionality of join metadata and clades" #23
Added some tests and documentation.
Related issue(s)
Subset of scripts listed in #1
Checklist