Change minimum genome length to 75% #52
49563a2
to
7ce3591
Compare
Just posting that the global tree (from GitHub Action) looks reasonable: https://next.nextstrain.org/staging/WNV/trials/75percent/WNV/genome
@@ -74,4 +74,8 @@ For global lineage designations, we query [pathoplexus](https://pathoplexus.org/
### Host mapping to Host Genus and Host Type
We further refined the information in the NCBI Host column by categorizing it into **Host_Genus** and **Host_Type**, creating broader groupings for more effective data analysis. For example, the **Host** _Homo sapiens_ is classified under **Host_Genus** as _Homo_ and **Host_Type** as Human. This broader categorization is particularly useful for visualizing the phylogenetic tree. Instead of distinguishing between individual mosquito species, you can use the broader categories like **Host_Genus** _Culex_ or the higher-level category **Host_Type** Mosquito to color the tips of the tree.
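
As a rough illustration of the Host to Host_Genus / Host_Type mapping described above, here is a minimal Python sketch. It is not the repository's implementation: the `HOST_TYPE_BY_GENUS` table, the `map_host` helper, and the `"Other"` fallback are hypothetical, and taking the first word of the host name as the genus is an assumption that only holds for binomial names.

```python
# Hypothetical lookup table; the real mapping in the repository may differ.
HOST_TYPE_BY_GENUS = {
    "Homo": "Human",
    "Culex": "Mosquito",
}


def map_host(host):
    """Derive Host_Genus and Host_Type from an NCBI-style Host string.

    Assumes the host is a binomial name whose first word is the genus.
    Unknown genera fall back to "Other" (an assumption, not from the source).
    """
    words = host.split()
    genus = words[0].capitalize() if words else ""
    return {
        "Host": host,
        "Host_Genus": genus,
        "Host_Type": HOST_TYPE_BY_GENUS.get(genus, "Other"),
    }


if __name__ == "__main__":
    for name in ["Homo sapiens", "Culex pipiens"]:
        print(map_host(name))
```

Coloring tree tips by `Host_Type` then groups every _Culex_ species under a single Mosquito category.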
### Determination of Minimum Genome Length
The average genome length of WNV is 10,948 bp. Nextstrain's phylogenetic workflow defaults to excluding sequences with less than 90% genome coverage, as the alignment of short sequences can be unreliable. However, due to the limited number of WNV sequences available in NCBI, we evaluated minimum genome length thresholds of 90% (9,800 bp), 80% (8,700 bp), 75% (8,200 bp), and 70% (7,700 bp). For each threshold, we ran the Washington-focused build and compared: (1) the number of sequences included, (2) data gap locations in the alignment files using an alignment viewer, and (3) the topology and lineage assignments from the phylogenetic tree outputs to determine the optimal threshold. We concluded that a minimum genome length of 75% (8,200 bp) included more sequences while maintaining acceptable alignment quality. Lastly, we validated this threshold using the global build.
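
To make the threshold arithmetic above concrete, here is a minimal Python sketch that converts a coverage fraction into a base-pair cutoff and counts how many sequences would pass it. It is not the workflow's code: `min_length`, `acgt_length`, and `count_passing` are hypothetical helpers, and counting only A/C/G/T bases toward length is an assumption. The exact cutoffs it computes differ slightly from the rounded figures quoted in the text.

```python
import math

GENOME_LENGTH = 10_948  # average WNV genome length quoted in the text


def min_length(fraction, genome_length=GENOME_LENGTH):
    """Smallest whole number of bases covering at least `fraction` of the genome."""
    return math.ceil(fraction * genome_length)


def acgt_length(sequence):
    """Count unambiguous bases only (assumption: N and gap characters do not count)."""
    upper = sequence.upper()
    return sum(upper.count(base) for base in "ACGT")


def count_passing(sequences, fraction):
    """Number of sequences whose unambiguous length meets the cutoff."""
    cutoff = min_length(fraction)
    return sum(1 for seq in sequences.values() if acgt_length(seq) >= cutoff)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real WNV data.
    toy = {
        "long": "ACGT" * 2500,
        "medium": "A" * 8300,
        "gappy": "A" * 8000 + "N" * 500,
    }
    for fraction in (0.90, 0.80, 0.75, 0.70):
        print(fraction, min_length(fraction), count_passing(toy, fraction))
```

Running the same count at each candidate threshold gives the first of the three comparisons listed above; the alignment and topology checks still need manual review.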
Just curious, was the differential length validated against the global build at all? (for example, by looking at 70% and 75% or something, not the full range…)
> Lastly, we validated this threshold using the global build.
Thanks for asking! The final step involved validating the 75% genome length threshold against the global build to check that lowering the length threshold did not drastically alter the tree topology, other than in nearly polytomous regions.
I believe @DOH-LMT2303 performed the global tree validation manually. For this and similar PRs, I’ve developed the habit of triggering a phylogenetic workflow GitHub Action with a trial name (e.g., `75percent`) for the branch under development.
Looks reasonable, I think this is sufficiently ready to merge. We might want to explore moving subsampling documentation into the config file as comments or the description.md files, but I'm willing to defer that conversation and decision.
Post merge, adding some documentation about why the state threshold was applied to the global tree as well:
Description of proposed changes
Change minimum genome length from 90% to 75%.
Related issue(s)
Checklist