Change minimum genome length to 75% #52

Merged on Jan 9, 2025 (3 commits)

@DOH-LMT2303 DOH-LMT2303 commented Dec 20, 2024

Description of proposed changes

Change minimum genome length from 90% to 75%.

Related issue(s)

Checklist

  • Checks pass
  • Add documentation on 75% genome length threshold

@j23414 j23414 self-requested a review December 20, 2024 22:31

@j23414 j23414 left a comment

Just posting that the global tree (from GitHub Action) looks reasonable: https://next.nextstrain.org/staging/WNV/trials/75percent/WNV/genome

@@ -74,4 +74,8 @@ For global lineage designations, we query [pathoplexus](https://pathoplexus.org/
### Host mapping to Host Genus and Host Type
We further refined the information in the NCBI Host column by categorizing it into **Host_Genus** and **Host_Type**, creating broader groupings for more effective data analysis. For example, the **Host** _Homo sapiens_ is classified under **Host_Genus** as _Homo_ and **Host_Type** as Human. This broader categorization is particularly useful for visualizing the phylogenetic tree. Instead of distinguishing between individual mosquito species, you can use the broader categories like **Host_Genus** _Culex_ or the higher-level category **Host_Type** Mosquito to color the tips of the tree.

### Determination of Minimum Genome Length
The average genome length of WNV is 10,948 bp. Nextstrain's phylogenetic workflow excludes sequences with less than 90% genome coverage by default, because alignments of short sequences can be unreliable. However, because few WNV sequences are available in NCBI, we evaluated minimum genome length thresholds of 90% (9,800 bp), 80% (8,700 bp), 75% (8,200 bp), and 70% (7,700 bp). For each threshold, we ran the Washington-focused build and compared (1) the number of sequences included, (2) the locations of data gaps in the alignment files, inspected with an alignment viewer, and (3) the topology and lineage assignments in the resulting phylogenetic trees. We concluded that a minimum genome length of 75% (8,200 bp) best balanced the number of sequences included against alignment quality. Lastly, we validated this threshold using the global build.
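
As a rough illustration of the arithmetic behind these thresholds, here is a minimal Python sketch. It is not taken from the workflow: the function names `min_length_bp` and `count_passing` and the example sequence lengths are hypothetical, and the only value from the text is the 10,948 bp average genome length. The exact cutoffs it computes differ slightly from the rounded figures quoted above (for example, 75% of 10,948 bp is 8,211 bp).

```python
# Average WNV genome length in bp, as stated in the documentation text.
AVERAGE_GENOME_LENGTH = 10_948


def min_length_bp(percent: int, genome_length: int = AVERAGE_GENOME_LENGTH) -> int:
    """Return the minimum sequence length (bp) for a coverage percentage.

    Integer arithmetic avoids floating-point surprises; the result is
    floored to a whole base pair.
    """
    return genome_length * percent // 100


def count_passing(lengths: list[int], percent: int) -> int:
    """Count how many sequence lengths meet the cutoff for `percent`."""
    cutoff = min_length_bp(percent)
    return sum(1 for n in lengths if n >= cutoff)


# Hypothetical sequence lengths, for illustration only.
example_lengths = [10_948, 9_000, 8_300, 7_000]

for percent in (90, 80, 75, 70):
    cutoff = min_length_bp(percent)
    kept = count_passing(example_lengths, percent)
    print(f"{percent}% -> {cutoff} bp cutoff, {kept} of {len(example_lengths)} kept")
```

Comparing the counts across thresholds this way only mirrors step (1) of the evaluation; the alignment and tree comparisons in steps (2) and (3) still need to be inspected by hand.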

Just curious, was the differential length validated against the global build at all? (for example, by looking at 70% and 75% or something, not the full range…)

> Lastly, we validated this threshold using the global build.

Thanks for asking! The final step was to validate the 75% genome length threshold against the global build, checking that lowering the threshold did not drastically alter the tree topology, apart from regions that are nearly polytomies.

I believe @DOH-LMT2303 performed the global tree validation manually. For this and similar PRs, I’ve developed the habit of triggering the phylogenetic workflow GitHub Action with a trial name (e.g., 75percent) for the branch under development.

@j23414 j23414 left a comment

Looks reasonable, I think this is sufficiently ready to merge. We might want to explore moving subsampling documentation into the config file as comments or the description.md files, but I'm willing to defer that conversation and decision.

@DOH-LMT2303 DOH-LMT2303 marked this pull request as ready for review January 9, 2025 20:40
@DOH-LMT2303 DOH-LMT2303 merged commit 42e5062 into main Jan 9, 2025
4 checks passed
@DOH-LMT2303 DOH-LMT2303 deleted the adjust_genome_length branch January 9, 2025 20:57

j23414 commented Jan 9, 2025

Post merge, adding some documentation about why the state threshold was applied to the global tree as well:

We applied the same threshold to both state and global builds to keep things simple and consistent. The goal was to avoid having different arbitrary thresholds unless there was a clear reason to do so.
