Phylogenetic: Increase sampling in the all tree to nearly 4K tips #91

Merged 2 commits on Jan 17, 2025

@j23414 j23414 commented Jan 13, 2025

Description of proposed changes

Increased the subsampling of the all trees so that their tip counts align more closely with the serotype-specific trees (approximately 4K tips). After exploring a few implementation approaches, this one was selected as the most straightforward and effective for now.

Local tests show the following subsampling results:

  • genome: 3869 strains passed all filters
  • E: 3931 strains passed all filters
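
As a quick sanity check on these numbers, a small sketch like the following could parse the "strains passed all filters" lines quoted above and confirm that each count lands near the 4K target. The exact target of 4000 and the 5% tolerance are assumptions for illustration, not values taken from this PR, and the helper names are hypothetical.

```python
import re

# Matches the summary line quoted in the local test results.
PASSED = re.compile(r"(\d+) strains passed all filters")


def strains_passed(log_line):
    """Return the strain count from a filter summary line, or None if absent."""
    match = PASSED.search(log_line)
    return int(match.group(1)) if match else None


def near_target(count, target=4000, tolerance=0.05):
    """True if count is within tolerance (a fraction) of target. Both defaults are assumed."""
    return abs(count - target) <= tolerance * target


results = {
    "genome": "3869 strains passed all filters",
    "E": "3931 strains passed all filters",
}
for build, line in results.items():
    count = strains_passed(line)
    print(build, count, near_target(count))
```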

A few implementation approaches attempted:

[Edit: Added later, sorry I should have written more detail. Questions or suggestions are still welcome]

  1. sequences_per_group: Increase the all sequences_per_group parameter to get approximately 4k samples.
    1. This approach is implemented in this PR. --> Switched to approach 2
  2. subsample_max_sequences: Simplify the subsampling logic by setting subsample_max_sequences to ~4k samples for both the all and serotype subsampling.
    1. This approach resulted in a loss of temporal signal, making it difficult to infer a reasonable molecular clock rate.
    2. See the x-axis for the tangletree on the left as compared to the original on the right.
    3. [Edit: Using both --subsample-max-sequences and --group-by fixed the problem]
  3. subsampling_configurable logic: Swap to the subsampling_configurable logic we currently have in WNV pipelines.
    1. This approach is more complex for layered subsampling and may require further team discussions, as there are varying subsampling implementations noted here.
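
The difference between a total cap with and without grouping (approach 2) can be illustrated with a toy sketch. This is not augur's actual algorithm. It assumes that grouping is by year and that the loss of temporal signal came from densely sampled recent years crowding out sparse early ones, which is one plausible reading rather than something this PR states. The metadata is synthetic and all function names are hypothetical.

```python
import random
from collections import defaultdict


def per_group_cap(group_sizes, max_sequences):
    """Largest shared per-group cap whose capped total stays within max_sequences."""
    if not group_sizes:
        return 0
    cap = 0
    largest = max(group_sizes)
    while cap < largest and sum(min(s, cap + 1) for s in group_sizes) <= max_sequences:
        cap += 1
    return cap


def sample_grouped(records, max_sequences, key, rng):
    """Sample up to the shared cap from each group, so sparse groups are kept whole."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    cap = per_group_cap([len(g) for g in groups.values()], max_sequences)
    picked = []
    for members in groups.values():
        picked.extend(rng.sample(members, min(cap, len(members))))
    return picked


def sample_ungrouped(records, max_sequences, rng):
    """Draw max_sequences records uniformly at random, ignoring groups."""
    return rng.sample(records, min(max_sequences, len(records)))


# Synthetic metadata (assumed): 10 records per year for 2000-2014,
# 1000 records per year for 2015-2019.
records = [(year, i) for year in range(2000, 2015) for i in range(10)]
records += [(year, i) for year in range(2015, 2020) for i in range(1000)]

rng = random.Random(0)
grouped = sample_grouped(records, 400, key=lambda r: r[0], rng=rng)
ungrouped = sample_ungrouped(records, 400, rng)

early_grouped = sum(1 for year, _ in grouped if year < 2015)
early_ungrouped = sum(1 for year, _ in ungrouped if year < 2015)
print("grouped:", len(grouped), "total,", early_grouped, "before 2015")
print("ungrouped:", len(ungrouped), "total,", early_ungrouped, "before 2015")
```

In this toy setup the grouped sample keeps all 150 early records, because every early year is smaller than the shared cap. The ungrouped sample draws early records only in proportion to their share of the dataset, about 3% here, which narrows the date range a molecular clock estimate can draw on.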

Related issue(s)

Checklist

Update subsampling of the all trees to match the serotype trees. Local tests
show subsampling results similar to:

* genome: 3869 strains passed all filters
* E: 3931 strains passed all filters
@j23414 j23414 merged commit de4dd02 into main Jan 17, 2025
22 checks passed
@j23414 j23414 deleted the update-subsampling2 branch January 17, 2025 17:26