Phylogenetic: Increase sampling in the `all` tree to nearly 4K tips #91

j23414 · 2025-01-13T22:58:32Z

Description of proposed changes

Increased subsampling of the `all` trees so that their tip counts align more closely with the serotype-specific trees (approximately 4K tips). After exploring a few implementation approaches (listed below), the method in this PR was selected as the most straightforward and effective for now.

Local tests show the following subsampling results:

genome: 3869 strains passed all filters
E: 3931 strains passed all filters

A few implementation approaches attempted:

[Edit: Added later, sorry I should have written more detail. Questions or suggestions are still welcome]

1. sequences_per_group: Increase the `all` sequences_per_group parameter to get approximately 4k samples.
   1. ~~This approach is implemented in this PR.~~ --> Switched to approach 2
2. subsample_max_sequences: Simplify the subsampling logic by setting subsample_max_sequences to ~4k samples for both the `all` and serotype subsampling.
   1. ~~This approach resulted in a loss of temporal signal, making it difficult to infer a reasonable molecular clock rate.~~
   2. ~~See the x-axis for the tangletree on the left as compared to the original on the right.~~
   3. [Edit: Using both --subsample-max-sequences and --group-by fixed the problem]
3. subsampling_configurable logic: Swap to the subsampling_configurable logic we currently have in WNV pipelines.
   1. This approach is more complex for layered subsampling and may require further team discussions, as there are varying subsampling implementations noted here.
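
To illustrate why combining a maximum sequence count with grouping can matter, here is a toy sketch. It is a simplified illustration and not Augur's actual implementation: the function name `subsample_by_group`, the column names, and the records are all hypothetical. The idea it shows is that a per-group cap spreads the picks across groups (for example, across years), whereas an ungrouped random draw of the same size would mostly mirror whichever years dominate the dataset.

```python
import random
from collections import defaultdict


def subsample_by_group(records, group_by, max_sequences, seed=0):
    """Pick at most max_sequences records, spread evenly across groups.

    Hypothetical helper for illustration only; not Augur's algorithm.
    """
    rng = random.Random(seed)

    # Bucket records by the values of the grouping columns.
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[col] for col in group_by)
        groups[key].append(rec)

    # Find the largest per-group cap whose total stays within max_sequences.
    per_group = 0
    largest = max((len(g) for g in groups.values()), default=0)
    while per_group < largest:
        total = sum(min(len(g), per_group + 1) for g in groups.values())
        if total > max_sequences:
            break
        per_group += 1

    # Draw up to per_group records from each group.
    # If there are more groups than max_sequences, per_group stays 0
    # and this sketch simply returns an empty list.
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(len(members), per_group)))
    return sampled


# Toy metadata skewed heavily toward a recent year (made-up values).
records = (
    [{"strain": f"old{i}", "year": 1990, "region": "Asia"} for i in range(5)]
    + [{"strain": f"new{i}", "year": 2023, "region": "Asia"} for i in range(95)]
)
picked = subsample_by_group(records, ["year", "region"], max_sequences=10)
old_count = sum(r["year"] == 1990 for r in picked)
print(len(picked), old_count)
```

In this toy dataset, all five 1990 records are retained alongside five from 2023, so the older time points are not crowded out by the heavily sampled recent year.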

Related issue(s)

phylogenetic: revisit subsampling configuration #90

Checklist

Checks pass
Ran the phylogenetic workflow, confirming that the all tree contains nearly 4k tips:
- https://next.nextstrain.org/staging/dengue/trials/subsampling/dengue/all/genome
- https://next.nextstrain.org/staging/dengue/trials/subsampling/dengue/all/E
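
The tip count can also be checked directly from a built dataset file. Below is a minimal sketch, assuming an Auspice v2 dataset JSON whose top-level `tree` key holds nested nodes with optional `children` lists. The helper names and any file path passed to them are hypothetical and not part of this PR.

```python
import json


def count_tips(root):
    """Count leaf nodes (nodes without children) in a nested tree.

    Iterative rather than recursive so deep trees do not hit
    Python's recursion limit.
    """
    # The tree may be a single root node or a list of root nodes.
    stack = list(root) if isinstance(root, list) else [root]
    tips = 0
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            tips += 1
    return tips


def count_tips_in_file(path):
    """Load a dataset JSON and count tips under its "tree" key (assumed layout)."""
    with open(path) as handle:
        dataset = json.load(handle)
    return count_tips(dataset["tree"])


# Toy tree with three tips: A, B, and C.
toy = {
    "name": "root",
    "children": [
        {"name": "A"},
        {"name": "internal", "children": [{"name": "B"}, {"name": "C"}]},
    ],
}
toy_tips = count_tips(toy)
print(toy_tips)
```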

phylogenetic/defaults/config_dengue.yaml

Phylogenetic: Update subsampling

05b014e

Update subsampling of the all trees to match the serotype trees. Local tests show subsampling results similar to:

* genome: 3869 strains passed all filters
* E: 3931 strains passed all filters

victorlin reviewed Jan 14, 2025

phylogenetic/defaults/config_dengue.yaml (outdated, resolved)

Phylogenetic: Use group-by with subsample-max-sequences

0d73b74

j23414 merged commit de4dd02 into main Jan 17, 2025
22 checks passed

j23414 deleted the update-subsampling2 branch January 17, 2025 17:26
