Phylogenetic: Increase sampling in the `all` tree to nearly 4K tips #91
Description of proposed changes
Increased the subsampling of the `all` trees to align more closely with the number of tips in the serotype-specific trees (approximately 4K tips). After exploring a few implementation approaches, this method was selected as the most straightforward and effective for now.

Local tests show the following subsampling results:
A few implementation approaches attempted:

[Edit: Added later. Sorry, I should have written more detail. Questions or suggestions are still welcome.]

1. Increase the `all` `sequences_per_group` parameter to get approximately 4k samples. This approach is implemented in this PR. --> Switched to approach 2
2. Set `subsample_max_sequences` to ~4k samples for `all` and `serotype` subsampling. This approach resulted in a loss of temporal signal, making it difficult to infer a reasonable molecular clock rate. See the x-axis for the tangletree on the left as compared to the original on the right. [Edit: using `--subsample-max-sequences` and `--group-by` together fixed the problem]

Related issue(s)
Checklist
`all` tree contains nearly 4k tips: