phylogenetic: revisit subsampling configuration #90

Open
j23414 opened this issue Jan 4, 2025 · 4 comments
Labels
enhancement New feature or request


@j23414
Contributor

j23414 commented Jan 4, 2025

Context

Revisit the subsampling configuration for the dengue filter commands. Currently there are only ~1500 genomes across all serotypes, when there should probably be ~4000. This is because the subsampling logic currently groups by year and region, with a fixed number of sequences per group:

group_by: "year region"
min_length:
  genome: 5000
  E: 1000
sequences_per_group:
  all: '10'
  denv1: '36'
  denv2: '36'
  denv3: '36'
  denv4: '36'

Instead of back-calculating the "all" entry under sequences_per_group to get ~4000 samples, it was suggested that we could set subsample_max_sequence to 4000.
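
To make the trade-off concrete, here is a minimal Python sketch. It is not augur's actual sampling algorithm: it assumes a per-group setting simply keeps at most that many sequences from each (year, region) group, and it shows what back-calculating the per-group value for a target total would involve. The records and the helper names (total_sampled, smallest_cap_for_target) are made up for illustration.

```python
from collections import Counter


def total_sampled(group_sizes, per_group):
    """Sequences kept if every group contributes at most `per_group`."""
    return sum(min(size, per_group) for size in group_sizes)


def smallest_cap_for_target(group_sizes, target):
    """Smallest per-group cap whose total reaches `target`, or None if the
    data set is too small to ever reach it."""
    if sum(group_sizes) < target:
        return None
    cap = 1
    while total_sampled(group_sizes, cap) < target:
        cap += 1
    return cap


# Hypothetical (year, region) labels, one per sequence; not real dengue metadata.
records = (
    [("2019", "Asia")] * 50
    + [("2019", "Americas")] * 8
    + [("2020", "Asia")] * 30
    + [("2020", "Africa")] * 3
)
group_sizes = list(Counter(records).values())

print(total_sampled(group_sizes, 10))            # 10 + 8 + 10 + 3 = 31
print(smallest_cap_for_target(group_sizes, 60))  # 25
```

The back-calculated value depends on how sequences are distributed across groups, so it has to be redone whenever the data change. A cap on the total avoids that bookkeeping, which is the appeal of the suggestion above.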

Since I'm revisiting this anyway, I'm considering whether we should also switch to the subsampling_configurable logic we currently have in the WNV pipelines. I believe we're planning to move toward that implementation in the future, along with some config validation work.

https://github.com/nextstrain/WNV/blob/9c7ddac3baf76157412d58c7c76a0cf4a347cb77/phylogenetic/defaults/config.yaml#L67-L69

The above text will probably change as I solidify my opinions on a direction. For now, I'm writing up a draft of my intentions here.


@j23414 j23414 added the enhancement New feature or request label Jan 4, 2025
@j23414 j23414 changed the title [draft] phylogenetic: revisit subsampling configuration phylogenetic: revisit subsampling configuration Jan 7, 2025
@j23414
Contributor Author

j23414 commented Jan 8, 2025

I'm slowly realizing that a straight up swap to subsampling configurable logic might not be straightforward when we are building multiple trees (e.g. genome has a different min-length filter than E gene). For example:

subsampling:
  geotemporal: --group-by year region --sequences-per-group 36 --min-length [genome length or E gene length]

I'm also a bit unclear on how we'd use the subsampling config if we want different layers of subsampling for the serotype trees (denv1-denv4) than for the all tree. I'm open to suggestions, just logging some thoughts.
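
One possible direction, sketched in Python under stated assumptions: keep a single argument template per subsampling scheme and fill in the build-specific values (min length per gene, sequences per group per serotype) when the filter command is assembled. The {placeholder} syntax, the CONFIG layout, and the filter_args helper are hypothetical and not taken from the dengue or WNV workflows.

```python
# Hypothetical config shape. The key names mirror the existing dengue filter
# config, but the "subsampling" template and its placeholders are assumptions.
CONFIG = {
    "min_length": {"genome": 5000, "E": 1000},
    "sequences_per_group": {
        "all": "10",
        "denv1": "36",
        "denv2": "36",
        "denv3": "36",
        "denv4": "36",
    },
    "subsampling": {
        "geotemporal": (
            "--group-by year region "
            "--sequences-per-group {sequences_per_group} "
            "--min-length {min_length}"
        ),
    },
}


def filter_args(config, scheme, serotype, gene):
    """Fill a subsampling template with the values for one serotype/gene build."""
    template = config["subsampling"][scheme]
    return template.format(
        sequences_per_group=config["sequences_per_group"][serotype],
        min_length=config["min_length"][gene],
    )


print(filter_args(CONFIG, "geotemporal", "denv1", "E"))
# --group-by year region --sequences-per-group 36 --min-length 1000
print(filter_args(CONFIG, "geotemporal", "all", "genome"))
# --group-by year region --sequences-per-group 10 --min-length 5000
```

In a Snakemake rule, a helper like this could be called from a params function with the serotype and gene wildcards. It handles the per-gene min length, but it does not by itself answer the question of layering different schemes for the serotype trees versus the all tree.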

@victorlin
Member

victorlin commented Jan 9, 2025

I think most of our multi-build phylogenetic workflows use the same subsampling config across trees/builds/datasets so this hasn't been much of an issue.

There are at least a few pathogen repos that have the same need, with varying implementation:

  1. ncov, in nextstrain_profiles/nextstrain-open/builds.yaml:

    builds:
      global_1m:
        subsampling_scheme: nextstrain_global_1m
      africa_1m:
        subsampling_scheme: nextstrain_region_1m
      asia_1m:
        subsampling_scheme: nextstrain_region_1m
    
    subsampling:
      nextstrain_global_1m: <subsampling config>
      nextstrain_region_1m: <subsampling config>

  2. seasonal-flu, in profiles/europe/builds.yaml:

    array-builds:
      WIC:
        subsampling: <subsampling config>
      europe:
        subsampling: <subsampling config>
      countries:
        subsampling: <subsampling config>

  3. rsv, in config/configfile.yaml:

    filter:
      resolutions:
        all-time: <subsampling config>
        6y: <subsampling config>
        3y: <subsampling config>

  4. mpox. This one is different because it produces multiple trees with different runs of the workflow instead of in the same run. The subsampling configs are defined separately in phylogenetic/defaults/(clade-i|hmpxv1|hmpxv1_big|mpxv)/config.yaml which each have:

    subsample: <subsampling config>

I'm not sure which is "better", but my sense is that all these repos have widely varying Snakemake logic even beyond subsampling.
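
As a rough illustration of the first pattern (builds pointing at named subsampling schemes), here is a Python sketch of the lookup a workflow might perform. The build names, scheme names, and the subsampling_for helper are hypothetical, and each scheme is simplified to a single string of augur filter arguments, which is not how ncov structures its schemes.

```python
# Hypothetical config: each build names a scheme, and schemes are defined once.
CONFIG = {
    "builds": {
        "all_genome": {"subsampling_scheme": "all_serotypes"},
        "denv1_genome": {"subsampling_scheme": "per_serotype"},
        "denv2_genome": {"subsampling_scheme": "per_serotype"},
    },
    "subsampling": {
        "all_serotypes": "--group-by year region --subsample-max-sequences 4000",
        "per_serotype": "--group-by year region --sequences-per-group 36",
    },
}


def subsampling_for(config, build):
    """Return the subsampling arguments for `build`, following its scheme name."""
    scheme = config["builds"][build]["subsampling_scheme"]
    try:
        return config["subsampling"][scheme]
    except KeyError:
        # Fail early with a clear message when a build references a scheme
        # that was never defined.
        raise KeyError(
            f"build {build!r} references unknown subsampling scheme {scheme!r}"
        ) from None


print(subsampling_for(CONFIG, "all_genome"))
# --group-by year region --subsample-max-sequences 4000
print(subsampling_for(CONFIG, "denv2_genome"))
# --group-by year region --sequences-per-group 36
```

The indirection lets several builds share one scheme while the all tree uses a different one, at the cost of an extra layer of names to validate.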
