phylogenetic: revisit subsampling configuration #90

j23414 · 2025-01-04T20:38:55Z

Context

Revisit the subsampling configuration for the dengue filter commands. Currently only ~1500 genomes are sampled from all serotypes, when there should probably be ~4000. This is because the subsampling logic currently groups by year and region:

dengue/phylogenetic/defaults/config_dengue.yaml

Lines 13 to 22 in 8b784e8

    
    group_by: "year region"
    min_length:
      genome: 5000
      E: 1000
    sequences_per_group:
      all: '10'
      denv1: '36'
      denv2: '36'
      denv3: '36'
      denv4: '36'

Instead of back-calculating the `all` value of sequences_per_group needed to reach ~4000 samples, it was suggested that we could set subsample_max_sequence to 4000.
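To make the difference between the two approaches concrete, here is a minimal Python sketch of the arithmetic. It assumes that grouping keeps at most sequences_per_group sequences from each (year, region) group. The group counts are made up for illustration, and `kept_total` and `smallest_per_group_reaching` are hypothetical helper names, not part of Augur.

```python
from collections import Counter


def kept_total(group_sizes, sequences_per_group):
    """Total kept when each group contributes at most `sequences_per_group`."""
    return sum(min(n, sequences_per_group) for n in group_sizes.values())


def smallest_per_group_reaching(group_sizes, target):
    """Back-calculate the smallest per-group cap whose total reaches `target`.

    Returns None if even keeping every sequence falls short of `target`.
    """
    if sum(group_sizes.values()) < target:
        return None
    cap = 1
    while kept_total(group_sizes, cap) < target:
        cap += 1
    return cap


# Made-up (year, region) -> number of available sequences.
group_sizes = Counter({
    (2019, "Asia"): 120,
    (2019, "South America"): 8,
    (2020, "Asia"): 45,
    (2020, "Africa"): 3,
})

# Small groups contribute fewer than the cap, so the total is data dependent.
total_at_10 = kept_total(group_sizes, 10)

# The per-group cap has to be re-derived whenever the data or the target changes.
cap_for_60 = smallest_per_group_reaching(group_sizes, 60)
```

Because sparse groups fall below the cap, the total is not simply the cap times the number of groups, which is why the back-calculation has to be repeated as data accumulates. Setting a maximum total directly should avoid that step, assuming the filter derives the per-group value itself.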

Since I'm revisiting this anyway, I'm considering whether we should also switch to the subsampling_configurable logic we currently have in the WNV pipelines. I believe we're planning to move toward that implementation in the future, along with some config validation work.

https://github.com/nextstrain/WNV/blob/9c7ddac3baf76157412d58c7c76a0cf4a347cb77/phylogenetic/defaults/config.yaml#L67-L69

The above text will probably change as I solidify my opinions on a direction; for now I'm writing up a draft of my intentions here.


j23414 · 2025-01-06T17:40:05Z

From a quick exploration, I propose that we keep subsampling across years; otherwise we lose the temporal sampling needed to infer a reasonable molecular clock rate.

j23414 · 2025-01-07T00:34:07Z

Adjusting the 'all' subsampling seems to produce better results.

j23414 · 2025-01-08T18:18:03Z

I'm slowly realizing that a straight up swap to subsampling configurable logic might not be straightforward when we are building multiple trees (e.g. genome has a different min-length filter than E gene). For example:

subsampling:
  geotemporal: --group-by year region --sequences-per-group 36 --min-length [genome length or E gene length]

I'm also a bit vague on how we'd use the subsampling config if we want different layers of subsampling for the serotype trees (denv1-denv4) than for the all tree. I'm open to suggestions, just logging some thoughts.
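One possible way to handle the per-build differences is to keep a single argument template and fill in the serotype- and gene-specific values when each build is resolved. The sketch below is only an illustration: `MIN_LENGTH`, `SEQUENCES_PER_GROUP`, and `filter_args` are hypothetical names, the values mirror the defaults quoted earlier in this issue, and this is not how the WNV pipeline implements it.

```python
# Hypothetical per-build settings; values mirror the defaults quoted in this issue.
MIN_LENGTH = {"genome": 5000, "E": 1000}
SEQUENCES_PER_GROUP = {"all": 10, "denv1": 36, "denv2": 36, "denv3": 36, "denv4": 36}


def filter_args(serotype, gene, group_by=("year", "region")):
    """Return the subsampling arguments for one (serotype, gene) build.

    Raises KeyError if the serotype or gene has no configured value.
    """
    return [
        "--group-by", *group_by,
        "--sequences-per-group", str(SEQUENCES_PER_GROUP[serotype]),
        "--min-length", str(MIN_LENGTH[gene]),
    ]


# Build the argument string for every serotype and gene combination.
args_by_build = {
    (serotype, gene): " ".join(filter_args(serotype, gene))
    for serotype in SEQUENCES_PER_GROUP
    for gene in MIN_LENGTH
}
```

With this layout the template is written once, and the genome versus E gene length filter and the all versus serotype sampling depth are looked up per build rather than hard-coded in the subsampling string.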

victorlin · 2025-01-09T05:13:01Z

I think most of our multi-build phylogenetic workflows use the same subsampling config across trees/builds/datasets so this hasn't been much of an issue.

There are at least a few pathogen repos that have the same need, with varying implementations:

ncov, in nextstrain_profiles/nextstrain-open/builds.yaml:

builds:
  global_1m:
    subsampling_scheme: nextstrain_global_1m
  africa_1m:
    subsampling_scheme: nextstrain_region_1m
  asia_1m:
    subsampling_scheme: nextstrain_region_1m

subsampling:
  nextstrain_global_1m: <subsampling config>
  nextstrain_region_1m: <subsampling config>

seasonal-flu, in profiles/europe/builds.yaml:

array-builds:
  WIC:
    subsampling: <subsampling config>
  europe:
    subsampling: <subsampling config>
  countries:
    subsampling: <subsampling config>

rsv, in config/configfile.yaml:

filter:
  resolutions:
    all-time: <subsampling config>
    6y: <subsampling config>
    3y: <subsampling config>

mpox. This one is different because it produces multiple trees with different runs of the workflow instead of in the same run. The subsampling configs are defined separately in phylogenetic/defaults/(clade-i|hmpxv1|hmpxv1_big|mpxv)/config.yaml which each have:
```
subsample: <subsampling config>
```

I'm not sure which is "better", but my sense is that all these repos have widely varying Snakemake logic even beyond subsampling.
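As a rough illustration of the ncov-style layout applied to dengue, the sketch below resolves a named subsampling scheme for each build. The build names, scheme names, and `subsampling_for` are all hypothetical, and the config is written as a Python dict standing in for the parsed YAML.

```python
# Hypothetical config, shaped like the ncov builds/subsampling layout above.
config = {
    "builds": {
        "all_genome": {"subsampling_scheme": "all_geotemporal"},
        "denv1_genome": {"subsampling_scheme": "serotype_geotemporal"},
        "denv2_genome": {"subsampling_scheme": "serotype_geotemporal"},
    },
    "subsampling": {
        "all_geotemporal": {"group_by": "year region", "sequences_per_group": 10},
        "serotype_geotemporal": {"group_by": "year region", "sequences_per_group": 36},
    },
}


def subsampling_for(build_name, config):
    """Look up the subsampling settings that a build refers to by scheme name."""
    scheme = config["builds"][build_name]["subsampling_scheme"]
    try:
        return config["subsampling"][scheme]
    except KeyError:
        raise KeyError(
            f"build {build_name!r} references unknown subsampling scheme {scheme!r}"
        ) from None


denv1_settings = subsampling_for("denv1_genome", config)
all_settings = subsampling_for("all_genome", config)
```

The indirection lets several builds share one scheme, at the cost of an extra lookup that needs validating; the seasonal-flu and rsv layouts instead embed the settings directly under each build.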

j23414 added the enhancement New feature or request label Jan 4, 2025

j23414 changed the title ~~[draft] phylogenetic: revisit subsampling configuration~~ phylogenetic: revisit subsampling configuration Jan 7, 2025
