phylogenetic: Clade I build failed during filter #283

joverlee521 · 2024-10-30T17:56:17Z

Since we've migrated ingest to Nextclade v3 in #281, the clade labels have been updated to include Ia and Ib.

This led to the automated build for Clade I to fail during the filter step

augur filter             --sequences data/sequences.fasta             --metadata data/metadata.tsv             --metadata-id-columns accession             --output-sequences results/clade-i/good_sequences.fasta             --output-metadata results/clade-i/good_metadata.tsv             --exclude defaults/exclude_accessions.txt             --exclude-where clade!=I             --min-date 1900             --min-length 100000             --query "(QC_rare_mutations == 'good' | QC_rare_mutations == 'mediocre')"             --output-log results/clade-i/good_filter.log
        
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
9051 strains were dropped during filtering
	110 were dropped because they were in defaults/exclude_accessions.txt
	8897 were dropped because of 'clade!=I'
	44 were dropped because they were shorter than the minimum length of 100000bp when only counting standard nucleotide characters A, C, G, or T (case-insensitive)
ERROR: All samples have been dropped! Check filter rules and metadata file format.

The text was updated successfully, but these errors were encountered:

joverlee521 · 2024-10-30T19:44:10Z

Hrm, I thought this was simple fix of just updating the config with:

diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..dfda6a8 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,7 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
+  exclude_where: 'clade!=I clade!=Ia clade!=Ib'

However, this doesn't work because of how the augur filter handles multiple values for the --exclude-where option. From augur filter docs:

Exclude samples matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND

victorlin · 2024-10-30T20:23:40Z

What about --query 'clade in {"I", "Ia", "Ib"}'?

joverlee521 · 2024-10-30T20:28:22Z

Without making changes to the workflow itself, there are two paths forward:

Continue to use the exclude_where config param, but update it to specifically exclude other clades.

diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..4103fba 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,7 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
+  exclude_where: 'clade=II clade=IIa clade=IIb clade=outgroup clade=""'

Use --query in the subsampling other_filters config param:

diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..f40e848 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,6 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
 
 
 ### We don't want to subsample, so specify a config which is essentially a no-op
@@ -27,6 +26,7 @@ subsample:
   everything:
     group_by: ""
     sequences_per_group: ""
+    other_filters: "--query \"(clade == 'I' | clade == 'Ia' | clade == 'Ib')\""
 
 ## align
 max_indel: 10000

Option (1) is less change but probably more maintenance since it will need to be updated if there are new clades that need to be excluded. Option (2) is safer in that it will always only include clade I and will only need to be updated if there new children clades of clade I.

victorlin · 2024-10-30T20:50:52Z

+1 for Option (2): it's more logically direct for the Clade I build to "include Clade I and subclades", not "exclude all other subclades".

victorlin · 2024-10-30T20:53:45Z

Without making changes to the workflow itself

Why not update the workflow to accept a filter.query config?

joverlee521 · 2024-10-30T21:10:27Z

Why not update the workflow to accept a filter.query config?

I just noticed that exclude_where was specifically added for clade I builds, so I think the intent was always to do this type of metadata specific filtering in subsample.

victorlin · 2024-10-30T21:13:49Z

Oh sorry, I meant to suggest subsample.<sample>.query instead of subsample.<sample>.other_filters: --query …

joverlee521 · 2024-10-31T16:19:38Z

Oh sorry, I meant to suggest subsample.<sample>.query instead of subsample.<sample>.other_filters: --query …

Ah, gotcha. Yeah, can do!

Use `--query` to filter to Clade I and children clades since we have migrated to Nextclade v3 that includes clades I, Ia, and Ib. Resolves #283

jameshadfield · 2024-11-04T04:30:38Z

Since we've migrated ingest to Nextclade v3 in #281, the clade labels have been updated to include Ia and Ib.

For my understanding, this wasn't a nextclade v2 → v3 thing, it was a result of #281 changing the nextclade dataset from "hMPXV" to "MPXV", right? (It doesn't appear that any mpox nextclade datasets have been updated in the last few months, which was my other thought.)

joverlee521 · 2024-11-04T17:56:43Z

For my understanding, this wasn't a nextclade v2 → v3 thing, it was a result of #281 changing the nextclade dataset from "hMPXV" to "MPXV", right?

When using Nextclade v2 to download datasets, it only downloads the v2 datasets so it was using the MPXV/ancestral dataset from 2023-08-01. After switching to Nextclade v3, it now uses the latest dataset that includes the new Ia/Ib clades.

jameshadfield · 2024-11-04T18:01:34Z

Wow - that's a gotcha right there. Thanks for clarifying!

joverlee521 added the bug Something isn't working label Oct 30, 2024

joverlee521 self-assigned this Oct 30, 2024

joverlee521 added a commit that referenced this issue Oct 31, 2024

phylogenetic: Update clade-i/config.yaml

6c12b75

Use `--query` to filter to Clade I and children clades since we have migrated to Nextclade v3 that includes clades I, Ia, and Ib. Resolves #283

joverlee521 mentioned this issue Oct 31, 2024

Fix Clade I build #284

Merged

2 tasks

joverlee521 closed this as completed in #284 Oct 31, 2024

jameshadfield mentioned this issue Nov 4, 2024

Add INRB-specific build pipeline #287

Merged

victorlin mentioned this issue Nov 18, 2024

Small tweaks to Nextclade builds #291

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

phylogenetic: Clade I build failed during filter #283

phylogenetic: Clade I build failed during filter #283

joverlee521 commented Oct 30, 2024 •

edited

Loading

joverlee521 commented Oct 30, 2024 •

edited

Loading

victorlin commented Oct 30, 2024

joverlee521 commented Oct 30, 2024

victorlin commented Oct 30, 2024

victorlin commented Oct 30, 2024

joverlee521 commented Oct 30, 2024

victorlin commented Oct 30, 2024

joverlee521 commented Oct 31, 2024

jameshadfield commented Nov 4, 2024

joverlee521 commented Nov 4, 2024

jameshadfield commented Nov 4, 2024

phylogenetic: Clade I build failed during filter #283

phylogenetic: Clade I build failed during filter #283

Comments

joverlee521 commented Oct 30, 2024 • edited Loading

joverlee521 commented Oct 30, 2024 • edited Loading

victorlin commented Oct 30, 2024

joverlee521 commented Oct 30, 2024

victorlin commented Oct 30, 2024

victorlin commented Oct 30, 2024

joverlee521 commented Oct 30, 2024

victorlin commented Oct 30, 2024

joverlee521 commented Oct 31, 2024

jameshadfield commented Nov 4, 2024

joverlee521 commented Nov 4, 2024

jameshadfield commented Nov 4, 2024

joverlee521 commented Oct 30, 2024 •

edited

Loading

joverlee521 commented Oct 30, 2024 •

edited

Loading