Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phylogenetic: Clade I build failed during filter #283

Closed
joverlee521 opened this issue Oct 30, 2024 · 11 comments · Fixed by #284
Closed

phylogenetic: Clade I build failed during filter #283

joverlee521 opened this issue Oct 30, 2024 · 11 comments · Fixed by #284
Assignees
Labels
bug Something isn't working

Comments

@joverlee521
Copy link
Contributor

joverlee521 commented Oct 30, 2024

Since we've migrated ingest to Nextclade v3 in #281, the clade labels have been updated to include Ia and Ib.

This led to the automated build for Clade I to fail during the filter step

augur filter             --sequences data/sequences.fasta             --metadata data/metadata.tsv             --metadata-id-columns accession             --output-sequences results/clade-i/good_sequences.fasta             --output-metadata results/clade-i/good_metadata.tsv             --exclude defaults/exclude_accessions.txt             --exclude-where clade!=I             --min-date 1900             --min-length 100000             --query "(QC_rare_mutations == 'good' | QC_rare_mutations == 'mediocre')"             --output-log results/clade-i/good_filter.log
        
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
9051 strains were dropped during filtering
	110 were dropped because they were in defaults/exclude_accessions.txt
	8897 were dropped because of 'clade!=I'
	44 were dropped because they were shorter than the minimum length of 100000bp when only counting standard nucleotide characters A, C, G, or T (case-insensitive)
ERROR: All samples have been dropped! Check filter rules and metadata file format.
@joverlee521 joverlee521 added the bug Something isn't working label Oct 30, 2024
@joverlee521 joverlee521 self-assigned this Oct 30, 2024
@joverlee521
Copy link
Contributor Author

joverlee521 commented Oct 30, 2024

Hrm, I thought this was simple fix of just updating the config with:

diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..dfda6a8 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,7 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
+  exclude_where: 'clade!=I clade!=Ia clade!=Ib'

However, this doesn't work because of how the augur filter handles multiple values for the --exclude-where option. From augur filter docs:

Exclude samples matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND

@victorlin
Copy link
Member

What about --query 'clade in {"I", "Ia", "Ib"}'?

@joverlee521
Copy link
Contributor Author

Without making changes to the workflow itself, there are two paths forward:

  1. Continue to use the exclude_where config param, but update it to specifically exclude other clades.
diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..4103fba 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,7 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
+  exclude_where: 'clade=II clade=IIa clade=IIb clade=outgroup clade=""'
  1. Use --query in the subsampling other_filters config param:
diff --git a/phylogenetic/defaults/clade-i/config.yaml b/phylogenetic/defaults/clade-i/config.yaml
index e8ad850..f40e848 100644
--- a/phylogenetic/defaults/clade-i/config.yaml
+++ b/phylogenetic/defaults/clade-i/config.yaml
@@ -19,7 +19,6 @@ auspice_name: "mpox_clade-I"
 filter:
   min_date: 1900
   min_length: 100000
-  exclude_where: 'clade!=I'
 
 
 ### We don't want to subsample, so specify a config which is essentially a no-op
@@ -27,6 +26,7 @@ subsample:
   everything:
     group_by: ""
     sequences_per_group: ""
+    other_filters: "--query \"(clade == 'I' | clade == 'Ia' | clade == 'Ib')\""
 
 ## align
 max_indel: 10000

Option (1) is less change but probably more maintenance since it will need to be updated if there are new clades that need to be excluded. Option (2) is safer in that it will always only include clade I and will only need to be updated if there new children clades of clade I.

@victorlin
Copy link
Member

+1 for Option (2): it's more logically direct for the Clade I build to "include Clade I and subclades", not "exclude all other subclades".

@victorlin
Copy link
Member

Without making changes to the workflow itself

Why not update the workflow to accept a filter.query config?

@joverlee521
Copy link
Contributor Author

Why not update the workflow to accept a filter.query config?

I just noticed that exclude_where was specifically added for clade I builds, so I think the intent was always to do this type of metadata specific filtering in subsample.

@victorlin
Copy link
Member

Oh sorry, I meant to suggest subsample.<sample>.query instead of subsample.<sample>.other_filters: --query …

@joverlee521
Copy link
Contributor Author

Oh sorry, I meant to suggest subsample.<sample>.query instead of subsample.<sample>.other_filters: --query …

Ah, gotcha. Yeah, can do!

joverlee521 added a commit that referenced this issue Oct 31, 2024
Use `--query` to filter to Clade I and children clades since we have
migrated to Nextclade v3 that includes clades I, Ia, and Ib.

Resolves #283
@joverlee521 joverlee521 mentioned this issue Oct 31, 2024
2 tasks
@jameshadfield
Copy link
Member

Since we've migrated ingest to Nextclade v3 in #281, the clade labels have been updated to include Ia and Ib.

For my understanding, this wasn't a nextclade v2 → v3 thing, it was a result of #281 changing the nextclade dataset from "hMPXV" to "MPXV", right? (It doesn't appear that any mpox nextclade datasets have been updated in the last few months, which was my other thought.)

@joverlee521
Copy link
Contributor Author

For my understanding, this wasn't a nextclade v2 → v3 thing, it was a result of #281 changing the nextclade dataset from "hMPXV" to "MPXV", right?

When using Nextclade v2 to download datasets, it only downloads the v2 datasets so it was using the MPXV/ancestral dataset from 2023-08-01. After switching to Nextclade v3, it now uses the latest dataset that includes the new Ia/Ib clades.

@jameshadfield
Copy link
Member

Wow - that's a gotcha right there. Thanks for clarifying!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants