filter: Error on sequence duplicates #1613

victorlin · 2024-08-28T23:21:44Z

Description of proposed changes

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.
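For readers unfamiliar with the check, here is a minimal sketch of what erroring on duplicate sequence IDs could look like. It is not the code from this PR: the helper names `read_fasta_ids` and `find_duplicate_ids` are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token on each FASTA header line.

```python
from collections import Counter
import io


def read_fasta_ids(handle):
    """Yield the ID of each FASTA record (assumed: first token after '>')."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def find_duplicate_ids(handle):
    """Return a sorted list of IDs that appear more than once in the input."""
    counts = Counter(read_fasta_ids(handle))
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative input only; the strain names are made up.
example = io.StringIO(
    ">strain_A\nACGT\n"
    ">strain_B\nACGA\n"
    ">strain_A\nACGT\n"
)
duplicates = find_duplicate_ids(example)
if duplicates:
    print("ERROR: The following strains are duplicated:")
    for name in duplicates:
        print(name)
```

Only header lines are inspected, so the check does not need to hold any sequence data in memory.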

Reasoning for this change is here: #810 (comment)

Related issue(s)

Closes #1602

Checklist

Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

codecov · 2024-08-28T23:48:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.14%. Comparing base (db54927) to head (da9301c).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1613      +/-   ##
==========================================
+ Coverage   71.06%   71.14%   +0.07%     
==========================================
  Files          79       79              
  Lines        8268     8273       +5     
  Branches     2010     2011       +1     
==========================================
+ Hits         5876     5886      +10     
+ Misses       2101     2099       -2     
+ Partials      291      288       -3

CHANGES.md

joverlee521

Hoping we don't run into downstream errors in pathogen repos since we are generally using the GenBank accession as the sequence id. 🤞 (pathogen-repo-ci wouldn't catch these errors since those jobs use example data that do not have duplicates)

augur/filter/_run.py

jameshadfield

Looks good by inspection and will be a very helpful change

victorlin · 2024-09-10T21:34:30Z

I will plan to merge this once another breaking change from DEPRECATED.md is also ready to be merged, so they can be released in the same major version.

corneliusroemer · 2024-10-04T11:42:53Z

Update

I found the reasoning for this change in #810 (comment), though one needs to follow quite a few links to end up there. In the future it would be nice to add this to the PR description; it would have saved me quite some time.

It might also have been nice to point out in the changelog that, to migrate, one should use augur merge to deduplicate. I feel the breaking change was maybe introduced a little too quickly here. I might have preferred to issue a (deprecation) warning for a while when duplicates are found, which would have allowed us to find out how many workflows would be affected.

Original comment

Hmm, I'm just seeing this result in a CI failure in ncov. I'm not sure this is such a great change. It makes filter less usable as an entry point. Why do we need to require deduplication before filter?

The reasoning is given just as:

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

Who expects it to be de-duplicated?

Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows? Maybe we should only error if the sequences are different but happen to have the same name? If they are identical there's no reason to error, really?
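The second suggestion above (error only when same-named records differ in content) could be sketched roughly as follows. This is an illustration of the idea, not anything implemented in augur: `parse_fasta` and `find_conflicting_duplicates` are hypothetical helpers, and comparing exact SHA-256 digests of the sequence strings is an assumption about what "identical" should mean.

```python
import hashlib


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def find_conflicting_duplicates(records):
    """Return sorted names that occur more than once with differing sequences.

    Exact repeats (same name, same sequence) are tolerated.
    """
    first_digest = {}
    conflicting = set()
    for name, sequence in records:
        # Store a digest rather than the full sequence to keep memory small.
        digest = hashlib.sha256(sequence.encode()).digest()
        if first_digest.setdefault(name, digest) != digest:
            conflicting.add(name)
    return sorted(conflicting)
```

Under this sketch, a file that repeats the same record verbatim would pass, while two different sequences sharing one name would still be reported.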

https://github.com/nextstrain/ncov/actions/runs/11164069511/job/31032539057#step:5:12039

[batch] [2024-10-03T16:24:15+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:15+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:15+00:00] ================================================================================
[batch] [2024-10-03T16:24:39+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:39+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:40+00:00] [Thu Oct  3 16:24:39 2024]
[batch] [2024-10-03T16:24:40+00:00] Error in rule combine_samples:
[batch] [2024-10-03T16:24:40+00:00]     jobid: 863
[batch] [2024-10-03T16:24:40+00:00]     input: results/gisaid_21L_aligned.fasta.zst, results/gisaid_21L_metadata.tsv.zst, results/oceania_6m/sample-focal_early.txt, results/oceania_6m/sample-context_early.txt, results/oceania_6m/sample-focal_recent.txt, results/oceania_6m/sample-context_recent.txt
[batch] [2024-10-03T16:24:40+00:00]     output: results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz, results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz
[batch] [2024-10-03T16:24:40+00:00]     log: logs/subsample_regions_oceania_6m.txt (check log file(s) for error details)
[batch] [2024-10-03T16:24:40+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-10-03T16:24:40+00:00]     shell:
[batch] [2024-10-03T16:24:40+00:00]         
[batch] [2024-10-03T16:24:40+00:00]         augur filter \
[batch] [2024-10-03T16:24:40+00:00]             --sequences results/gisaid_21L_aligned.fasta.zst \
[batch] [2024-10-03T16:24:40+00:00]             --metadata results/gisaid_21L_metadata.tsv.zst \
[batch] [2024-10-03T16:24:40+00:00]             --exclude-all \
[batch] [2024-10-03T16:24:40+00:00]             --include results/oceania_6m/sample-focal_early.txt results/oceania_6m/sample-context_early.txt results/oceania_6m/sample-focal_recent.txt results/oceania_6m/sample-context_recent.txt \
[batch] [2024-10-03T16:24:40+00:00]             --output-sequences results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz \
[batch] [2024-10-03T16:24:40+00:00]             --output-metadata results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_oceania_6m.txt
[batch] [2024-10-03T16:24:40+00:00]         
[batch] [2024-10-03T16:24:40+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-10-03T16:24:40+00:00] Logfile logs/subsample_regions_oceania_6m.txt:
[batch] [2024-10-03T16:24:40+00:00] ================================================================================
[batch] [2024-10-03T16:24:40+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:40+00:00] Wuhan/Hu-1/2019

victorlin · 2024-10-04T18:21:56Z

@corneliusroemer thanks for your thoughts here and apologies for making an abrupt and unclear change. This is a good case of a developer (me) being out of touch with actual usage. If I were to do this again, I would treat the previous behavior (allowing sequence duplicates) as a "feature" with deprecation warning instead of a bug that needed to be fixed immediately.

Why do we need to require deduplication before filter?

The reasoning is clearer for metadata input: it is indexed on the ID column, which must be unique. It's less clear for sequences, but consistency between metadata and sequence output would be my reasoning.

Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows?

It's doable, though it would be specific to sequences: something along the lines of --allow-duplicate-sequences. If the sole purpose is backwards compatibility, we could introduce it temporarily with a deprecation warning?
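A rough sketch of how such a temporary opt-out might behave, for discussion only. The flag name comes from the suggestion above and does not exist in augur; `DuplicateSequenceError`, `build_parser`, and `handle_duplicates` are hypothetical names, and the warning text is made up.

```python
import argparse
import sys


class DuplicateSequenceError(Exception):
    """Raised when duplicate sequence IDs are found and not explicitly allowed."""


def build_parser():
    parser = argparse.ArgumentParser(prog="filter-sketch")
    # Hypothetical flag, as floated in this thread; not an existing option.
    parser.add_argument(
        "--allow-duplicate-sequences",
        action="store_true",
        help="(deprecated) warn instead of erroring on duplicate sequence IDs",
    )
    return parser


def handle_duplicates(duplicates, allow_duplicates):
    """Error on duplicates by default; downgrade to a warning when opted out."""
    if not duplicates:
        return
    listing = "\n".join(sorted(duplicates))
    if allow_duplicates:
        print(
            "WARNING: duplicate sequence IDs are deprecated and will become "
            f"an error in a future release:\n{listing}",
            file=sys.stderr,
        )
        return
    raise DuplicateSequenceError(
        f"The following strains are duplicated:\n{listing}"
    )
```

Usage would be along the lines of `args = build_parser().parse_args()` followed by `handle_duplicates(found, args.allow_duplicate_sequences)`, where `found` is whatever set of duplicated IDs the caller has collected.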

victorlin self-assigned this Aug 28, 2024

victorlin marked this pull request as ready for review August 28, 2024 23:35

tsibley reviewed Aug 29, 2024

CHANGES.md (outdated, resolved)

victorlin added the breaking Makes a backwards incompatible change and should wait for major release label Aug 29, 2024

victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch from e9e31f5 to ab967fd Compare August 29, 2024 17:54

joverlee521 approved these changes Aug 29, 2024

augur/filter/_run.py (outdated, resolved)

victorlin mentioned this pull request Aug 29, 2024

merge: Support sequences #1579

Open

victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch 2 times, most recently from 0f4cf31 to 1a7d4d6 Compare August 29, 2024 22:42

genehack reviewed Sep 3, 2024

augur/filter/_run.py (resolved)

jameshadfield approved these changes Sep 10, 2024

CHANGES.md (outdated, resolved)

victorlin added 4 commits September 10, 2024 14:19

Reorganize sequence handling logic

45d68bd

Prepping for duplicate handling which should be done regardless of args.output (sequence output).

Error on sequence duplicates

6421433

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

Add test on duplicates in sequence input

6e2124f

This is similar enough to the existing test for metadata that I've repurposed that file to test duplicates in both input types.

Update changelog

da9301c

victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch from 1a7d4d6 to da9301c Compare September 10, 2024 21:32

victorlin mentioned this pull request Sep 16, 2024

merge: Omit generated source columns by default #1632

Merged

4 tasks

tsibley merged commit d672de0 into master Sep 17, 2024
28 checks passed

tsibley deleted the victorlin/filter-error-on-sequence-duplicates branch September 17, 2024 17:17

corneliusroemer mentioned this pull request Oct 4, 2024

Augur 26 breaks 21L workflow due to duplicate sequences erroring in augur filter nextstrain/ncov#1155

Closed
