filter: Error on sequence duplicates #1613

Merged
merged 4 commits into from
Sep 17, 2024

Conversation

victorlin
Member

@victorlin victorlin commented Aug 28, 2024

Description of proposed changes

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.
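As a rough illustration of the kind of check described above, here is a minimal sketch that collects sequence IDs from FASTA headers and reports any that occur more than once. This is not the code from this PR: the function names `read_fasta_ids` and `find_duplicates` are hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter
from typing import Iterable, List


def read_fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the ID of every FASTA record, in file order.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def find_duplicates(ids: Iterable[str]) -> List[str]:
    """Return the sorted list of IDs that appear more than once."""
    counts = Counter(ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Small in-memory example; a real file would be read line by line instead.
example = """>Wuhan/Hu-1/2019
ACGT
>strainB
ACGA
>Wuhan/Hu-1/2019
ACGT
""".splitlines()

duplicates = find_duplicates(read_fasta_ids(example))
if duplicates:
    print("ERROR: The following strains are duplicated:")
    for name in duplicates:
        print(name)
```

Only the header lines are inspected, so a check of this shape does not need to hold sequence data in memory.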

Reasoning for this change is here: #810 (comment)

Related issue(s)

Closes #1602

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@victorlin victorlin self-assigned this Aug 28, 2024
@victorlin victorlin marked this pull request as ready for review August 28, 2024 23:35

codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.14%. Comparing base (db54927) to head (da9301c).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1613      +/-   ##
==========================================
+ Coverage   71.06%   71.14%   +0.07%     
==========================================
  Files          79       79              
  Lines        8268     8273       +5     
  Branches     2010     2011       +1     
==========================================
+ Hits         5876     5886      +10     
+ Misses       2101     2099       -2     
+ Partials      291      288       -3     

CHANGES.md (outdated, resolved)
@victorlin victorlin added the breaking Makes a backwards incompatible change and should wait for major release label Aug 29, 2024
@victorlin victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch from e9e31f5 to ab967fd Compare August 29, 2024 17:54
Contributor

@joverlee521 joverlee521 left a comment

Hoping we don't run into downstream errors in pathogen repos since we are generally using the GenBank accession as the sequence id. 🤞 (pathogen-repo-ci wouldn't catch these errors since those jobs use example data that do not have duplicates)

augur/filter/_run.py (outdated, resolved)
@victorlin victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch 2 times, most recently from 0f4cf31 to 1a7d4d6 Compare August 29, 2024 22:42
Member

@jameshadfield jameshadfield left a comment

Looks good by inspection and will be a very helpful change

CHANGES.md (outdated, resolved)
Prepping for duplicate handling which should be done regardless of args.output (sequence output).

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

This is similar enough to the existing test for metadata that I've repurposed that file to test duplicates in both input types.
@victorlin victorlin force-pushed the victorlin/filter-error-on-sequence-duplicates branch from 1a7d4d6 to da9301c Compare September 10, 2024 21:32
@victorlin
Member Author

I plan to merge this once another breaking change from DEPRECATED.md is also ready to be merged, so they can be released in the same major version.

@tsibley tsibley merged commit d672de0 into master Sep 17, 2024
28 checks passed
@tsibley tsibley deleted the victorlin/filter-error-on-sequence-duplicates branch September 17, 2024 17:17
@corneliusroemer
Member

Update

I found the reasoning for this change in #810 (comment), though one needs to follow quite a few links to eventually end up there. In the future it might be nice to add this to the PR description; it would have saved me quite some time.

It might also have been nice to point out in the changelog that, to migrate, one should use augur merge to deduplicate. I feel the breaking change was maybe introduced a little too quickly here. I might have preferred to issue a (deprecation) warning for a while when duplicates are found, which would have allowed us to find out how many workflows would be affected.

Original comment

Hmm, just seeing this result in CI failure in ncov - not sure this is such a great change? It makes filter less usable as an entry point. Why do we need to require deduplication before filter?

The reasoning is given just as:

Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.

Who expects it to be de-duplicated?

Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows? Maybe we should only error if the sequences are different but happen to have the same name? If they are identical there's no reason to error, really?
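The relaxed rule suggested above (error only when same-named records differ) could be sketched roughly as follows. This is an illustration under assumptions, not augur's behavior: `parse_fasta` and `conflicting_duplicates` are hypothetical names, sequences are compared as exact strings (so case or gap differences count as different), and the whole input is read into memory.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA text into (id, sequence) pairs, joining wrapped lines."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def conflicting_duplicates(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return IDs that appear more than once with differing sequences."""
    first_seen: Dict[str, str] = {}
    conflicts = set()
    for name, seq in records:
        if name in first_seen and first_seen[name] != seq:
            conflicts.add(name)
        first_seen.setdefault(name, seq)
    return sorted(conflicts)


example = """>Wuhan/Hu-1/2019
ACGT
>strainB
ACGA
>Wuhan/Hu-1/2019
ACGT
>strainB
ACGG
""".splitlines()

# Wuhan/Hu-1/2019 is duplicated but identical; strainB is duplicated and differs.
conflicts = conflicting_duplicates(parse_fasta(example))
```

One trade-off of this variant is that it has to compare sequence content rather than only headers, which is more expensive for large inputs than a header-only ID check.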

https://github.com/nextstrain/ncov/actions/runs/11164069511/job/31032539057#step:5:12039

[batch] [2024-10-03T16:24:15+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:15+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:15+00:00] ================================================================================
[batch] [2024-10-03T16:24:39+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:39+00:00] Wuhan/Hu-1/2019
[batch] [2024-10-03T16:24:40+00:00] [Thu Oct  3 16:24:39 2024]
[batch] [2024-10-03T16:24:40+00:00] Error in rule combine_samples:
[batch] [2024-10-03T16:24:40+00:00]     jobid: 863
[batch] [2024-10-03T16:24:40+00:00]     input: results/gisaid_21L_aligned.fasta.zst, results/gisaid_21L_metadata.tsv.zst, results/oceania_6m/sample-focal_early.txt, results/oceania_6m/sample-context_early.txt, results/oceania_6m/sample-focal_recent.txt, results/oceania_6m/sample-context_recent.txt
[batch] [2024-10-03T16:24:40+00:00]     output: results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz, results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz
[batch] [2024-10-03T16:24:40+00:00]     log: logs/subsample_regions_oceania_6m.txt (check log file(s) for error details)
[batch] [2024-10-03T16:24:40+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-10-03T16:24:40+00:00]     shell:
[batch] [2024-10-03T16:24:40+00:00]         
[batch] [2024-10-03T16:24:40+00:00]         augur filter \
[batch] [2024-10-03T16:24:40+00:00]             --sequences results/gisaid_21L_aligned.fasta.zst \
[batch] [2024-10-03T16:24:40+00:00]             --metadata results/gisaid_21L_metadata.tsv.zst \
[batch] [2024-10-03T16:24:40+00:00]             --exclude-all \
[batch] [2024-10-03T16:24:40+00:00]             --include results/oceania_6m/sample-focal_early.txt results/oceania_6m/sample-context_early.txt results/oceania_6m/sample-focal_recent.txt results/oceania_6m/sample-context_recent.txt \
[batch] [2024-10-03T16:24:40+00:00]             --output-sequences results/oceania_6m/oceania_6m_subsampled_sequences.fasta.xz \
[batch] [2024-10-03T16:24:40+00:00]             --output-metadata results/oceania_6m/oceania_6m_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_oceania_6m.txt
[batch] [2024-10-03T16:24:40+00:00]         
[batch] [2024-10-03T16:24:40+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-10-03T16:24:40+00:00] Logfile logs/subsample_regions_oceania_6m.txt:
[batch] [2024-10-03T16:24:40+00:00] ================================================================================
[batch] [2024-10-03T16:24:40+00:00] ERROR: The following strains are duplicated in 'results/gisaid_21L_aligned.fasta.zst':
[batch] [2024-10-03T16:24:40+00:00] Wuhan/Hu-1/2019

@victorlin
Member Author

@corneliusroemer thanks for your thoughts here and apologies for making an abrupt and unclear change. This is a good case of a developer (me) being out of touch with actual usage. If I were to do this again, I would treat the previous behavior (allowing sequence duplicates) as a "feature" with deprecation warning instead of a bug that needed to be fixed immediately.

Why do we need to require deduplication before filter?

The reasoning is clearer for metadata input: it is indexed on the ID column, which must be unique. It's less clear for sequences, but consistency between metadata and sequence output would be my reasoning.

Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows?

It's doable, though it would be specific to sequences: something along the lines of --allow-duplicate-sequences. If the sole purpose is backwards compatibility, we could introduce it temporarily with a deprecation warning?
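For concreteness, a temporary opt-out of that kind might look something like the sketch below. Everything here is hypothetical: `--allow-duplicate-sequences` is only the name floated in this comment and is not an existing augur option, and `build_parser` and `check_duplicates` are illustrative helpers rather than augur code.

```python
import argparse
import warnings
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the hypothetical opt-out flag."""
    parser = argparse.ArgumentParser(prog="filter-sketch")
    parser.add_argument(
        "--allow-duplicate-sequences",
        action="store_true",
        help="(hypothetical, deprecated) warn instead of erroring on duplicate sequence IDs",
    )
    return parser


def check_duplicates(duplicates: List[str], allow: bool) -> None:
    """Raise on duplicates unless the opt-out is set, in which case warn."""
    if not duplicates:
        return
    message = "The following strains are duplicated: " + ", ".join(duplicates)
    if allow:
        # FutureWarning is shown to end users by default, unlike DeprecationWarning.
        warnings.warn(
            message + ". Allowing duplicates is deprecated and will become an error.",
            FutureWarning,
            stacklevel=2,
        )
    else:
        raise ValueError(message)


args = build_parser().parse_args(["--allow-duplicate-sequences"])
check_duplicates(["Wuhan/Hu-1/2019"], allow=args.allow_duplicate_sequences)
```

With the flag set, the run continues and emits a warning; without it, the duplicate check fails as it does after this PR.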

Labels
breaking Makes a backwards incompatible change and should wait for major release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

filter: --output-sequences silently allows duplicates
6 participants