-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Error on sequence duplicates #1613
filter: Error on sequence duplicates #1613
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1613 +/- ##
==========================================
+ Coverage 71.06% 71.14% +0.07%
==========================================
Files 79 79
Lines 8268 8273 +5
Branches 2010 2011 +1
==========================================
+ Hits 5876 5886 +10
+ Misses 2101 2099 -2
+ Partials 291 288 -3 ☔ View full report in Codecov by Sentry. |
e9e31f5
to
ab967fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hoping we don't run into downstream errors in pathogen repos since we are generally using the GenBank accession as the sequence id. 🤞 (pathogen-repo-ci wouldn't catch these errors since those jobs use example data that do not have duplicates)
0f4cf31
to
1a7d4d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good by inspection and will be a very helpful change
Prepping for duplicate handling which should be done regardless of args.output (sequence output).
Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for --metadata.
This is similar enough to the existing test for metadata that I've repurposed that file to test duplicates in both input types.
1a7d4d6
to
da9301c
Compare
I will plan to merge this once another breaking change from DEPRECATED.md is also ready to be merged, so they can be released in the same major version. |
UpdateI found the reasoning for this change in #810 (comment) (one needs to follow quite a few links to eventually end up there), in the future might be nice to add this to the PR description, it would have saved me quite some time. It might have been nice as well to point out in the changelog that to migrate one should use Original commentHmm, just seeing this result in CI failure in ncov - not sure this is such a great change? It makes filter less usable as an entry point. Why do we need to require deduplication before filter? The reasoning is given just as:
Who expects it to be de-duplicated? Could we potentially allow opt-out from throwing an error to make it easier to upgrade without having to modify workflows? Maybe we should only error if the sequences are different but happen to have the same name? If they are identical there's no reason to error, really? https://github.com/nextstrain/ncov/actions/runs/11164069511/job/31032539057#step:5:12039
|
@corneliusroemer thanks for your thoughts here and apologies for making an abrupt and unclear change. This is a good case of a developer (me) being out of touch with actual usage. If I were to do this again, I would treat the previous behavior (allowing sequence duplicates) as a "feature" with deprecation warning instead of a bug that needed to be fixed immediately.
The reasoning is more clear for metadata input: it is indexed on the ID column which must be unique. It's less clear for sequences, but consistency between metadata and sequence output would be my reasoning.
It's do-able though it would be specific to sequences, something along the lines of |
Description of proposed changes
Data is expected to be de-duplicated prior to running augur filter. A similar check is already in place for
--metadata
.Reasoning for this change is here: #810 (comment)
Related issue(s)
Closes #1602
Checklist