-
-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simplify deduplication logic #1607
Conversation
ca0c541
to
94517ed
Compare
94517ed
to
9d7ddf5
Compare
This PR improves the deduplication of https://pelias.github.io/compare/#/v1/autocomplete?text=berlin |
93ec12d
to
5bc912d
Compare
5bc912d
to
560fff8
Compare
Good-to-go! I think this is a positive change which should help reduce some of the noisier deduplication cases that have been plaguing us for some time now 🎉 |
housekeeping: the dedupe-without-skiplist branch can probably be deleted after merging this. |
Cool, I really love the simplicity of this logic. Even if it's technically more computationally expensive (though I would imagine the effect is not measurable), it's much much easier to understand. I do think the case you mentioned with transitive deduplication has prevented a lot of our deduplication checks from performing as we expect. My guess is this is good to merge but let me try it out with some of the other recent changes I've been experimenting with, like #1609 and #1606. |
Okay, after some fairly complex testing I can confirm this PR improves deduplication, or more precisely it allows some of the newer deduplication strategies we are experimenting with to be more effective. Here's a summary of some results from testing with pelias/acceptance-tests#557
The interesting thing to note is that while this PR by itself does not change the test results, nor does it change the results when combined with either of our two other deduplication PRs, when all three PRs are combined, there is an improvement over not using this PR. So that seems to confirm to me the idea that transitive deduplication is important. One last thing to test is that this PR (as well as a combination of this and the other PRs) does not introduce regressions. That info will be coming soon, then I think we are good to go. |
This test case came out of evaluating pelias/api#1607. It passes with the current API master branch, fails with pelias/api#1607, but passes with pelias/api#1612
this PR greatly simplifies the deduplication logic.
The main difference is that previously we were focussing on keeping the superior rows and now we're focussing on removing the inferior rows, a subtle but important distinction!
In particular the new logic makes one positive functional change:
in the case where A dedupes B and B dedupes C, prior to this PR the C record may or may not remain in the results, depending on the order, due to B being present in the 'skip list'.
the unit test included demonstrates this ^ as it fails prior to this PR