Fix global score update bug in MultiLeafKnnCollector #13463
Conversation
There are corner cases where the global min score can incorrectly stay lower than it should due to an incorrect assumption that heaps are fully ordered.
```diff
@@ -103,8 +105,11 @@ public boolean collect(int docId, float similarity) {
       if (kResultsCollected) {
         // as we've collected k results, we can start do periodic updates with the global queue
         if (firstKResultsCollected || (subCollector.visitedCount() & interval) == 0) {
-          cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
-          updatesQueue.clear();
+          for (int i = 0; i < k(); i++) {
```
could you add a comment explaining what this is up to? I think the idea is to "offer" a sorted array instead of a heap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks @msokolov for the suggestion. Added a comment to explain. But yes, `#offer` expects a fully sorted array of values and has short-circuiting logic that depends on that assumption, while the backing array of a heap is only partially ordered, so you can hit edge cases where `#offer` incorrectly short-circuits too early.
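To make the failure mode concrete, here is a small self-contained sketch (hypothetical simplified logic, not Lucene's actual `FloatHeap` code): a scan that short-circuits on the first non-competitive value is only correct if the input is fully sorted, and a heap's backing array is merely heap-ordered.

```java
public class HeapOrderDemo {
  // Simplified stand-in for a short-circuiting offer scan: walk from the
  // end and stop at the first non-competitive value, which implicitly
  // assumes `values` is sorted ascending.
  static int countOffered(float[] values, float minSim) {
    int offered = 0;
    for (int i = values.length - 1; i >= 0; i--) {
      if (values[i] <= minSim) {
        break; // wrong if a competitive value sits earlier in the array
      }
      offered++;
    }
    return offered;
  }

  public static void main(String[] args) {
    // A valid min-heap layout: every parent <= its children,
    // but the array as a whole is NOT sorted.
    float[] heapOrdered = {0.1f, 0.2f, 0.6f, 0.9f, 0.3f};
    float[] sorted = {0.1f, 0.2f, 0.3f, 0.6f, 0.9f};
    float minSim = 0.5f;

    // Scanning the raw heap array breaks immediately on 0.3 and
    // misses the competitive 0.9 and 0.6.
    System.out.println(countOffered(heapOrdered, minSim)); // 0
    // Scanning a fully sorted array finds both competitive values.
    System.out.println(countOffered(sorted, minSim)); // 2
  }
}
```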
@gsmiller I think your patch has a bug. I tried running with Lucene util to benchmark this to see if there is any perf change and got an exception. I am verifying my settings, but wanted to warn you sooner rather than later.
OK, maybe there is a bigger bug here, or this bug made things even better and we can reduce some constants to improve performance. I benchmarked over 1M 768-dim vectors, flushing every 12MB to get some segments. This resulted in 25 segments (some fairly small). With your change:
Baseline:
The patch for your change:
Note: Lucene util is all mad at the latest Lucene updates in main, I will push a fix there soonish.
To fix lucene util + latest lucene changes.
It would be good to have @mayya-sharipova's input here.
Ah @benwtrent good catch. Semi-sneaky that
My directories are:
Once you have the directories all set up:
Pro tip, build your index just once (via the

EDIT: One more thing you might run into. Lucene doesn't supply hppc any longer and lucene-util just forcibly looks in your gradle cache. I actually had hppc 0.8.1, so I just adjusted the version lucene util was looking for (it looks for 0.9.1). If you already have hppc in your cache, then this shouldn't worry you.
Thanks @benwtrent. As another data point, I ran
@gsmiller did you have more than one segment? This branch of the code only occurs if there is more than one segment. By default, the buffer size is 1GB, which for smaller datasets is perfectly fine for a single segment to be flushed.
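For anyone reproducing this, a configuration sketch (not from this thread; the index path is hypothetical) of how one might shrink Lucene's flush buffer to force multiple small segments, so this multi-collector branch actually gets exercised:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class SmallSegmentIndexing {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig();
    // Flush roughly every 12MB of buffered documents instead of the
    // (much larger) benchmark default, yielding many small segments.
    cfg.setRAMBufferSizeMB(12.0);
    // Optionally keep the small segments from being merged away,
    // so the multi-leaf collection path stays exercised.
    cfg.setMergePolicy(NoMergePolicy.INSTANCE);
    try (IndexWriter writer =
        new IndexWriter(FSDirectory.open(Paths.get("/tmp/knn-index")), cfg)) {
      // ... add documents with KnnFloatVectorField values here ...
    }
  }
}
```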
@benwtrent ah, you're right. I only had a single segment. I played with making the write buffer really small but couldn't get more than one segment with that 100d enwiki dataset. I ran with cohere data along with a 12MB write buffer to try to reproduce your results. I'm probably doing something wrong still, but I at least confirmed I had more than one segment in my index (ended up producing 16 in my run). I'll post the results I got with that dataset here, but I'm not sure I trust them at this point given the low recall being reported (I suspect I just have something wrong with my setup):
@gsmiller I have run into those weird recall numbers in scenarios before:
@gsmiller Thanks for discovering and fixing this bug!
I am wondering how you discovered it? I was thinking about why the bug never manifested, and I realized it is probably impossible to get the situation you described in your test: as soon as we collect k results, we set up `minAcceptedSimilarity` as the min score from `globalQueue`, and new updates coming to `updateQueue` can't contain values less than this min score.
Nevertheless, it is definitely a bug and worth fixing.
```diff
    lock.lock();
    try {
-      for (int i = values.length - 1; i >= 0; i--) {
+      for (int i = len - 1; i >= 0; i--) {
```
As an alternative, we don't need to break, and could always offer all values.
Yeah, I considered that as well and I don't really have a strong opinion either way. Offering everything without short-circuiting is probably a slightly cleaner/simpler solution, so maybe that's a better way to go unless performance testing for some reason shows otherwise (but I find it hard to imagine we'd see a big difference). That solution removes the need for the scratch array, which is nice.
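A minimal sketch of that no-short-circuit alternative (hypothetical names, using a plain `PriorityQueue` as a stand-in for Lucene's bounded float heap): every value is offered individually, so correctness no longer depends on the input being sorted.

```java
import java.util.PriorityQueue;

public class OfferAllSketch {
  // Offer every value to a bounded min-heap of the k best scores; the
  // heap itself rejects non-competitive values, so no ordering assumption
  // is needed and no early break is possible.
  static float offerAll(float[] values, int len, PriorityQueue<Float> global, int k) {
    for (int i = 0; i < len; i++) {
      if (global.size() < k) {
        global.offer(values[i]);
      } else if (values[i] > global.peek()) {
        global.poll();           // evict the current global min
        global.offer(values[i]); // keep the better score
      }
    }
    return global.peek(); // the (possibly updated) global min similarity
  }

  public static void main(String[] args) {
    PriorityQueue<Float> global = new PriorityQueue<>();
    // Unsorted input (e.g. a heap's backing array) is handled correctly.
    float min = offerAll(new float[] {0.9f, 0.1f, 0.8f, 0.2f, 0.7f}, 5, global, 3);
    System.out.println(min); // 0.7
  }
}
```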
@mayya-sharipova are you concerned at all that the performance gains we thought we had seem to disappear with this bug fix? Could you retest to verify? What I am seeing locally is almost no performance gains when it comes to number of vectors visited now.
Static code analysis. I was digging through some of the code mostly out of curiosity, trying to understand the rapid progress being made in this space, and it just jumped out at me. Maybe it's not possible to repro in the wild? But I think it can happen... here's one way (I think... but please point out if this is wrong and I'm overlooking something). To borrow from the repro in the test case I added, imagine two independent collectors collecting from different segments concurrently, and imagine

I think the crux of it is that a collector only establishes a new global min-score by getting information from the global heap after it does its flush. So rewinding a bit in the example, even if

Again, this is just me working through what the code is doing based on reading through it. I could be missing something important.
@benwtrent ++ to understanding the performance regression before pushing. I haven't made any more progress there personally. Agreed with waiting to merge until we understand what's going on there.
OK, I did some more testing. My initial testing didn't fully exercise these paths as the segments were still very large. So, I switched to flushing at every 1MB. CohereV2 (1M, 768 dims, flushing every 1MB,
So, this PR is actually BETTER than baseline. Additionally, I ran this same index with NO multi-leaf collector: 36361. My previous experiments might have just hit a bad edge case where the difference between the two is so slight that the candidate is actually worse. I am gonna test with a different data set unless others beat me to it. Hopefully further testing proves out that this candidate is indeed overall better :). I would be very confused if it was truly worse.
OK, I used the same methodology, but with CohereV3, 5M vectors.
No multi-leaf-collector: 47482
Thanks @benwtrent for the continued testing (just now saw this... was away for a few days). I'll work on getting this merged here in a little bit. (and thanks @mayya-sharipova for the review!)
note: I see this tagged for 9.12.0 and it's now merged -- do we also intend to (or did we already?) backport to 9.x?
Addresses the bug described in GH #13462
Relates to #12962