
Fix global score update bug in MultiLeafKnnCollector #13463

Merged
8 commits merged into apache:main on Jun 19, 2024

Conversation

@gsmiller
Contributor

@gsmiller commented Jun 6, 2024

Addresses the bug described in GH #13462

Relates to #12962

Greg Miller added 6 commits June 6, 2024 06:55
There are corner cases where the min global score can incorrectly stay
lower than it should due to the incorrect assumption that heaps are fully
ordered.
@benwtrent requested a review from mayya-sharipova on June 7, 2024 at 12:36
@@ -103,8 +105,11 @@ public boolean collect(int docId, float similarity) {
if (kResultsCollected) {
// as we've collected k results, we can start do periodic updates with the global queue
if (firstKResultsCollected || (subCollector.visitedCount() & interval) == 0) {
cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
updatesQueue.clear();
for (int i = 0; i < k(); i++) {
@msokolov
Contributor

could you add a comment explaining what this is up to? I think the idea is to "offer" a sorted array instead of a heap?

@gsmiller
Contributor Author

Yeah, thanks @msokolov for the suggestion. Added a comment to explain. But yes, #offer expects a fully sorted array of values and has short-circuiting logic that depends on that assumption, but the backing array of a heap is only partially ordered, so you can hit edge-cases where #offer incorrectly short-circuits too early.
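
For anyone less familiar with binary heaps, here is a minimal, self-contained sketch (plain Java arrays, not Lucene's FloatHeap) of why a heap's backing array is only partially ordered: each parent is <= its children, but siblings and subtrees are not sorted relative to each other, which is what invalidates the sorted-input assumption behind the short-circuit.

import java.util.Arrays;

public class HeapLayoutDemo {
  public static void main(String[] args) {
    // Build a 7-element binary min-heap with standard sift-up inserts.
    float[] heap = new float[7];
    int size = 0;
    for (float v : new float[] {200f, 14f, 300f, 10f, 11f, 12f, 13f}) {
      int i = size++;
      heap[i] = v;
      while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        float tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
      }
    }
    // Prints [10.0, 11.0, 12.0, 200.0, 14.0, 300.0, 13.0]: the minimum sits at
    // index 0, but the array as a whole is not sorted, so a scan that assumes
    // sorted input can stop before it ever sees the largest values.
    System.out.println(Arrays.toString(heap));
  }
}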

@benwtrent
Member

@gsmiller I think your patch has a bug. I tried running with Lucene util to benchmark this to see if there is any perf change and got an exception. I am verifying my settings, but wanted to warn you sooner rather than later.

Exception in thread "main" java.lang.IllegalStateException: The heap is empty
	at org.apache.lucene.util.hnsw.FloatHeap.poll(FloatHeap.java:82)
	at org.apache.lucene.search.knn.MultiLeafKnnCollector.collect(MultiLeafKnnCollector.java:112)
	at org.apache.lucene.util.hnsw.OrdinalTranslatedKnnCollector.collect(OrdinalTranslatedKnnCollector.java:64)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:242)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:105)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:70)
	at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader.search(Lucene99HnswVectorsReader.java:263)
	at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:275)
	at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:262)
	at org.apache.lucene.search.KnnFloatVectorQuery.approximateSearch(KnnFloatVectorQuery.java:95)
	at org.apache.lucene.search.AbstractKnnVectorQuery.getLeafResults(AbstractKnnVectorQuery.java:127)
	at org.apache.lucene.search.AbstractKnnVectorQuery.searchLeaf(AbstractKnnVectorQuery.java:109)
	at org.apache.lucene.search.AbstractKnnVectorQuery.lambda$rewrite$0(AbstractKnnVectorQuery.java:92)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:152)
	at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
	at org.apache.lucene.search.AbstractKnnVectorQuery.rewrite(AbstractKnnVectorQuery.java:94)
	at org.apache.lucene.search.KnnFloatVectorQuery.rewrite(KnnFloatVectorQuery.java:45)
	at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:741)
	at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:752)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:634)
	at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:483)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:506)
	at knn.KnnGraphTester.doKnnVectorQuery(Unknown Source)
	at knn.KnnGraphTester.testSearch(Unknown Source)
	at knn.KnnGraphTester.run(Unknown Source)
	at knn.KnnGraphTester.runWithCleanUp(Unknown Source)
	at knn.KnnGraphTester.main(Unknown Source)

@benwtrent
Member

benwtrent commented Jun 7, 2024

OK, maybe there is a bigger bug here, or maybe this bug was actually making things better and we can reduce some constants to improve performance.

I benchmarked over 1M 768-dimensional vectors, flushing every 12MB to get some segments. This resulted in 25 segments (some fairly small).

With your change:

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	
0.909	12.03	1000000	0	16	100	        31658

Baseline:

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	
0.875	 5.37	1000000	0	16	100	        13222

The patch for your change:
patch.patch

Note: luceneutil is currently broken against the latest Lucene changes in main; I will push a fix there soonish.

@benwtrent
Member

mikemccand/luceneutil#270

This fixes luceneutil for the latest Lucene changes.

@benwtrent
Member

It would be good to have @mayya-sharipova's input here.

@gsmiller
Contributor Author

gsmiller commented Jun 7, 2024

Ah @benwtrent, good catch. Semi-sneaky that updatesQueue can have fewer than k results when the global update happens, but that makes sense. A few things:

  1. I brought your patch (tweaked just a little) into this PR.
  2. Seems like a test would be useful to cover the bug you exposed. I'll try to add something here to cover that soon.
  3. Could you share any details on how you're running luceneutil? I don't really have experience benchmarking in the KNN space (trying to "learn to fish" here). Thanks!
  4. Hmm... never mind on (3). I found this documentation so I'll play around with that first.

@benwtrent
Member

benwtrent commented Jun 8, 2024

@gsmiller

My directories are:

<common_parent_path>/candidate <- Lucene branch
<common_parent_path>/baseline <- Lucene main
<common_parent_path>/util <- lucene util

Once you have the directories all set up:

  • ant build to compile whenever you adjust things and before your first run. For this particular test, I went into KnnIndexer.java and adjusted WRITER_BUFFER_MB down to 12MB
  • python ./src/python/knnPerfTest.py to actually run the test, but you first need some data and need to point the script at it.
  • cohere_download_and_format.zip contains a bash script to download a bunch of parquet files and a python script to format them for ingesting. I think this might download Cohere v3 (1024 dims, dot_product for the similarity).
  • For knnPerfTest, I point it at the train and test sets I just built and adjust whatever settings I care about.

Pro tip: build your index just once (via the reindex parameter in knnPerfTest); you can then run your candidate vs. baseline queries against it, which is WAY faster.

EDIT: One more thing you might run into: Lucene no longer supplies hppc, and luceneutil just forcibly looks in your gradle cache. I happened to have hppc 0.8.1, so I adjusted which version luceneutil looks for (it expects 0.9.1). If you already have hppc in your cache, this shouldn't affect you.

@gsmiller
Contributor Author

Thanks @benwtrent. As another data point, I ran knnPerfTest with the vectors that luceneutil downloads as part of setup.py (enwiki-20120502-lines-1k-100d.vec / vector-task-100d.vec) with the "stock" parameters in knnPerfTest. I didn't tweak the write buffer at all; I was just getting set up to run performance tests. I'll see if I can reproduce the results you're seeing with Cohere next. In the meantime, this is what I saw with the "stock" data:

BASELINE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.897    0.19   10000   0       64      250     718     2466    1.00    post-filter
0.820    0.29   100000  0       64      250     1023    37118   1.00    post-filter
0.801    0.43   200000  0       64      250     1139    94993   1.00    post-filter

CANDIDATE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.897    0.19   10000   0       64      250     718     2462    1.00    post-filter
0.820    0.29   100000  0       64      250     1023    36852   1.00    post-filter
0.801    0.36   200000  0       64      250     1139    94398   1.00    post-filter

@benwtrent
Member

@gsmiller did you have more than one segment?

This branch of the code only occurs if there is more than one segment.

By default, the buffer size is 1GB, which for smaller datasets means everything fits comfortably in a single flushed segment.

@gsmiller
Contributor Author

@benwtrent ah, you're right. I only had a single segment. I played with making the write buffer really small but couldn't get more than one segment with that 100d enwiki dataset. I ran with cohere data along with a 12MB write buffer to try to reproduce your results. I'm probably doing something wrong still, but I at least confirmed I had more than one segment in my index (ended up producing 16 in my run). I'll post the results I got with that dataset here, but I'm not sure I trust them at this point given the low recall being reported (I suspect I just have something wrong with my setup):

BASELINE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.385   13.05   1000000 0       16      100     22696   255473  1.00    post-filter

CANDIDATE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.383   13.60   1000000 0       16      100     23645   249901  1.00    post-filter

@benwtrent
Member

@gsmiller I have run into those weird recall numbers before:

  • My vector data was corrupted and thus contained many zero-valued vectors.
  • My dimensions were incorrectly configured and I was using 100 or 768 dimensions when I should have been using 1024 or 384.

@mayya-sharipova
Contributor

@gsmiller Thanks for discovering and fixing this bug

I am wondering how you discovered it. I was thinking about why the bug never manifested, and I think it is probably impossible to get the situation you described in your test: as soon as we collect k results, we set minAcceptedSimilarity to the min score from the globalQueue, and new updates coming into the updatesQueue can't contain values less than this min score.

Nevertheless, it is definitely a bug and worth fixing.

lock.lock();
try {
- for (int i = values.length - 1; i >= 0; i--) {
+ for (int i = len - 1; i >= 0; i--) {
Contributor

As an alternative, we don't need to break, and could always offer all values.

@gsmiller
Contributor Author

Yeah, I considered that as well and I don't really have a strong opinion either way. Offering everything without short-circuiting is probably a slightly cleaner/simpler solution, so maybe that's the better way to go unless performance testing shows otherwise for some reason (but I find it hard to imagine we'd see a big difference). That solution also removes the need for the scratch array, which is nice.
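
To make the two options concrete, here is a minimal sketch using plain java.util rather than Lucene's FloatHeap/BlockingFloatHeap (drainAscending and globalMin are illustrative names, not the PR's actual code). drainAscending roughly stands in for the scratch array the patch builds by polling the local updates queue; with fully sorted input the backwards scan may safely stop at the first non-competitive value, while the alternative simply offers every value and skips the early exit.

import java.util.PriorityQueue;

public class OfferStrategies {
  // Drain a min-heap into ascending order (mirrors polling the local queue
  // into a scratch array before offering it to the global queue).
  static float[] drainAscending(PriorityQueue<Float> localHeap) {
    float[] sorted = new float[localHeap.size()];
    for (int i = 0; i < sorted.length; i++) {
      sorted[i] = localHeap.poll();
    }
    return sorted;
  }

  public static void main(String[] args) {
    PriorityQueue<Float> local = new PriorityQueue<>();
    for (float v : new float[] {200f, 14f, 300f, 10f, 11f, 12f, 13f}) {
      local.add(v);
    }

    float globalMin = 100f; // current minimum of the global queue
    float[] sorted = drainAscending(local); // [10, 11, 12, 13, 14, 200, 300]

    // Option (a): keep the short-circuit; it is valid now because the input is
    // sorted, so the first non-competitive value proves the rest are too.
    for (int i = sorted.length - 1; i >= 0; i--) {
      if (sorted[i] <= globalMin) break; // 300 and 200 were already offered
      System.out.println("offer " + sorted[i]);
    }

    // Option (b): no break at all; offer everything and let the bounded global
    // queue reject values that cannot improve it. In the real code this could
    // be done directly on the (unsorted) heap array, removing the scratch copy.
    for (float v : sorted) {
      System.out.println("offer (unconditionally) " + v);
    }
  }
}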

@benwtrent
Member

@mayya-sharipova are you concerned at all that the performance gains we thought we had seem to disappear with this bug fix?

Could you retest to verify? What I am seeing locally is almost no performance gains when it comes to number of vectors visited now.

@gsmiller
Contributor Author

gsmiller commented Jun 11, 2024

@mayya-sharipova:

how did you discover that?

Static code analysis. I was digging through some of the code mostly out of curiosity, trying to understand the rapid progress being made in this space, and it just jumped out to me.

Maybe it's not possible to repro in the wild? But I think it can happen... here's one way (I think... but please point out if this is wrong and I'm overlooking something):

To borrow from the repro in the test case I added, imagine two independent collectors collecting from different segments concurrently, and imagine k == 7. collector1 collects hits with similarities of [100, 101, 102, 103, 104, 105, 106] and collector2 collects [10, 11, 12, 13, 14, 200, 300]. Let's say this happens more-or-less in parallel with an empty global heap (nothing has updated global state yet). Imagine collector1 gets the lock on the global update first, so at that point the global heap contains [100, 101, 102, 103, 104, 105, 106] and collector1 now has a min-score of 100 (which it actually already had before the global update just based on its local information). At this point in time, the min-score for collector2 is still 10 based purely on its local information. It then acquires the lock and does its update to the global heap. What should happen is that both 200 and 300 should end up in the global heap, pushing out 100 and 101, establishing a min-score for collector2 of 102. But because of the bug, and the fact that the memory layout of collector2's heap is [10, 11, 12, 13, 200, 14, 300], only 300 will get added to the global heap and the min-score for collector2 will be 101.

I think the crux of it is that a collector only establishes a new global min-score by getting information from the global heap after it does its flush. So rewinding a bit in the example, even if collector1 collected its first k hits before collector2 even gets started, and the global heap "knows" about a min-score of 100, collector2 doesn't get this information until after it flushes its local hits, so nothing prevents collector2 from collecting [10, 11, 12, 13, 14].

Again, this is just me working through what the code is doing based on reading through it. I could be missing something important.
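
A small, self-contained simulation of that scenario (plain java.util, not the actual MultiLeafKnnCollector or BlockingFloatHeap classes; K, boundedOffer, and the heap layout follow the example above and are simplified stand-ins): it plays back collector2's update against a global queue already holding [100..106] and shows the short-circuiting update leaving collector2's min-score at 101, while an update that considers every value lands on 102.

import java.util.PriorityQueue;

public class GlobalMinDemo {
  static final int K = 7;

  // Bounded min-heap of size K: keeps the K largest similarities seen so far.
  static void boundedOffer(PriorityQueue<Float> global, float v) {
    if (global.size() < K) {
      global.add(v);
    } else if (v > global.peek()) {
      global.poll();
      global.add(v);
    }
  }

  public static void main(String[] args) {
    // Global queue after collector1 flushed [100..106]; its minimum is 100.
    PriorityQueue<Float> global = new PriorityQueue<>();
    for (float v = 100; v <= 106; v++) {
      boundedOffer(global, v);
    }

    // collector2's local heap backing array, laid out as described above.
    float[] collector2Heap = {10, 11, 12, 13, 200, 14, 300};

    // Buggy update: scan from the end and stop at the first value that cannot
    // beat the current global minimum (valid only for sorted input).
    PriorityQueue<Float> buggy = new PriorityQueue<>(global);
    for (int i = collector2Heap.length - 1; i >= 0; i--) {
      if (collector2Heap[i] <= buggy.peek()) break; // breaks on 14, skipping 200
      boundedOffer(buggy, collector2Heap[i]);
    }
    System.out.println("short-circuit min = " + buggy.peek()); // 101.0

    // Fixed update: consider every value; 200 and 300 both displace old minima.
    PriorityQueue<Float> fixed = new PriorityQueue<>(global);
    for (float v : collector2Heap) {
      boundedOffer(fixed, v);
    }
    System.out.println("offer-all min = " + fixed.peek()); // 102.0
  }
}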

@gsmiller
Contributor Author

@benwtrent ++ to understanding the performance regression before pushing. I haven't made any more progress there personally. Agreed with waiting to merge until we understand what's going on there.

@benwtrent
Member

@gsmiller @mayya-sharipova

OK, I did some more testing. My initial testing didn't fully exercise these paths since the segments were still very large, so I switched to flushing every 1MB.

CohereV2 (1M vectors, 768 dims, flushing every 1MB, mip similarity). Vectors visited:

fanout      0      10     50     100    200
candidate   12715  13319  15642  18312  23225
baseline    15759  16514  19449  22457  27245

So, this PR is actually BETTER than baseline.

Additionally, I ran this same index with NO multi-leaf collector: 36361

My previous experiments might have just hit a bad edge case where the difference between the two is so slight that the candidate came out worse.

I am gonna test with a different data set unless others beat me to it.

Hopefully further testing proves out that this candidate is indeed overall better :). I would be very confused if it was truly worse.

@benwtrent
Member

OK, I used the same methodology, but with CohereV3 and 5M vectors. Vectors visited:

fanout      0      10     50     100    200
candidate   18724  19725  23761  28416  36443
baseline    21691  22932  28120  33488  42869

No multi-leaf-collector: 47482

@gsmiller
Contributor Author

Thanks @benwtrent for the continued testing (just now saw this... was away for a few days). I'll work on getting this merged here in a little bit. (and thanks @mayya-sharipova for the review!)

gsmiller merged commit 937c004 into apache:main on Jun 19, 2024
3 checks passed
gsmiller deleted the hnsw/collector-bug-fix-pr branch on June 19, 2024 at 01:34
@msokolov
Contributor

note: I see this tagged for 9.12.0 and it's now merged -- do we also intend to (or did we already?) backport to 9.x?

@benwtrent
Member

@msokolov commit 6a5fd8b from 3 days ago seems like a backport to 9.12
