
Fix global score update bug in MultiLeafKnnCollector #13463

Merged
8 commits merged into apache:main on Jun 19, 2024

Conversation

@gsmiller
Contributor

@gsmiller commented Jun 6, 2024

Addresses the bug described in GH #13462

Relates to #12962

Greg Miller added 6 commits June 6, 2024 06:55
There are corner cases where the min global score can incorrectly stay
lower than it should due to the incorrect assumption that heaps are fully
ordered.
@benwtrent requested a review from mayya-sharipova on June 7, 2024 at 12:36
@@ -103,8 +105,11 @@ public boolean collect(int docId, float similarity) {
if (kResultsCollected) {
// as we've collected k results, we can start do periodic updates with the global queue
if (firstKResultsCollected || (subCollector.visitedCount() & interval) == 0) {
cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
updatesQueue.clear();
for (int i = 0; i < k(); i++) {
@msokolov
Contributor

could you add a comment explaining what this is up to? I think the idea is to "offer" a sorted array instead of a heap?

@gsmiller
Contributor Author

Yeah, thanks @msokolov for the suggestion. Added a comment to explain. But yes, #offer expects a fully sorted array of values and has short-circuiting logic that depends on that assumption, but the backing array of a heap is only partially ordered, so you can hit edge-cases where #offer incorrectly short-circuits too early.
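
For anyone less familiar with binary heaps, here is a minimal, self-contained sketch (plain Java arrays, not Lucene's FloatHeap) of why a heap's backing array is only partially ordered: each parent is <= its children, but siblings and subtrees are not sorted relative to each other, which is what invalidates the sorted-input assumption behind the short-circuit.

import java.util.Arrays;

public class HeapLayoutDemo {
  public static void main(String[] args) {
    // Build a 7-element binary min-heap with standard sift-up inserts.
    float[] heap = new float[7];
    int size = 0;
    for (float v : new float[] {200f, 14f, 300f, 10f, 11f, 12f, 13f}) {
      int i = size++;
      heap[i] = v;
      while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        float tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
      }
    }
    // Prints [10.0, 11.0, 12.0, 200.0, 14.0, 300.0, 13.0]: the minimum sits at
    // index 0, but the array as a whole is not sorted, so a scan that assumes
    // sorted input can stop before it ever sees the largest values.
    System.out.println(Arrays.toString(heap));
  }
}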

@benwtrent
Member

@gsmiller I think your patch has a bug. I tried running with Lucene util to benchmark this to see if there is any perf change and got an exception. I am verifying my settings, but wanted to warn you sooner rather than later.

Exception in thread "main" java.lang.IllegalStateException: The heap is empty
	at org.apache.lucene.util.hnsw.FloatHeap.poll(FloatHeap.java:82)
	at org.apache.lucene.search.knn.MultiLeafKnnCollector.collect(MultiLeafKnnCollector.java:112)
	at org.apache.lucene.util.hnsw.OrdinalTranslatedKnnCollector.collect(OrdinalTranslatedKnnCollector.java:64)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:242)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:105)
	at org.apache.lucene.util.hnsw.HnswGraphSearcher.search(HnswGraphSearcher.java:70)
	at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader.search(Lucene99HnswVectorsReader.java:263)
	at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.search(PerFieldKnnVectorsFormat.java:275)
	at org.apache.lucene.index.CodecReader.searchNearestVectors(CodecReader.java:262)
	at org.apache.lucene.search.KnnFloatVectorQuery.approximateSearch(KnnFloatVectorQuery.java:95)
	at org.apache.lucene.search.AbstractKnnVectorQuery.getLeafResults(AbstractKnnVectorQuery.java:127)
	at org.apache.lucene.search.AbstractKnnVectorQuery.searchLeaf(AbstractKnnVectorQuery.java:109)
	at org.apache.lucene.search.AbstractKnnVectorQuery.lambda$rewrite$0(AbstractKnnVectorQuery.java:92)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:152)
	at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
	at org.apache.lucene.search.AbstractKnnVectorQuery.rewrite(AbstractKnnVectorQuery.java:94)
	at org.apache.lucene.search.KnnFloatVectorQuery.rewrite(KnnFloatVectorQuery.java:45)
	at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:741)
	at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:752)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:634)
	at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:483)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:506)
	at knn.KnnGraphTester.doKnnVectorQuery(Unknown Source)
	at knn.KnnGraphTester.testSearch(Unknown Source)
	at knn.KnnGraphTester.run(Unknown Source)
	at knn.KnnGraphTester.runWithCleanUp(Unknown Source)
	at knn.KnnGraphTester.main(Unknown Source)

@benwtrent
Member

benwtrent commented Jun 7, 2024

OK, maybe there is a bigger bug here, or maybe this bug was actually making things better and we can reduce some constants to improve performance.

I benchmarked over 1M 768-dimensional vectors, flushing every 12MB to get some segments. This resulted in 25 segments (some fairly small).

With your change:

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	
0.909	12.03	1000000	0	16	100	        31658

Baseline:

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	
0.875	 5.37	1000000	0	16	100	        13222

The patch for your change:
patch.patch

Note: luceneutil is currently broken against the latest Lucene changes in main; I will push a fix there soonish.

@benwtrent
Member

mikemccand/luceneutil#270

This fixes luceneutil for the latest Lucene changes.

@benwtrent
Member

It would be good to have @mayya-sharipova's input here.

@gsmiller
Contributor Author

gsmiller commented Jun 7, 2024

Ah @benwtrent, good catch. Semi-sneaky that updatesQueue can have fewer than k results when the global update happens, but that makes sense. A few things:

  1. I brought your patch (tweaked just a little) into this PR.
  2. Seems like a test would be useful to cover the bug you exposed. I'll try to add something here to cover that soon.
  3. Could you share any details on how you're running luceneutil? I don't really have experience benchmarking in the KNN space (trying to "learn to fish" here). Thanks!
  4. Hmm... never mind on (3). I found this documentation so I'll play around with that first.

@benwtrent
Member

benwtrent commented Jun 8, 2024

@gsmiller

My directories are:

<common_parent_path>/candidate <- Lucene branch
<common_parent_path>/baseline <- Lucene main
<common_parent_path>/util <- lucene util

Once you have the directories all set up:

  • ant build to compile whenever you adjust things and before your first run. For this particular test, I went into KnnIndexer.java and adjusted WRITER_BUFFER_MB down to 12MB
  • python ./src/python/knnPerfTest.py to actually run the test, but you first need some data and need to point the script at it.
  • cohere_download_and_format.zip contains a bash script to download a bunch of parquet files and a python script to format them for ingesting. I think this might download Cohere v3 (1024 dims, dot_product for the similarity).
  • For knnPerfTest, I point it at the train and test sets I just built and adjust whatever settings I care about.

Pro tip: build your index just once (via the reindex parameter in knnPerfTest); you can then run your candidate vs. baseline queries against it, which is WAY faster.

EDIT: One more thing you might run into: Lucene no longer supplies hppc, and luceneutil just forcibly looks in your gradle cache. I happened to have hppc 0.8.1, so I adjusted which version luceneutil looks for (it expects 0.9.1). If you already have hppc in your cache, this shouldn't affect you.

@gsmiller
Contributor Author

Thanks @benwtrent. As another data point, I ran knnPerfTest with the vectors that luceneutil downloads as part of setup.py (enwiki-20120502-lines-1k-100d.vec / vector-task-100d.vec) with the "stock" parameters in knnPerfTest. I didn't tweak the write buffer at all; I was just getting set up to run performance tests. I'll see if I can reproduce the results you're seeing with Cohere next. In the meantime, this is what I saw with the "stock" data:

BASELINE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.897    0.19   10000   0       64      250     718     2466    1.00    post-filter
0.820    0.29   100000  0       64      250     1023    37118   1.00    post-filter
0.801    0.43   200000  0       64      250     1139    94993   1.00    post-filter

CANDIDATE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.897    0.19   10000   0       64      250     718     2462    1.00    post-filter
0.820    0.29   100000  0       64      250     1023    36852   1.00    post-filter
0.801    0.36   200000  0       64      250     1139    94398   1.00    post-filter

@benwtrent
Member

@gsmiller did you have more than one segment?

This branch of the code only occurs if there is more than one segment.

By default, the buffer size is 1GB, which for smaller datasets means everything fits comfortably in a single flushed segment.

@gsmiller
Contributor Author

@benwtrent ah, you're right. I only had a single segment. I played with making the write buffer really small but couldn't get more than one segment with that 100d enwiki dataset. I ran with cohere data along with a 12MB write buffer to try to reproduce your results. I'm probably doing something wrong still, but I at least confirmed I had more than one segment in my index (ended up producing 16 in my run). I'll post the results I got with that dataset here, but I'm not sure I trust them at this point given the low recall being reported (I suspect I just have something wrong with my setup):

BASELINE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.385   13.05   1000000 0       16      100     22696   255473  1.00    post-filter

CANDIDATE
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.383   13.60   1000000 0       16      100     23645   249901  1.00    post-filter

@benwtrent
Member

@gsmiller I have run into those weird recall numbers before:

  • My vector data was corrupted and thus contained many zero-valued vectors.
  • My dimensions were incorrectly configured and I was using 100 or 768 dimensions when I should have been using 1024 or 384.

@mayya-sharipova
Contributor

@gsmiller Thanks for discovering and fixing this bug

I am wondering how you discovered it. I was thinking about why the bug never manifested, and I think it is probably impossible to get the situation you described in your test: as soon as we collect k results, we set minAcceptedSimilarity to the min score from the globalQueue, and new updates coming into the updatesQueue can't contain values less than this min score.

Nevertheless, it is definitely a bug and worth fixing.

lock.lock();
try {
- for (int i = values.length - 1; i >= 0; i--) {
+ for (int i = len - 1; i >= 0; i--) {
Contributor

As an alternative, we don't need to break, and could always offer all values.

@gsmiller
Contributor Author

Yeah, I considered that as well and I don't really have a strong opinion either way. Offering everything without short-circuiting is probably a slightly cleaner/simpler solution, so maybe that's the better way to go unless performance testing shows otherwise for some reason (but I find it hard to imagine we'd see a big difference). That solution also removes the need for the scratch array, which is nice.
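
To make the two options concrete, here is a minimal sketch using plain java.util rather than Lucene's FloatHeap/BlockingFloatHeap (drainAscending and globalMin are illustrative names, not the PR's actual code). drainAscending roughly stands in for the scratch array the patch builds by polling the local updates queue; with fully sorted input the backwards scan may safely stop at the first non-competitive value, while the alternative simply offers every value and skips the early exit.

import java.util.PriorityQueue;

public class OfferStrategies {
  // Drain a min-heap into ascending order (mirrors polling the local queue
  // into a scratch array before offering it to the global queue).
  static float[] drainAscending(PriorityQueue<Float> localHeap) {
    float[] sorted = new float[localHeap.size()];
    for (int i = 0; i < sorted.length; i++) {
      sorted[i] = localHeap.poll();
    }
    return sorted;
  }

  public static void main(String[] args) {
    PriorityQueue<Float> local = new PriorityQueue<>();
    for (float v : new float[] {200f, 14f, 300f, 10f, 11f, 12f, 13f}) {
      local.add(v);
    }

    float globalMin = 100f; // current minimum of the global queue
    float[] sorted = drainAscending(local); // [10, 11, 12, 13, 14, 200, 300]

    // Option (a): keep the short-circuit; it is valid now because the input is
    // sorted, so the first non-competitive value proves the rest are too.
    for (int i = sorted.length - 1; i >= 0; i--) {
      if (sorted[i] <= globalMin) break; // 300 and 200 were already offered
      System.out.println("offer " + sorted[i]);
    }

    // Option (b): no break at all; offer everything and let the bounded global
    // queue reject values that cannot improve it. In the real code this could
    // be done directly on the (unsorted) heap array, removing the scratch copy.
    for (float v : sorted) {
      System.out.println("offer (unconditionally) " + v);
    }
  }
}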

@benwtrent
Member

@mayya-sharipova are you concerned at all that the performance gains we thought we had seem to disappear with this bug fix?

Could you retest to verify? What I am seeing locally is almost no performance gains when it comes to number of vectors visited now.

@gsmiller
Contributor Author

gsmiller commented Jun 11, 2024

@mayya-sharipova:

how did you discover that?

Static code analysis. I was digging through some of the code mostly out of curiosity, trying to understand the rapid progress being made in this space, and it just jumped out to me.

Maybe it's not possible to repro in the wild? But I think it can happen... here's one way (I think... but please point out if this is wrong and I'm overlooking something):

To borrow from the repro in the test case I added, imagine two independent collectors collecting from different segments concurrently, and imagine k == 7. collector1 collects hits with similarities of [100, 101, 102, 103, 104, 105, 106] and collector2 collects [10, 11, 12, 13, 14, 200, 300]. Let's say this happens more-or-less in parallel with an empty global heap (nothing has updated global state yet). Imagine collector1 gets the lock on the global update first, so at that point the global heap contains [100, 101, 102, 103, 104, 105, 106] and collector1 now has a min-score of 100 (which it actually already had before the global update just based on its local information). At this point in time, the min-score for collector2 is still 10 based purely on its local information. It then acquires the lock and does its update to the global heap. What should happen is that both 200 and 300 should end up in the global heap, pushing out 100 and 101, establishing a min-score for collector2 of 102. But because of the bug, and the fact that the memory layout of collector2's heap is [10, 11, 12, 13, 200, 14, 300], only 300 will get added to the global heap and the min-score for collector2 will be 101.

I think the crux of it is that a collector only establishes a new global min-score by getting information from the global heap after it does its flush. So rewinding a bit in the example, even if collector1 collected its first k hits before collector2 even gets started, and the global heap "knows" about a min-score of 100, collector2 doesn't get this information until after it flushes its local hits, so nothing prevents collector2 from collecting [10, 11, 12, 13, 14].

Again, this is just me working through what the code is doing based on reading through it. I could be missing something important.
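
A small, self-contained simulation of that scenario (plain java.util, not the actual MultiLeafKnnCollector or BlockingFloatHeap classes; K, boundedOffer, and the heap layout follow the example above and are simplified stand-ins): it plays back collector2's update against a global queue already holding [100..106] and shows the short-circuiting update leaving collector2's min-score at 101, while an update that considers every value lands on 102.

import java.util.PriorityQueue;

public class GlobalMinDemo {
  static final int K = 7;

  // Bounded min-heap of size K: keeps the K largest similarities seen so far.
  static void boundedOffer(PriorityQueue<Float> global, float v) {
    if (global.size() < K) {
      global.add(v);
    } else if (v > global.peek()) {
      global.poll();
      global.add(v);
    }
  }

  public static void main(String[] args) {
    // Global queue after collector1 flushed [100..106]; its minimum is 100.
    PriorityQueue<Float> global = new PriorityQueue<>();
    for (float v = 100; v <= 106; v++) {
      boundedOffer(global, v);
    }

    // collector2's local heap backing array, laid out as described above.
    float[] collector2Heap = {10, 11, 12, 13, 200, 14, 300};

    // Buggy update: scan from the end and stop at the first value that cannot
    // beat the current global minimum (valid only for sorted input).
    PriorityQueue<Float> buggy = new PriorityQueue<>(global);
    for (int i = collector2Heap.length - 1; i >= 0; i--) {
      if (collector2Heap[i] <= buggy.peek()) break; // breaks on 14, skipping 200
      boundedOffer(buggy, collector2Heap[i]);
    }
    System.out.println("short-circuit min = " + buggy.peek()); // 101.0

    // Fixed update: consider every value; 200 and 300 both displace old minima.
    PriorityQueue<Float> fixed = new PriorityQueue<>(global);
    for (float v : collector2Heap) {
      boundedOffer(fixed, v);
    }
    System.out.println("offer-all min = " + fixed.peek()); // 102.0
  }
}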

@gsmiller
Contributor Author

@benwtrent ++ to understanding the performance regression before pushing. I haven't made any more progress there personally. Agreed with waiting to merge until we understand what's going on there.

@benwtrent
Member

@gsmiller @mayya-sharipova

OK, I did some more testing. My initial testing didn't fully exercise these paths since the segments were still very large, so I switched to flushing every 1MB.

CohereV2 (1M vectors, 768 dims, flushing every 1MB, mip similarity). Vectors visited:

fanout      0      10     50     100    200
candidate   12715  13319  15642  18312  23225
baseline    15759  16514  19449  22457  27245

So, this PR is actually BETTER than baseline.

Additionally, I ran this same index with NO multi-leaf collector: 36361

My previous experiments might have just hit a bad edge case where the difference between the two is so slight that the candidate came out worse.

I am gonna test with a different data set unless others beat me to it.

Hopefully further testing proves out that this candidate is indeed overall better :). I would be very confused if it was truly worse.

@benwtrent
Member

OK, I used the same methodology, but with CohereV3 and 5M vectors. Vectors visited:

fanout      0      10     50     100    200
candidate   18724  19725  23761  28416  36443
baseline    21691  22932  28120  33488  42869

No multi-leaf-collector: 47482

@gsmiller
Contributor Author

Thanks @benwtrent for the continued testing (just now saw this... was away for a few days). I'll work on getting this merged here in a little bit. (and thanks @mayya-sharipova for the review!)

gsmiller merged commit 937c004 into apache:main on Jun 19, 2024
3 checks passed
gsmiller deleted the hnsw/collector-bug-fix-pr branch on June 19, 2024 at 01:34
@msokolov
Contributor

note: I see this tagged for 9.12.0 and it's now merged -- do we also intend to (or did we already?) backport to 9.x?

@benwtrent
Member

@msokolov commit 6a5fd8b from 3 days ago seems like a backport to 9.12
