Avoid overlapping trigrams in distanceHitIterator #779

jtibshirani · 2024-05-07T19:38:40Z

We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.

jtibshirani · 2024-05-07T19:57:49Z

I noticed this while pondering how we could approximate IDF in BM25 scoring. Here are example searches and the stats before/ after. Looking at FilesLoaded, it can decrease as much as 25%. IndexBytesLoaded increases a little, but this is an okay trade-off, especially since it should be a 'warm' data structure.

`repo:sourcegraph/sourcegraph monitors`

stats:&{ContentBytesLoaded:12663867 IndexBytesLoaded:52663 Crashes:0 Duration:0s FileCount:150 ShardFilesConsidered:0 FilesConsidered:614 FilesLoaded:611 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:1002 NgramMatches:9364 NgramLookups:140 Wait:12.875µs MatchTreeConstruction:125.25µs MatchTreeSearch:15.413708ms RegexpsConsidered:0 FlushReason:final_flush}

stats:&{ContentBytesLoaded:9884794 IndexBytesLoaded:103460 Crashes:0 Duration:0s FileCount:150 ShardFilesConsidered:0 FilesConsidered:519 FilesLoaded:516 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:1002 NgramMatches:7454 NgramLookups:140 Wait:6.875µs MatchTreeConstruction:2.005ms MatchTreeSearch:69.575334ms RegexpsConsidered:0 FlushReason:timer_expired}

`repo:sourcegraph/sourcegraph server`

stats: &{ContentBytesLoaded:44493393 IndexBytesLoaded:159535 Crashes:0 Duration:0s FileCount:2366 ShardFilesConsidered:0 FilesConsidered:2744 FilesLoaded:2603 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:26106 NgramMatches:28695 NgramLookups:104 Wait:12.624µs MatchTreeConstruction:96.417µs MatchTreeSearch:64.717125ms RegexpsConsidered:0 FlushReason:timer_expired}

stats:&{ContentBytesLoaded:41522945 IndexBytesLoaded:423944 Crashes:0 Duration:0s FileCount:2370 ShardFilesConsidered:0 FilesConsidered:2370 FilesLoaded:2228 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:26106 NgramMatches:26540 NgramLookups:112 Wait:7.167µs MatchTreeConstruction:1.040542ms MatchTreeSearch:98.6855ms RegexpsConsidered:0 FlushReason:timer_expired}

`repo:sourcegraph/sourcegraph testing`

stats:&{ContentBytesLoaded:51781859 IndexBytesLoaded:165600 Crashes:0 Duration:0s FileCount:2594 ShardFilesConsidered:0 FilesConsidered:3528 FilesLoaded:3471 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:19119 NgramMatches:23980 NgramLookups:144 Wait:3.209µs MatchTreeConstruction:56.5µs MatchTreeSearch:63.062666ms RegexpsConsidered:0 FlushReason:final_flush}

stats:&{ContentBytesLoaded:30910204 IndexBytesLoaded:269406 Crashes:0 Duration:0s FileCount:2594 ShardFilesConsidered:0 FilesConsidered:2603 FilesLoaded:2542 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:19131 NgramMatches:19372 NgramLookups:144 Wait:11.25µs MatchTreeConstruction:170.167µs MatchTreeSearch:52.275958ms RegexpsConsidered:0 FlushReason:final_flush}

keegancsmith · 2024-05-08T05:21:38Z

Drive by, will review properly later. This is a great insight!! Totally worth potentially reading more index data. My instincts tell me this will make our fast queries slightly slower, but our slower queries a lot faster. So totally worth it.

My mind now jumps to algorithms here, and it seems instead we should be finding the minimum of f(x) + f(y) such that y-x >= 3. This can be implemented in O(n) time and O(1) space. Later I can sketch out the algorithm but gotta do a school run now :)

keegancsmith · 2024-05-08T07:52:59Z

Alright that is my dopamine hit of the day done, back to typescript :)

from collections import deque

# ordered by appearance in string. This is the input
trigram_size = [10, 20, 10, 15, 30, 40]
assert len(trigram_size) > 3

best = (float('inf'), -1, -1)

# For the last 3 positions, what the minimum size is before it (including
# itself) and where that occurred. We use a deque here, but you can do some
# modulo math here instead since it has a fixed size.
window = deque([ (-1, float('inf')) ] * 3)
for i, size in enumerate(trigram_size):
    # Compute what the best answer (so far) is using this trigram and see if
    # it beats the best. If it ties we take it maximise distance.
    j, j_min = window.popleft()
    value = (j_min + size, j, i)
    if value <= best:
        best = value

    # We now append the smallest value seen so far taking into account this
    # trigram
    smallest_idx, smallest_size = window[-1]
    if size < smallest_size:
        smallest_idx, smallest_size = i, size
    window.append((smallest_idx, smallest_size))

print(f'minimum overlapping trigram pair has combined size {best[0]} at positions {best[1]} ({trigram_size[best[1]]}) and {best[2]} ({trigram_size[best[2]]})')

jtibshirani · 2024-05-08T15:31:01Z

Thanks for looking! I agree with your problem description and algorithm. Let's give that a try and I'll see if I can get it fast enough. I believe we can totally replace minFrequencyNgramOffsets with the new algorithm, which is a good simplification.

stefanhengl · 2024-05-10T13:15:37Z

It's common for [the two least frequent trigrams] to overlap

Do you know roughly how common?

This change shifts the first and last trigrams to avoid overlap, which is guaranteed to result in a smaller intersection.

I understand this is the case if we have full overlap, but is that true for partial overlap, too?

stefanhengl · 2024-05-10T13:30:03Z

Did you check whether simple heuristics with O(1) runtime might be good enough? We could fall back to the first and last trigram if the original trigrams overlap.

jtibshirani · 2024-05-10T15:54:31Z

Do you know roughly how common?

I do not. I just noticed it frequently when I was looking at natural-language style searches, which contain common keywords. I guess it would be possible to test this.

I understand this is the case if we have full overlap, but is that true for partial overlap, too?

Yes, this is true for partial overlap. This is because we create a "covering string". Let's say your original trigram intersection is abcde. The set of all lines containing abcdef (or any other covering string) is strictly less than or equal to abcde.

Did you check whether simple heuristics with O(1) runtime might be good enough? We could fall back to the first and last trigram if the original trigrams overlap.

Yes I tried this, and from the examples I looked at it could make things substantially worse. I liked the approach in this PR because it is guaranteed to result in a smaller intersection than what we currently do.

jtibshirani · 2024-05-13T19:57:08Z

@keegancsmith I'd like to go with this change instead of implementing the alternate algorithm right now. I like that this change is simple and guarantees a strictly smaller candidate set than what we currently do. So marked it ready for review!

jtibshirani · 2024-05-13T19:59:26Z

indexdata.go

+	}
+	return
+}
+
 func minFrequencyNgramOffsets(ngramOffs []runeNgramOff, frequencies []uint32) (first, last runeNgramOff) {
 	firstI, lastI := min2Index(frequencies)
 	// If the frequencies are equal lets maximise distance in the query


In a follow-up, I'd like to try removing the "maximize distance" heuristic. I looked at some examples, and it's not clear to me that maximizing distance between "AAA" and "AAA" is better than filtering on "AAAAAA". I wonder if this was important before just because we commonly had overlapping trigrams. So before we would have a lot of overlap like "AAAA", whereas with my change this isn't a problem.

Yeah this optimization is likely unnecessary. I think it is only for the same trigram, and in that case I do think it makes a difference. But this seems like a very rare thing to happen and in the case it does happen it doesn't give us that much better perf.

For the examples of being different trigrams I don't see it making a difference that is worth the complexity.

keegancsmith · 2024-05-14T07:07:20Z

indexdata.go

+	}
+	return
+}
+
 func minFrequencyNgramOffsets(ngramOffs []runeNgramOff, frequencies []uint32) (first, last runeNgramOff) {
 	firstI, lastI := min2Index(frequencies)
 	// If the frequencies are equal lets maximise distance in the query


Yeah this optimization is likely unnecessary. I think it is only for the same trigram, and in that case I do think it makes a difference. But this seems like a very rare thing to happen and in the case it does happen it doesn't give us that much better perf.

For the examples of being different trigrams I don't see it making a difference that is worth the complexity.

Follow up to #779. This PR removes the logic for trigrams with the same frequency, because it will no longer have a big effect.

Avoid overlapping trigrams in distanceHitIterator

a35ccdc

cla-bot bot added the cla-signed label May 7, 2024

mmanela requested review from keegancsmith and stefanhengl May 7, 2024 20:30

jtibshirani marked this pull request as ready for review May 13, 2024 19:12

jtibshirani commented May 13, 2024

View reviewed changes

keegancsmith approved these changes May 14, 2024

View reviewed changes

stefanhengl approved these changes May 14, 2024

View reviewed changes

jtibshirani merged commit fe8f2a3 into main May 14, 2024
9 checks passed

jtibshirani deleted the jtibs/trigrams branch May 14, 2024 17:00

jtibshirani mentioned this pull request May 14, 2024

Simplify trigram selection in distanceHitIterator #782

Merged

jtibshirani added a commit that referenced this pull request May 15, 2024

Simplify trigram selection in distanceHitIterator (#782)

df7a7e7

Follow up to #779. This PR removes the logic for trigrams with the same frequency, because it will no longer have a big effect.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid overlapping trigrams in distanceHitIterator #779

Avoid overlapping trigrams in distanceHitIterator #779

jtibshirani commented May 7, 2024 •

edited

Loading

jtibshirani commented May 7, 2024 •

edited

Loading

keegancsmith commented May 8, 2024

keegancsmith commented May 8, 2024 •

edited

Loading

jtibshirani commented May 8, 2024 •

edited

Loading

stefanhengl commented May 10, 2024

stefanhengl commented May 10, 2024

jtibshirani commented May 10, 2024 •

edited

Loading

jtibshirani commented May 13, 2024

jtibshirani May 13, 2024 •

edited

Loading

keegancsmith May 14, 2024

keegancsmith May 14, 2024

Avoid overlapping trigrams in distanceHitIterator #779

Avoid overlapping trigrams in distanceHitIterator #779

Conversation

jtibshirani commented May 7, 2024 • edited Loading

jtibshirani commented May 7, 2024 • edited Loading

repo:sourcegraph/sourcegraph monitors

repo:sourcegraph/sourcegraph server

repo:sourcegraph/sourcegraph testing

keegancsmith commented May 8, 2024

keegancsmith commented May 8, 2024 • edited Loading

jtibshirani commented May 8, 2024 • edited Loading

stefanhengl commented May 10, 2024

stefanhengl commented May 10, 2024

jtibshirani commented May 10, 2024 • edited Loading

jtibshirani commented May 13, 2024

jtibshirani May 13, 2024 • edited Loading

Choose a reason for hiding this comment

keegancsmith May 14, 2024

Choose a reason for hiding this comment

keegancsmith May 14, 2024

Choose a reason for hiding this comment

jtibshirani commented May 7, 2024 •

edited

Loading

jtibshirani commented May 7, 2024 •

edited

Loading

`repo:sourcegraph/sourcegraph monitors`

`repo:sourcegraph/sourcegraph server`

`repo:sourcegraph/sourcegraph testing`

keegancsmith commented May 8, 2024 •

edited

Loading

jtibshirani commented May 8, 2024 •

edited

Loading

jtibshirani commented May 10, 2024 •

edited

Loading

jtibshirani May 13, 2024 •

edited

Loading