Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid overlapping trigrams in distanceHitIterator #779

Merged
merged 1 commit into from
May 14, 2024
Merged

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented May 7, 2024

We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.

@cla-bot cla-bot bot added the cla-signed label May 7, 2024
@jtibshirani
Copy link
Member Author

jtibshirani commented May 7, 2024

I noticed this while pondering how we could approximate IDF in BM25 scoring. Here are example searches and the stats before/ after. Looking at FilesLoaded, it can decrease as much as 25%. IndexBytesLoaded increases a little, but this is an okay trade-off, especially since it should be a 'warm' data structure.

repo:sourcegraph/sourcegraph monitors

stats:&{ContentBytesLoaded:12663867 IndexBytesLoaded:52663 Crashes:0 Duration:0s FileCount:150 ShardFilesConsidered:0 FilesConsidered:614 FilesLoaded:611 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:1002 NgramMatches:9364 NgramLookups:140 Wait:12.875µs MatchTreeConstruction:125.25µs MatchTreeSearch:15.413708ms RegexpsConsidered:0 FlushReason:final_flush}

stats:&{ContentBytesLoaded:9884794 IndexBytesLoaded:103460 Crashes:0 Duration:0s FileCount:150 ShardFilesConsidered:0 FilesConsidered:519 FilesLoaded:516 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:1002 NgramMatches:7454 NgramLookups:140 Wait:6.875µs MatchTreeConstruction:2.005ms MatchTreeSearch:69.575334ms RegexpsConsidered:0 FlushReason:timer_expired}

repo:sourcegraph/sourcegraph server

stats: &{ContentBytesLoaded:44493393 IndexBytesLoaded:159535 Crashes:0 Duration:0s FileCount:2366 ShardFilesConsidered:0 FilesConsidered:2744 FilesLoaded:2603 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:26106 NgramMatches:28695 NgramLookups:104 Wait:12.624µs MatchTreeConstruction:96.417µs MatchTreeSearch:64.717125ms RegexpsConsidered:0 FlushReason:timer_expired}

stats:&{ContentBytesLoaded:41522945 IndexBytesLoaded:423944 Crashes:0 Duration:0s FileCount:2370 ShardFilesConsidered:0 FilesConsidered:2370 FilesLoaded:2228 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:26106 NgramMatches:26540 NgramLookups:112 Wait:7.167µs MatchTreeConstruction:1.040542ms MatchTreeSearch:98.6855ms RegexpsConsidered:0 FlushReason:timer_expired}

repo:sourcegraph/sourcegraph testing

stats:&{ContentBytesLoaded:51781859 IndexBytesLoaded:165600 Crashes:0 Duration:0s FileCount:2594 ShardFilesConsidered:0 FilesConsidered:3528 FilesLoaded:3471 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:19119 NgramMatches:23980 NgramLookups:144 Wait:3.209µs MatchTreeConstruction:56.5µs MatchTreeSearch:63.062666ms RegexpsConsidered:0 FlushReason:final_flush}

stats:&{ContentBytesLoaded:30910204 IndexBytesLoaded:269406 Crashes:0 Duration:0s FileCount:2594 ShardFilesConsidered:0 FilesConsidered:2603 FilesLoaded:2542 FilesSkipped:0 ShardsScanned:1 ShardsSkipped:0 ShardsSkippedFilter:0 MatchCount:19131 NgramMatches:19372 NgramLookups:144 Wait:11.25µs MatchTreeConstruction:170.167µs MatchTreeSearch:52.275958ms RegexpsConsidered:0 FlushReason:final_flush}

@keegancsmith
Copy link
Member

Drive by, will review properly later. This is a great insight!! Totally worth potentially reading more index data. My instincts tell me this will make our fast queries slightly slower, but our slower queries a lot faster. So totally worth it.

My mind now jumps to algorithms here, and it seems instead we should be finding the minimum of f(x) + f(y) such that y-x >= 3. This can be implemented in O(n) time and O(1) space. Later I can sketch out the algorithm but gotta do a school run now :)

@keegancsmith
Copy link
Member

keegancsmith commented May 8, 2024

Alright that is my dopamine hit of the day done, back to typescript :)

from collections import deque

# ordered by appearance in string. This is the input
trigram_size = [10, 20, 10, 15, 30, 40]
assert len(trigram_size) > 3

best = (float('inf'), -1, -1)

# For the last 3 positions, what the minimum size is before it (including
# itself) and where that occurred. We use a deque here, but you can do some
# modulo math here instead since it has a fixed size.
window = deque([ (-1, float('inf')) ] * 3)
for i, size in enumerate(trigram_size):
    # Compute what the best answer (so far) is using this trigram and see if
    # it beats the best. If it ties we take it maximise distance.
    j, j_min = window.popleft()
    value = (j_min + size, j, i)
    if value <= best:
        best = value

    # We now append the smallest value seen so far taking into account this
    # trigram
    smallest_idx, smallest_size = window[-1]
    if size < smallest_size:
        smallest_idx, smallest_size = i, size
    window.append((smallest_idx, smallest_size))

print(f'minimum overlapping trigram pair has combined size {best[0]} at positions {best[1]} ({trigram_size[best[1]]}) and {best[2]} ({trigram_size[best[2]]})')

@jtibshirani
Copy link
Member Author

jtibshirani commented May 8, 2024

Thanks for looking! I agree with your problem description and algorithm. Let's give that a try and I'll see if I can get it fast enough. I believe we can totally replace minFrequencyNgramOffsets with the new algorithm, which is a good simplification.

@stefanhengl
Copy link
Member

It's common for [the two least frequent trigrams] to overlap

Do you know roughly how common?

This change shifts the first and last trigrams to avoid overlap, which is guaranteed to result in a smaller intersection.

I understand this is the case if we have full overlap, but is that true for partial overlap, too?

@stefanhengl
Copy link
Member

Did you check whether simple heuristics with O(1) runtime might be good enough? We could fall back to the first and last trigram if the original trigrams overlap.

@jtibshirani
Copy link
Member Author

jtibshirani commented May 10, 2024

Do you know roughly how common?

I do not. I just noticed it frequently when I was looking at natural-language style searches, which contain common keywords. I guess it would be possible to test this.

I understand this is the case if we have full overlap, but is that true for partial overlap, too?

Yes, this is true for partial overlap. This is because we create a "covering string". Let's say your original trigram intersection is abcde. The set of all lines containing abcdef (or any other covering string) is strictly less than or equal to abcde.

Did you check whether simple heuristics with O(1) runtime might be good enough? We could fall back to the first and last trigram if the original trigrams overlap.

Yes I tried this, and from the examples I looked at it could make things substantially worse. I liked the approach in this PR because it is guaranteed to result in a smaller intersection than what we currently do.

@jtibshirani jtibshirani marked this pull request as ready for review May 13, 2024 19:12
@jtibshirani
Copy link
Member Author

@keegancsmith I'd like to go with this change instead of implementing the alternate algorithm right now. I like that this change is simple and guarantees a strictly smaller candidate set than what we currently do. So marked it ready for review!

}
return
}

func minFrequencyNgramOffsets(ngramOffs []runeNgramOff, frequencies []uint32) (first, last runeNgramOff) {
firstI, lastI := min2Index(frequencies)
// If the frequencies are equal lets maximise distance in the query
Copy link
Member Author

@jtibshirani jtibshirani May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up, I'd like to try removing the "maximize distance" heuristic. I looked at some examples, and it's not clear to me that maximizing distance between "AAA" and "AAA" is better than filtering on "AAAAAA". I wonder if this was important before just because we commonly had overlapping trigrams. So before we would have a lot of overlap like "AAAA", whereas with my change this isn't a problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this optimization is likely unnecessary. I think it is only for the same trigram, and in that case I do think it makes a difference. But this seems like a very rare thing to happen and in the case it does happen it doesn't give us that much better perf.

For the examples of being different trigrams I don't see it making a difference that is worth the complexity.

}
return
}

func minFrequencyNgramOffsets(ngramOffs []runeNgramOff, frequencies []uint32) (first, last runeNgramOff) {
firstI, lastI := min2Index(frequencies)
// If the frequencies are equal lets maximise distance in the query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this optimization is likely unnecessary. I think it is only for the same trigram, and in that case I do think it makes a difference. But this seems like a very rare thing to happen and in the case it does happen it doesn't give us that much better perf.

For the examples of being different trigrams I don't see it making a difference that is worth the complexity.

@jtibshirani jtibshirani merged commit fe8f2a3 into main May 14, 2024
9 checks passed
@jtibshirani jtibshirani deleted the jtibs/trigrams branch May 14, 2024 17:00
jtibshirani added a commit that referenced this pull request May 15, 2024
Follow up to #779. This PR removes the logic for trigrams with the same
frequency, because it will no longer have a big effect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants