Avoid overlapping trigrams in distanceHitIterator #779
I noticed this while pondering how we could approximate IDF in BM25 scoring. Here are example searches and the stats before/ after. Looking at
|
Drive by, will review properly later. This is a great insight!! Totally worth potentially reading more index data. My instincts tell me this will make our fast queries slightly slower, but our slower queries a lot faster. So totally worth it. My mind now jumps to algorithms here, and it seems instead we should be finding the minimum of |
Alright that is my dopamine hit of the day done, back to typescript :)

from collections import deque

# Ordered by appearance in string. This is the input.
trigram_size = [10, 20, 10, 15, 30, 40]
assert len(trigram_size) > 3

best = (float('inf'), -1, -1)

# For the last 3 positions, what the minimum size is before it (including
# itself) and where that occurred. We use a deque here, but you can do some
# modulo math here instead since it has a fixed size.
window = deque([(-1, float('inf'))] * 3)

for i, size in enumerate(trigram_size):
    # Compute what the best answer (so far) is using this trigram and see if
    # it beats the best. On a tie in combined size the tuple comparison falls
    # through to the positions; the intent is to maximise distance.
    j, j_min = window.popleft()
    value = (j_min + size, j, i)
    if value <= best:
        best = value

    # We now append the smallest value seen so far, taking into account this
    # trigram.
    smallest_idx, smallest_size = window[-1]
    if size < smallest_size:
        smallest_idx, smallest_size = i, size
    window.append((smallest_idx, smallest_size))

print(f'minimum non-overlapping trigram pair has combined size {best[0]} at positions {best[1]} ({trigram_size[best[1]]}) and {best[2]} ({trigram_size[best[2]]})')
|
Thanks for looking! I agree with your problem description and algorithm. Let's give that a try and I'll see if I can get it fast enough. I believe we can totally replace |
Do you know roughly how common?
I understand this is the case if we have full overlap, but is that true for partial overlap, too? |
Did you check whether simple heuristics with O(1) runtime might be good enough? We could fall back to the first and last trigram if the original trigrams overlap. |
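For reference, the fallback heuristic suggested here could be sketched roughly as follows. This is only an illustration: `pick_pair_with_fallback` is a hypothetical name, trigrams are assumed to be 3 characters wide and indexed by their offset in the query, and none of this is taken from the zoekt code.

```python
def pick_pair_with_fallback(frequencies):
    """Pick the two least frequent trigrams; if they overlap, fall back to
    the first and last trigram of the query.

    Hypothetical sketch. Assumes len(frequencies) >= 4 so that the first and
    last trigrams do not overlap each other.
    """
    n = len(frequencies)
    # Indices of the two smallest frequencies, returned in query order.
    two_smallest = sorted(range(n), key=lambda k: frequencies[k])[:2]
    first, last = sorted(two_smallest)
    # Trigrams whose offsets are fewer than 3 apart share at least one character.
    if last - first < 3:
        return 0, n - 1
    return first, last


print(pick_pair_with_fallback([10, 20, 10, 15, 30, 40]))
print(pick_pair_with_fallback([5, 50, 50, 7, 50, 50]))
```

On the sample sizes from the sliding-window snippet above, the two smallest trigrams (positions 0 and 2) overlap, so the fallback picks positions 0 and 5 with a combined size of 50, while the best non-overlapping pair is positions 0 and 3 with a combined size of 25.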
I do not. I just noticed it frequently when I was looking at natural-language style searches, which contain common keywords. I guess it would be possible to test this.
Yes, this is true for partial overlap. This is because we create a "covering string". Let's say your original trigram intersection is
Yes I tried this, and from the examples I looked at it could make things substantially worse. I liked the approach in this PR because it is guaranteed to result in a smaller intersection than what we currently do. |
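The covering-string argument can be illustrated with a toy example. Everything below is hypothetical (the helper `pair_candidates`, the sample text and the query); it is not how zoekt computes candidates. It only checks that replacing an overlapping trigram pair with an adjacent, non-overlapping pair that still covers the same characters can only shrink the candidate set.

```python
def pair_candidates(text, query, i, j):
    """Positions p where the query could start, judged only by the trigrams
    at query offsets i and j appearing at the right relative distance."""
    first, last = query[i:i + 3], query[j:j + 3]
    return {
        p
        for p in range(len(text))
        if text[p + i:p + i + 3] == first and text[p + j:p + j + 3] == last
    }


text = "needle feedlot needle"
query = "needle"

# Overlapping pair "eed" + "edl": together they only cover "eedl".
overlapping = pair_candidates(text, query, 1, 2)
# Shifted pair "nee" + "dle": adjacent trigrams that cover all of "needle".
shifted = pair_candidates(text, query, 0, 3)

print(sorted(overlapping))  # [0, 7, 15]
print(sorted(shifted))      # [0, 15]
assert shifted <= overlapping
```

The overlapping pair also matches inside "feedlot", because any text containing "eedl" satisfies it. The shifted pair requires the longer string "needle", so every candidate it produces is also a candidate of the overlapping pair.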
@keegancsmith I'd like to go with this change instead of implementing the alternate algorithm right now. I like that this change is simple and guarantees a strictly smaller candidate set than what we currently do. So marked it ready for review! |
	}
	return
}

func minFrequencyNgramOffsets(ngramOffs []runeNgramOff, frequencies []uint32) (first, last runeNgramOff) {
	firstI, lastI := min2Index(frequencies)
	// If the frequencies are equal lets maximise distance in the query
In a follow-up, I'd like to try removing the "maximize distance" heuristic. I looked at some examples, and it's not clear to me that maximizing distance between "AAA" and "AAA" is better than filtering on "AAAAAA". I wonder if this was important before just because we commonly had overlapping trigrams. So before we would have a lot of overlap like "AAAA", whereas with my change this isn't a problem.
Yeah, this optimization is likely unnecessary. I think it only applies when both are the same trigram, and in that case I do think it makes a difference. But this seems like a very rare thing to happen, and when it does happen it doesn't give us that much better perf.
For the examples with different trigrams, I don't see it making a difference that is worth the complexity.
Follow up to #779. This PR removes the logic for trigrams with the same frequency, because it will no longer have a big effect.
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
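
One way to picture the shift described in this commit message is sketched below. It is an assumption-laden illustration, not the code from this PR: `shift_apart` is a hypothetical helper, trigrams are identified by their offset in the query, and the rule for choosing among several valid shifts (lowest combined frequency) is invented for the example.

```python
def shift_apart(first, last, frequencies):
    """Move an overlapping trigram pair outward until the two trigrams are
    exactly adjacent (offsets 3 apart), so they still cover every character
    the original pair covered.

    Hypothetical sketch; expects first <= last as offsets into the query.
    """
    n = len(frequencies)
    if last - first >= 3 or n < 4:
        # Already non-overlapping, or the query is too short to separate them.
        return first, last
    # Valid new offsets k for the first trigram: k <= first keeps the original
    # start covered, k + 3 >= last keeps the original end covered, and both
    # k and k + 3 must be valid trigram offsets.
    lo = max(0, last - 3)
    hi = min(first, n - 4)
    new_first = min(
        range(lo, hi + 1),
        key=lambda k: frequencies[k] + frequencies[k + 3],
    )
    return new_first, new_first + 3


sizes = [10, 20, 10, 15, 30, 40]
print(shift_apart(0, 2, sizes))  # (0, 3)
print(shift_apart(2, 3, sizes))  # (0, 3)
print(shift_apart(0, 4, sizes))  # (0, 4), already non-overlapping
```

Because the shifted pair covers a superset of the characters covered by the original pair, any position that matches the shifted pair also matches the original one, which is the basis for the smaller-intersection guarantee.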