-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ranking: include filename matches in bm25 #757
Conversation
@@ -127,14 +130,24 @@ func (d *indexData) scoreFileUsingBM25(fileMatch *FileMatch, doc uint32, cands [ | |||
termFreqs := map[string]int{} | |||
for _, cand := range cands { | |||
term := string(cand.substrLowered) | |||
termFreqs[term]++ | |||
|
|||
if cand.fileName { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In text search, it's super common to "boost" matches on document title. The same thing applies here, as a filename match is strong signal. The choice of 2 is quite conservative, and means that a match on filename counts for two matches on content. I tried a few choices and didn't see evidence that a larger number would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
This improves slightly on our "golden queries" set. Before, for the query "owner for wildcard library" we didn't find client/wildcard/OWNERS. With this change, we now find it. Although the change doesn't make a huge difference on this eval set, I think it's worth it just for correctness/ consistency of our BM25 implementation. |
termFreqs[term] += 2 | ||
} else { | ||
termFreqs[term]++ | ||
} | ||
} | ||
|
||
// Compute the file length ratio. Usually the calculation would be based on terms, but using | ||
// bytes should work fine, as we're just computing a ratio. | ||
fileLength := float64(d.boundaries[doc+1] - d.boundaries[doc]) | ||
numFiles := len(d.boundaries) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have to guard against numFiles=0
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should never happen -- if you had an empty shard, there's no possibility to match and score a doc?
@@ -127,14 +130,24 @@ func (d *indexData) scoreFileUsingBM25(fileMatch *FileMatch, doc uint32, cands [ | |||
termFreqs := map[string]int{} | |||
for _, cand := range cands { | |||
term := string(cand.substrLowered) | |||
termFreqs[term]++ | |||
|
|||
if cand.fileName { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final
FileMatch
. We retain filename matches when creatingcandidateMatch
, which lets BM25 scoring use them. Even without the betterBM25 scoring, I think this refactor makes the code easier to follow.