ranking: add IDF to BM25 score calculation #788
So far, we didn't include IDF in our BM25 score function. Zoekt uses a trigram index and hence doesn't compute document frequency during indexing. We could add this information to the index, but it is not immediately obvious how to tokenize code in a way that is compatible with tokens from a natural language query.

Here we calculate the document frequency at query time under the assumption that we visit all documents containing any of the query terms.

Test plan:
- Updated unit test
- Context evaluation improved from 60/89 to 63/89
[Images: golden queries evals, before and after]
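The query-time approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `doc` type, the function names, the parameter values `k = 1.2` and `b = 0.75`, and the Lucene-style smoothed IDF formula are all assumptions made for the example. It only shows the idea of counting document frequency over the visited candidate documents first, then weighting each term's contribution by IDF.

```go
package main

import "math"

// Hypothetical BM25 parameters (common defaults, not taken from the PR).
const (
	k = 1.2
	b = 0.75
)

// doc is a hypothetical stand-in for a candidate file: its length and the
// frequency of each query term that matched in it.
type doc struct {
	length float64
	tfs    map[string]int
}

// documentFrequencies counts, for each term, how many of the visited
// documents contain it. This is only a valid document frequency if every
// document containing any query term is visited.
func documentFrequencies(docs []doc) map[string]int {
	df := make(map[string]int)
	for _, d := range docs {
		for term, freq := range d.tfs {
			if freq > 0 {
				df[term]++
			}
		}
	}
	return df
}

// idf is the smoothed, always-positive IDF variant popularized by Lucene.
// Assumption: the PR may use a different form.
func idf(totalDocs, df int) float64 {
	n, d := float64(totalDocs), float64(df)
	return math.Log(1.0 + (n-d+0.5)/(d+0.5))
}

// scoreDoc sums the IDF-weighted BM25 term scores for one document.
// L is the document length normalized by the average document length.
func scoreDoc(d doc, df map[string]int, totalDocs int, avgLength float64) float64 {
	L := d.length / avgLength
	score := 0.0
	for term, freq := range d.tfs {
		tf := float64(freq)
		score += idf(totalDocs, df[term]) * ((k + 1.0) * tf) / (k*(1.0-b+b*L) + tf)
	}
	return score
}
```

A caller would run `documentFrequencies` over all candidates before scoring any of them, which is why scoring has to wait until every matching document has been seen. With equal lengths and term frequencies, a document matching a term that appears in fewer candidates scores higher than one matching a more common term.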
This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement. I'd love to see these evals in particular:
score.go (outdated)
tf := float64(freq)
sumTf += tf
score += ((k + 1.0) * tf) / (k*(1.0-b+b*L) + tf)

// Invariant: the keys of df are the union of the keys of tfs over all files.
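For reference, the per-term expression in the snippet above is the term-frequency part of BM25. A sketch of how an IDF weight typically combines with it is given below. The IDF form shown is the common Lucene-style smoothed variant, which is an assumption here since the snippet does not show which form the PR uses. $N$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $L$ is assumed to be the file length divided by the average file length.

```latex
\mathrm{score}(d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{(k + 1)\,\mathrm{tf}_{t,d}}{k\,(1 - b + b\,L) + \mathrm{tf}_{t,d}},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right)
```

Under this form, a term missing from df would be treated as $\mathrm{df}(t) = 0$ and receive the largest possible IDF, which is one reason the invariant matters.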
As we discussed, this is tricky! I think for now we should loudly document this on the UseBM25Scoring option, so users know how it works.
In a follow-up, I'd love to introduce a new Zoekt query type like "TextQuery" that takes a list of terms, creates a disjunction, and applies BM25. Then users could only use BM25 with this query, and not accidentally use it with other types.
see above
[Image: golden queries evals, this PR]
[Images: CodeSearchNet evals, before and after]
Added an idea to avoid the public API changes.
Nice! Thanks for all the iterations.
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calculate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan: