BM25: Boost file name matches at root #792
With this change we prioritize file name matches at the root of the repository. This is based on the intuition that more important files tend to be closer to the root.

We also change the parameter b in the BM25 scoring function from 0.75 to 0.3 to reduce the impact of document length on the final score. This is based on experiments that showed that our current scoring overly penalizes long but important documents.

For example, we consider documents such as a README.md or CHANGELOG at the root of the repository to be of high quality. However, these documents also tend to be relatively long and are thus penalized.

Test plan: Updated unit test
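
For reference, the role of b is easiest to see in the textbook form of BM25. This is the standard formulation and not necessarily the exact expression used in this codebase. Here $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $k_1$ is the saturation parameter:

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

With $b = 0$ the length ratio drops out entirely. Lowering $b$ from 0.75 to 0.3 shrinks the weight of $|D| / \mathrm{avgdl}$ in the denominator, so a long document loses less score for the same term frequency.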
We discussed offline how tweaking b to 0.3 feels a bit too manual. Maybe we could try this:
- Try k=0.9, b=0.4 (reference). Run this on its own, as we're generally curious if this improves evals.
- Also try these parameters together with this boosting change.
- Also try setting b=0 to disable length penalization. This is fairly principled, as research has shown that BM25 degrades badly when documents are very long.

If none of this works, I feel we should drop this because we risk overfitting to a few use cases :)
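
To make the suggested settings concrete, here is a minimal Go sketch that compares how each one treats a long document against an average-length one. It assumes the textbook BM25 term formula without the IDF factor. The function `bm25Term`, the baseline `k1 = 1.2`, and the document lengths are illustrative assumptions and are not taken from this PR.

```go
package main

import "fmt"

// bm25Term returns the BM25 contribution of a single query term, without IDF.
// tf is the term frequency, docLen and avgLen are lengths in the same unit,
// k1 controls term-frequency saturation and b controls length normalization.
// Hypothetical helper for illustration; not the project's scoring code.
func bm25Term(tf, docLen, avgLen, k1, b float64) float64 {
	norm := 1 - b + b*(docLen/avgLen)
	return tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	settings := []struct {
		name  string
		k1, b float64
	}{
		// k1 = 1.2 is an assumed baseline; the PR only states b = 0.75.
		{"baseline b=0.75", 1.2, 0.75},
		{"PR b=0.3", 1.2, 0.3},
		{"k=0.9, b=0.4", 0.9, 0.4},
		{"b=0 (no length penalty)", 1.2, 0},
	}

	const tf, avgLen = 3.0, 100.0
	for _, s := range settings {
		avgDoc := bm25Term(tf, avgLen, avgLen, s.k1, s.b)
		longDoc := bm25Term(tf, 10*avgLen, avgLen, s.k1, s.b)
		// A ratio of 1 means the long document is not penalized at all.
		fmt.Printf("%-26s long/avg score ratio: %.3f\n", s.name, longDoc/avgDoc)
	}
}
```

The ratio isolates the length penalty for a fixed term frequency. It says nothing about ranking quality, which only the evals can judge.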
boostFileName := 2
var evaluated bool
Why do we need this evaluated flag? Could we just do this?

    if cand.fileName {
        boostFileName := 2
        if isAtRoot(fm.FileName) {
            boostFileName = 5
        }
        termFreqs[term] += boostFileName
    } else {
        termFreqs[term]++
    }
We could, but we only need to calculate isAtRoot once because all candidates belong to the same file. There might be candidates that match different parts of the filename, so in your version we might end up calling isAtRoot more than we need to.
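
One way to keep the single isAtRoot call without a flag is to pick the boost once, before iterating over the candidates. The sketch below is self-contained and rests on assumptions: the `candidate` type, the `termFrequencies` function, and this `isAtRoot` implementation are hypothetical stand-ins, and only the boost values 2 and 5 come from the snippet above.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical stand-in for a match candidate within one file.
type candidate struct {
	term     string
	fileName bool // true when the match is in the file name, not the content
}

// isAtRoot reports whether path has no directory component.
// Assumed implementation; the real helper may differ.
func isAtRoot(path string) bool {
	return !strings.Contains(path, "/")
}

// termFrequencies counts term matches for one file, boosting file name matches.
// The boost is chosen once per file because all candidates share the same path.
func termFrequencies(fileName string, cands []candidate) map[string]int {
	boostFileName := 2
	if isAtRoot(fileName) {
		boostFileName = 5
	}

	termFreqs := make(map[string]int)
	for _, cand := range cands {
		if cand.fileName {
			termFreqs[cand.term] += boostFileName
		} else {
			termFreqs[cand.term]++
		}
	}
	return termFreqs
}

func main() {
	cands := []candidate{{term: "readme", fileName: true}, {term: "readme", fileName: false}}
	fmt.Println(termFrequencies("README.md", cands))
	fmt.Println(termFrequencies("docs/README.md", cands))
}
```

The trade-off is that isAtRoot now runs even when no candidate matches the file name, whereas the evaluated flag defers the call until the first file name match.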
Looking at the evaluations, I don't think our current evaluation data set supports any of the changes. However, I think our evaluation data set might lack questions that highlight the problem of overly long files, which would explain why we don't see any benefits here. For now, I think we should drop the PR.
Thanks for doing these evals! I'm +1 to closing and revisiting when we have more confidence in evals.
Note: In previous experiments I saw that lowering b led to higher recall in our context evaluation. However, I could not reproduce this. This change slightly shifts the numbers but doesn't change the overall score.