BM25: Boost file name matches at root #792

stefanhengl · 2024-06-13T12:56:03Z

With this change we prioritize file name matches at the root of the repository. This is based on the intuition that more important files tend to be closer to the root.

We also change the parameter b in the BM25 scoring function from 0.75 to 0.3 to reduce the impact of the document length on the final score. This is based on experiments that showed that our current scoring overly penalizes long but important documents.

For example, we consider documents such as a README.md or CHANGELOG at the root of the repository of high quality. However, these documents also tend to be relatively long and are thus penalized.

Note:
In previous experiments I saw that lowering b led to higher recall in our context evaluation. However I could not reproduce this. This change slightly shifts the numbers but doesn't change the overall score.

Test plan:

Updated unit test

With this change we prioritize file name matches at the root of the repository. This is based on the intuition that more important files tend to be closer to the root. We also change the parameter b in the BM25 scoring function from 0.75 to 0.3 to reduce the impact of the document length on the final score. This is based on experiments that showed that our current scoring overly penalizes long but important documents. For example, we consider documents such as a README.md or CHANGELOG at the root of the repository of high quality. However, these documents also tend to be relatively long and are thus penalized. Test plan: Updated unit test

stefanhengl · 2024-06-13T12:57:00Z

Sourcegraph@f514a7d3bc080ddfa25b81818aa93089dd46a692

Breakdown by class:
Find symbol     10/11
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      35/47
Gather information      12/16
Changelog       0/2
Ownership       1/2
How-to  2/2
Foreign language        0/3
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  22/28
Typescript      28/38
C++     1/3
Markdown        7/11
Graphql 1/1
JSON    1/2
Go      2/4
Codeowners      1/2
Python  1/1
Shell   1/1

Combined recall 69/96

jtibshirani

We discussed offline how tweaking b to 0.3 feels a bit too manual. Maybe we could try this:

Try k=0.9, b=0.4 (reference). Run this on its own, as we're generally curious if this improves evals.
Also try these parameters together with this boosting change
ALSO try setting b=0 to disable length penalization. This is fairly principled, as research has shown that BM25 degrades badly when documents are very long.

If none of this works, I feel we should drop this because we risk overfitting to a few use cases :)

jtibshirani · 2024-06-13T16:12:25Z

score.go

+
+	boostFileName := 2
+
+	var evaluated bool


Why do we need this evaluated flag? Could we just do this?

if cand.fileName { boostFileName := 2 if isAtRoot(fm.FileName) { boostFileName = 5 } termFreqs[term] += boostFileName } else { termFreqs[term]++ } }

We could, but we only need to calculate isAtRoot once because all candidates belong to the same file. There might be candidates that match different parts of the filename, so in your version we might end up calling isAtRoot more than we need to.

stefanhengl · 2024-06-14T09:48:20Z

We discussed offline how tweaking b to 0.3 feels a bit too manual. Maybe we could try this:

Try k=0.9, b=0.4 (reference). Run this on its own, as we're generally curious if this improves evals.

Also try these parameters together with this boosting change

ALSO try setting b=0 to disable length penalization. This is fairly principled, as research has shown that BM25 degrades badly when documents are very long.

If none of this works, I feel we should drop this because we risk overfitting to a few use cases :)

Looking at the evaluations, I don't think our current evaluation data set supports any of the changes. However, I think our evaluation data set might lack questions that highlight the problem of too long files, which would explain why we don't see any benefits here.

For now, I think we should drop the PR.

Repo	SHA
sourcegraph/sourcegraph	sourcegraph/sourcegraph@e0fcb49
sourcegraph/zoekt (w/o boosting change)	`c21df41`
sourcegraph/zoekt (with boosting change)	`9c75cfa` (this PR)

k	b	Boosting change	Recall	Comment
1.2	0.3	no	67/96
1.2	0.75	no	66/96	Baseline
0.9	0.4	no	65/96
1.2	0.3	yes	65/96
0.9	0.4	yes	64/96
1.2	0.0	no	61/96
0.9	0.0	no	61/96
1.2	0.0	yes	60/96

jtibshirani · 2024-06-14T18:46:53Z

Thanks for doing these evals! I'm +1 to closing and revisiting when we have more confidence in evals.

stefanhengl requested a review from jtibshirani June 13, 2024 12:56

cla-bot bot added the cla-signed label Jun 13, 2024

jtibshirani reviewed Jun 13, 2024

View reviewed changes

stefanhengl closed this Jun 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BM25: Boost file name matches at root #792

BM25: Boost file name matches at root #792

stefanhengl commented Jun 13, 2024 •

edited

Loading

stefanhengl commented Jun 13, 2024

jtibshirani left a comment

jtibshirani Jun 13, 2024

stefanhengl Jun 14, 2024

stefanhengl commented Jun 14, 2024 •

edited

Loading

jtibshirani commented Jun 14, 2024

BM25: Boost file name matches at root #792

BM25: Boost file name matches at root #792

Conversation

stefanhengl commented Jun 13, 2024 • edited Loading

stefanhengl commented Jun 13, 2024

jtibshirani left a comment

Choose a reason for hiding this comment

jtibshirani Jun 13, 2024

Choose a reason for hiding this comment

stefanhengl Jun 14, 2024

Choose a reason for hiding this comment

stefanhengl commented Jun 14, 2024 • edited Loading

jtibshirani commented Jun 14, 2024

stefanhengl commented Jun 13, 2024 •

edited

Loading

stefanhengl commented Jun 14, 2024 •

edited

Loading