Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BM25: Boost file name matches at root #792

Closed
wants to merge 1 commit into from

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Jun 13, 2024

With this change we prioritize file name matches at the root of the repository. This is based on the intuition that more important files tend to be closer to the root.

We also change the parameter b in the BM25 scoring function from 0.75 to 0.3 to reduce the impact of the document length on the final score. This is based on experiments that showed that our current scoring overly penalizes long but important documents.

For example, we consider documents such as a README.md or CHANGELOG at the root of the repository of high quality. However, these documents also tend to be relatively long and are thus penalized.

Note:
In previous experiments I saw that lowering b led to higher recall in our context evaluation. However I could not reproduce this. This change slightly shifts the numbers but doesn't change the overall score.

Test plan:

  • Updated unit test

With this change we prioritize file name matches at the root of the
repository. This is based on the intuition that more important files
tend to be closer to the root.

We also change the parameter b in the BM25 scoring function from 0.75 to
0.3 to reduce the impact of the document length on the final score. This
is based on experiments that showed that our current scoring overly
penalizes long but important documents.

For example, we consider documents such as a README.md or CHANGELOG at
the root of the repository of high quality. However, these documents
also tend to be relatively long and are thus penalized.

Test plan:
Updated unit test
@stefanhengl stefanhengl requested a review from jtibshirani June 13, 2024 12:56
@cla-bot cla-bot bot added the cla-signed label Jun 13, 2024
@stefanhengl
Copy link
Member Author

Sourcegraph@f514a7d3bc080ddfa25b81818aa93089dd46a692

Breakdown by class:
Find symbol     10/11
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      35/47
Gather information      12/16
Changelog       0/2
Ownership       1/2
How-to  2/2
Foreign language        0/3
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  22/28
Typescript      28/38
C++     1/3
Markdown        7/11
Graphql 1/1
JSON    1/2
Go      2/4
Codeowners      1/2
Python  1/1
Shell   1/1

Combined recall 69/96

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline how tweaking b to 0.3 feels a bit too manual. Maybe we could try this:

  1. Try k=0.9, b=0.4 (reference). Run this on its own, as we're generally curious if this improves evals.
  2. Also try these parameters together with this boosting change
  3. ALSO try setting b=0 to disable length penalization. This is fairly principled, as research has shown that BM25 degrades badly when documents are very long.

If none of this works, I feel we should drop this because we risk overfitting to a few use cases :)


boostFileName := 2

var evaluated bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this evaluated flag? Could we just do this?

		if cand.fileName {
                        boostFileName := 2
			if isAtRoot(fm.FileName) {
			       boostFileName = 5
			}
			termFreqs[term] += boostFileName
		} else {
			termFreqs[term]++
		}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but we only need to calculate isAtRoot once because all candidates belong to the same file. There might be candidates that match different parts of the filename, so in your version we might end up calling isAtRoot more than we need to.

@stefanhengl
Copy link
Member Author

stefanhengl commented Jun 14, 2024

We discussed offline how tweaking b to 0.3 feels a bit too manual. Maybe we could try this:

  1. Try k=0.9, b=0.4 (reference). Run this on its own, as we're generally curious if this improves evals.
  2. Also try these parameters together with this boosting change
  3. ALSO try setting b=0 to disable length penalization. This is fairly principled, as research has shown that BM25 degrades badly when documents are very long.

If none of this works, I feel we should drop this because we risk overfitting to a few use cases :)

Looking at the evaluations, I don't think our current evaluation data set supports any of the changes. However, I think our evaluation data set might lack questions that highlight the problem of too long files, which would explain why we don't see any benefits here.

For now, I think we should drop the PR.

Repo SHA
sourcegraph/sourcegraph sourcegraph/sourcegraph@e0fcb49
sourcegraph/zoekt (w/o boosting change) c21df41
sourcegraph/zoekt (with boosting change) 9c75cfa (this PR)
k b Boosting change Recall Comment
1.2 0.3 no 67/96
1.2 0.75 no 66/96 Baseline
0.9 0.4 no 65/96
1.2 0.3 yes 65/96
0.9 0.4 yes 64/96
1.2 0.0 no 61/96
0.9 0.0 no 61/96
1.2 0.0 yes 60/96

@jtibshirani
Copy link
Member

Thanks for doing these evals! I'm +1 to closing and revisiting when we have more confidence in evals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants