Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranking: add IDF to BM25 score calculation #788

Merged
merged 8 commits into from
Jun 10, 2024
Merged

ranking: add IDF to BM25 score calculation #788

merged 8 commits into from
Jun 10, 2024

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Jun 5, 2024

So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.

Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.

Notes:
Also fixed an off-by-1 bug with how we count documents.

Test plan:

  • Updated unit test
  • Context evaluation results are slightly worse with a decrease from 64/89 to 63/89

So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.

Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query term.

Test plan:
- Updated unit test
- Context evaluation improved from 60/89 to 63/89
@stefanhengl stefanhengl requested a review from jtibshirani June 5, 2024 14:02
@cla-bot cla-bot bot added the cla-signed label Jun 5, 2024
score.go Show resolved Hide resolved
@stefanhengl
Copy link
Member Author

stefanhengl commented Jun 5, 2024

golden queries evals

before

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      12/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      27/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 64/89

after

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

@jtibshirani
Copy link
Member

This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.

I'd love to see these evals in particular:

  • Compare to latest golden queries snapshot
  • Compare to golden queries without my latest change to identify and search symbol definitions
  • Run on CodeSearchNet too

score.go Outdated Show resolved Hide resolved
score.go Show resolved Hide resolved
score.go Show resolved Hide resolved
score.go Outdated Show resolved Hide resolved
score.go Outdated
tf := float64(freq)
sumTf += tf
score += ((k + 1.0) * tf) / (k*(1.0-b+b*L) + tf)

// Invariant: the keys of df are the union of the keys of tfs over all files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, this is tricky! I think for now we should loudly document this on the UseBM25Scoring option, so users know how it works.

In a follow-up, I'd love to introduce a new Zoekt query type like "TextQuery" that takes a list of terms, creates a disjunction, and applies BM25. Then users could only use BM25 with this query, and not accidentally use it with other types.

@stefanhengl
Copy link
Member Author

stefanhengl commented Jun 6, 2024

This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.

I'd love to see these evals in particular:

  • Compare to latest golden queries snapshot

see above

  • Compare to golden queries without my latest change to identify and search symbol definitions

golden queries evals

This PR with Sourcegraph@4f465c5, which is the parent commit of your latest change to identify and search symbol definitions, see here

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

@stefanhengl
Copy link
Member Author

stefanhengl commented Jun 6, 2024

CodeSearchNet

Sourcegraph@4f465c5

Before

Recall (files)  91/99
Recall (chunks) 75/99
Average chunk overlap   0.89

After

Recall (files)  89/99
Recall (chunks) 77/99
Average chunk overlap   0.88

@stefanhengl stefanhengl requested a review from jtibshirani June 6, 2024 14:00
api_proto.go Outdated Show resolved Hide resolved
score.go Outdated Show resolved Hide resolved
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an idea to avoid the public api changes.

eval.go Show resolved Hide resolved
eval.go Outdated Show resolved Hide resolved
eval.go Outdated Show resolved Hide resolved
Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for all the iterations.

@stefanhengl stefanhengl merged commit 376af3a into main Jun 10, 2024
9 checks passed
@stefanhengl stefanhengl deleted the sh/idf branch June 10, 2024 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants