ranking: add IDF to BM25 score calculation #788

stefanhengl · 2024-06-05T14:02:21Z

So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.

Here we calculate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
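
A minimal sketch of what computing document frequency at query time could look like, assuming each visited file yields a map from query term to its term frequency. The function name `documentFrequency` and the `tfs` slice are hypothetical and are not taken from the Zoekt code base:

```go
package main

import "fmt"

// documentFrequency counts, for each query term, how many visited files
// contain that term at least once. tfs holds one term-frequency map per file.
// The keys of the result are the union of the keys of all per-file maps.
func documentFrequency(tfs []map[string]int) map[string]int {
	df := make(map[string]int)
	for _, tf := range tfs {
		for term, freq := range tf {
			if freq > 0 {
				df[term]++
			}
		}
	}
	return df
}

func main() {
	// Three hypothetical candidate files and their term frequencies.
	tfs := []map[string]int{
		{"bm25": 2, "score": 1},
		{"score": 4},
		{"bm25": 1},
	}
	df := documentFrequency(tfs)
	fmt.Println(df["bm25"], df["score"])
}
```

The counts are only meaningful if every file containing any of the query terms is visited, which is the assumption stated above.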

Notes:
Also fixed an off-by-1 bug with how we count documents.

Test plan:

Updated unit test
Context evaluation results are slightly worse with a decrease from 64/89 to 63/89

So far, we didn't include IDF in our BM25 score function. Zoekt uses a trigram index and hence doesn't compute document frequency during indexing. We could add this information to the index, but it is not immediately obvious how to tokenize code in a way that is compatible with tokens from a natural language query.

Here we calculate the document frequency at query time under the assumption that we visit all documents containing any of the query terms.

Test plan:
- Updated unit test
- Context evaluation improved from 60/89 to 63/89

score.go

stefanhengl · 2024-06-05T14:20:21Z

golden queries evals

before

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      12/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      27/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 64/89

after

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

jtibshirani · 2024-06-05T17:20:20Z

This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.

I'd love to see these evals in particular:

Compare to latest golden queries snapshot
Compare to golden queries without my latest change to identify and search symbol definitions
Run on CodeSearchNet too

jtibshirani · 2024-06-05T17:25:22Z

score.go

 		tf := float64(freq)
 		sumTf += tf
-		score += ((k + 1.0) * tf) / (k*(1.0-b+b*L) + tf)
+
+		// Invariant: the keys of df are the union of the keys of tfs over all files.


As we discussed, this is tricky! I think for now we should loudly document this on the UseBM25Scoring option, so users know how it works.

In a follow-up, I'd love to introduce a new Zoekt query type like "TextQuery" that takes a list of terms, creates a disjunction, and applies BM25. Then users could only use BM25 with this query, and not accidentally use it with other types.
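
For reference, here is a rough sketch of how the term-frequency expression from the diff above might be combined with an IDF factor. The IDF formula used here is the common smoothed variant `log(1 + (N - df + 0.5) / (df + 0.5))`, and `k = 1.2`, `b = 0.75` are the usual textbook defaults; both are assumptions, as are the names `idf` and `bm25` and the reading of `L` as the ratio of file length to average file length. This is not the code from this PR:

```go
package main

import (
	"fmt"
	"math"
)

// Assumed BM25 parameters (common defaults, not confirmed by the PR).
const (
	k = 1.2
	b = 0.75
)

// idf returns a smoothed inverse document frequency for a term that occurs
// in df out of n documents. The +1 inside the log keeps it non-negative.
func idf(df, n int) float64 {
	return math.Log(1.0 + (float64(n)-float64(df)+0.5)/(float64(df)+0.5))
}

// bm25 scores one file. tfs maps query terms to their frequency in the file,
// df maps query terms to document frequency, numDocs is the number of visited
// documents, and L is assumed to be file length divided by average length.
func bm25(tfs, df map[string]int, numDocs int, L float64) float64 {
	score := 0.0
	for term, freq := range tfs {
		tf := float64(freq)
		score += idf(df[term], numDocs) * ((k + 1.0) * tf) / (k*(1.0-b+b*L) + tf)
	}
	return score
}

func main() {
	df := map[string]int{"common": 9, "rare": 1}
	// Same term frequency, but the rare term contributes a higher score.
	fmt.Printf("common: %.3f\n", bm25(map[string]int{"common": 3}, df, 10, 1.0))
	fmt.Printf("rare:   %.3f\n", bm25(map[string]int{"rare": 3}, df, 10, 1.0))
}
```

The sketch relies on the invariant from the diff: every term in a file's `tfs` must also be a key of `df`, otherwise `df[term]` silently reads as zero and the term is treated as maximally rare.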

stefanhengl · 2024-06-06T11:20:24Z

> This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.
>
> I'd love to see these evals in particular:
>
> Compare to latest golden queries snapshot

See above.

> Compare to golden queries without my latest change to identify and search symbol definitions

golden queries evals

This PR with Sourcegraph@4f465c5, which is the parent commit of your latest change to identify and search symbol definitions:

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

stefanhengl · 2024-06-06T13:37:05Z

CodeSearchNet

Sourcegraph@4f465c5

Before

Recall (files)  91/99
Recall (chunks) 75/99
Average chunk overlap   0.89

After

Recall (files)  89/99
Recall (chunks) 77/99
Average chunk overlap   0.88

api_proto.go

score.go

keegancsmith

Added an idea to avoid the public API changes.

eval.go

jtibshirani

Nice! Thanks for all the iterations.

stefanhengl added 2 commits June 5, 2024 15:35

fix numFiles, use d.numDocs()

827d00f

stefanhengl requested a review from jtibshirani June 5, 2024 14:02

cla-bot bot added the cla-signed label Jun 5, 2024

stefanhengl commented Jun 5, 2024

score.go (resolved)

jtibshirani reviewed Jun 5, 2024

stefanhengl added 2 commits June 6, 2024 11:49

perform full BM25 calc in scoreFilesUsingBM25

af9a0a4

FileMatch

793e69c

stefanhengl requested a review from jtibshirani June 6, 2024 14:00

jtibshirani reviewed Jun 6, 2024

api_proto.go (outdated, resolved)

score.go (outdated, resolved)

keegancsmith reviewed Jun 7, 2024

eval.go (resolved)

eval.go (outdated, resolved)

stefanhengl added 3 commits June 7, 2024 11:36

move df out of loop

0b3c8f0

revert to auxiliary slice

a9f21ad

revert change to FileMatch

43cd752

keegancsmith approved these changes Jun 7, 2024

eval.go (outdated, resolved)

jtibshirani approved these changes Jun 7, 2024

PR comment

f283039

stefanhengl merged commit 376af3a into main Jun 10, 2024
9 checks passed

stefanhengl deleted the sh/idf branch June 10, 2024 10:31
