fix: don't modify candidates #773

Merged
merged 1 commit into from
May 1, 2024

stefanhengl
Member

@stefanhengl stefanhengl commented Apr 30, 2024

While working on ranking, I noticed that sum-tf is wrong when a file has both filename and content matches.

We use finalCands in our BM25 scoring. However, finalCands is modified in fillChunkMatches and fillMatches, which can lead to surprising scores.
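A minimal sketch of this kind of failure, not the actual zoekt code: the `candidate` type and the helpers `dropFileNameInPlace`, `dropFileNameCopy`, and `countTerm` are hypothetical, and in-place filtering is only an assumed example of how a helper might modify a slice it was handed. It shows how a callee that reuses the caller's backing array changes what a later scoring step counts, and how filtering into a copy avoids that.

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a match candidate.
type candidate struct {
	term     string
	fileName bool // true if the match is in the file name
}

// dropFileNameInPlace filters via cands[:0], so it overwrites the
// caller's backing array while building its result.
func dropFileNameInPlace(cands []candidate) []candidate {
	out := cands[:0]
	for _, c := range cands {
		if !c.fileName {
			out = append(out, c)
		}
	}
	return out
}

// dropFileNameCopy does the same filtering into a fresh slice and
// leaves the input untouched.
func dropFileNameCopy(cands []candidate) []candidate {
	out := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if !c.fileName {
			out = append(out, c)
		}
	}
	return out
}

// countTerm returns how many candidates belong to term.
func countTerm(cands []candidate, term string) int {
	n := 0
	for _, c := range cands {
		if c.term == term {
			n++
		}
	}
	return n
}

func main() {
	cands := []candidate{{"foo", true}, {"foo", false}, {"bar", false}}

	_ = dropFileNameCopy(cands)
	fmt.Println(countTerm(cands, "foo"), countTerm(cands, "bar")) // 2 1

	_ = dropFileNameInPlace(cands)
	fmt.Println(countTerm(cands, "foo"), countTerm(cands, "bar")) // 1 2
}
```

After the in-place call, the caller's slice still has three elements, but their contents have shifted, so any term counts computed from it afterwards no longer reflect the original matches.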

Test plan:
updated unit test

Since a recent change, we use finalCands in our BM25 scoring. However,
finalCands is modified in fillChunkMatches, which led to surprising
scores.

Test plan:
updated unit test
@cla-bot cla-bot bot added the cla-signed label Apr 30, 2024
@stefanhengl stefanhengl requested a review from jtibshirani April 30, 2024 14:47
@stefanhengl stefanhengl marked this pull request as ready for review April 30, 2024 14:47
Member

@jtibshirani jtibshirani left a comment

Important fix, thanks!

- // keyword-score:1.63 (sum-tf: 6.00, length-ratio: 2.00)
- wantScore: 1.63,
+ // keyword-score:1.69 (sum-tf: 7.00, length-ratio: 2.00)
+ wantScore: 1.69,
Member

Wondering why I didn't catch this 🤦

Member Author

Should we add more info to the debug message? We could print the term frequency map instead of sum-tf. This wouldn't have helped for this test case, because we are just searching for one term, but for a more complex query it might be easier to debug.
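A rough sketch of what such a debug message could look like. The helper `formatTF`, the message layout, and the sample values are all hypothetical and only illustrate the idea of printing per-term frequencies instead of their sum. Keys are sorted so the output is stable across runs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatTF renders a term-frequency map with sorted keys so the
// debug output is deterministic.
func formatTF(tf map[string]int) string {
	terms := make([]string, 0, len(tf))
	for t := range tf {
		terms = append(terms, t)
	}
	sort.Strings(terms)

	parts := make([]string, 0, len(terms))
	for _, t := range terms {
		parts = append(parts, fmt.Sprintf("%s:%d", t, tf[t]))
	}
	return "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	// Placeholder values for illustration only.
	tf := map[string]int{"foo": 4, "bar": 3}
	score, lengthRatio := 1.50, 2.0
	fmt.Printf("keyword-score:%.2f (tf: %s, length-ratio: %.2f)\n",
		score, formatTF(tf), lengthRatio)
}
```

With a multi-term query, a wrong count for one term would then be visible directly rather than hidden inside the sum.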

@stefanhengl stefanhengl merged commit 72f9500 into main May 1, 2024
9 checks passed
@stefanhengl stefanhengl deleted the sh/bm25/dont-modify-final-cands branch May 1, 2024 07:21
2 participants