Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BM25: Boost file name matches at root #792

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions build/scoring_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -106,8 +106,8 @@ func TestBM25(t *testing.T) {
fileName: "a/b/c/config.go",
query: &query.Substring{Pattern: "config.go"},
language: "Go",
// bm25-score: 0.60 <- sum-termFrequencyScore: 5.00, length-ratio: 0.00
wantScore: 0.60,
// bm25-score: 0.45 <- sum-termFrequencyScore: 2.00, length-ratio: 0.00
wantScore: 0.45,
},
}

Expand Down
2 changes: 1 addition & 1 deletion eval.go
Original file line number Diff line number Diff line change
Expand Up @@ -330,7 +330,7 @@ nextFileMatch:
// document frequencies. Since we don't store document frequencies in the index,
// we have to defer the calculation of the final BM25 score to after the whole
// shard has been processed.
tf = calculateTermFrequency(finalCands, df)
tf = calculateTermFrequency(&fileMatch, finalCands, df)
} else {
// Use the standard, non-experimental scoring method by default
d.scoreFile(&fileMatch, nextDoc, mt, known, opts)
Expand Down
32 changes: 28 additions & 4 deletions score.go
Original file line number Diff line number Diff line change
Expand Up @@ -108,18 +108,32 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn
}
}

func isAtRoot(path string) bool {
return strings.IndexRune(path, '/') == -1
}

// calculateTermFrequency computes the term frequency for the file match.
//
// Filename matches count more than content matches. This mimics a common text
// search strategy where you 'boost' matches on document titles.
func calculateTermFrequency(cands []*candidateMatch, df termDocumentFrequency) map[string]int {
func calculateTermFrequency(fm *FileMatch, cands []*candidateMatch, df termDocumentFrequency) map[string]int {
// Treat each candidate match as a term and compute the frequencies. For now, ignore case
// sensitivity and treat filenames and symbols the same as content.
termFreqs := map[string]int{}

boostFileName := 2

var evaluated bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this evaluated flag? Could we just do this?

		if cand.fileName {
                        boostFileName := 2
			if isAtRoot(fm.FileName) {
			       boostFileName = 5
			}
			termFreqs[term] += boostFileName
		} else {
			termFreqs[term]++
		}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but we only need to calculate isAtRoot once because all candidates belong to the same file. There might be candidates that match different parts of the filename, so in your version we might end up calling isAtRoot more than we need to.

for _, cand := range cands {
term := string(cand.substrLowered)
if cand.fileName {
termFreqs[term] += 5
if !evaluated {
if isAtRoot(fm.FileName) {
boostFileName = 5
}
evaluated = true
}
termFreqs[term] += boostFileName
} else {
termFreqs[term]++
}
Expand Down Expand Up @@ -155,8 +169,18 @@ type termFrequency struct {
// This keeps things simple for now, since BM25 is not normalized and can be
// tricky to combine with other scoring signals.
func (d *indexData) scoreFilesUsingBM25(fileMatches []FileMatch, tfs []termFrequency, df termDocumentFrequency, opts *SearchOptions) {
// Use standard parameter defaults (used in Lucene and academic papers)
k, b := 1.2, 0.75
// k determines how quickly the TF score saturates with increasing term
// frequencies and b ∈ [0,1] determines how much the score is down-weighted for
// longer documents.
//
// The standard parameter values, used in Lucene and academic papers, are k=1.2
// and b=0.75. However, there is some evidence that other values might work
// better depending on the characteristics of the corpus.
//
// In our experiments we found that smaller values of b work well for our use
// case. This means we don't penalize long files as much. b=0.3 is at the lower
// end of the spectrum of values that are reported in the literature.
k, b := 1.2, 0.3

averageFileLength := float64(d.boundaries[d.numDocs()]) / float64(d.numDocs())
// This is very unlikely, but explicitly guard against division by zero.
Expand Down
2 changes: 1 addition & 1 deletion score_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ func TestCalculateTermFrequency(t *testing.T) {
t.Run("", func(t *testing.T) {
fm := FileMatch{}
df := make(termDocumentFrequency)
tf := calculateTermFrequency(c.cands, df)
tf := calculateTermFrequency(&fm, c.cands, df)

if !maps.Equal(df, c.wantDF) {
t.Errorf("got %v, want %v", df, c.wantDF)
Expand Down
Loading