dampen repetition-boost with log2 #658

keegancsmith · 2023-10-18T19:09:07Z

I sometimes notice very poor quality documents getting boosted on common terms because they contain lots of results. This factor feels like it should work more as a tie-breaker than as something that overrides all other factors.

Note: This work was guided by experimentation locally. Still need to set up a more robust way to track the effects of ranking changes.

Test Plan: searching for "class user" only had code in the top results.

Additionally, this only adds the score if it will be non-zero. I've noticed that the majority of search results have 0 here, so this removes noise from the debug output.

stefanhengl · 2023-10-19T09:51:05Z

eval.go

+		// Prefer docs with several top-scored matches. We use log_2 (bits.Len) to
+		// prevent the repetitions overriding other factors. In this way it acts
+		// more like a tie break.
+		fileMatch.addScore("repetition-boost", scoreRepetitionFactor*float64(bits.Len(repetitions)), opts.DebugScore)


Curious why you picked bits.Len instead of math.Log2? Zoekt already uses math.Log for scoring traction. Doesn't bits.Len lead to step-like behavior, boosting 7 repetitions just as much as 4?

math.Log (or Log2) would give almost linear, or at least strictly monotonic, behavior for small repetition counts and still dampen the boost for large numbers of repetitions.
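To make the step-like behavior concrete, here is a minimal sketch comparing the two dampening functions for small repetition counts. The helper names `boostLen` and `boostLog2` are hypothetical and not part of Zoekt; the sketch also assumes the count is available as a `uint`, and it leaves out `scoreRepetitionFactor`.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// boostLen mirrors the diff's use of bits.Len, which returns the number of
// bits needed to represent n: floor(log2(n)) + 1 for n > 0, and 0 for n == 0.
func boostLen(n uint) float64 {
	return float64(bits.Len(n))
}

// boostLog2 is the smooth alternative. It returns 0 for n == 0 so that the
// result is never -Inf (an assumption of this sketch, not taken from the PR).
func boostLog2(n uint) float64 {
	if n == 0 {
		return 0
	}
	return math.Log2(float64(n))
}

func main() {
	// bits.Len stays flat between powers of two (4..7 all give 3), while
	// math.Log2 increases with every additional repetition.
	for n := uint(1); n <= 8; n++ {
		fmt.Printf("repetitions=%d bits.Len=%.0f math.Log2=%.2f\n", n, boostLen(n), boostLog2(n))
	}
}
```

With bits.Len, a file only gains additional boost when its repetition count crosses the next power of two, which is the threshold effect raised in the comment above.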

keegancsmith · 2023-10-19

Because I over-worried about perf. You are right, I should probably just call log.

keegancsmith · 2023-10-19

@stefanhengl What do you reckon? There will be a lot of calls to this. The other use of log happens at index time, not per filematch.

BenchmarkBoostLen-32    1000000000    0.2085 ns/op
BenchmarkBoostLog-32    191207702     6.110 ns/op
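The benchmark code itself is not shown in the thread. A comparison along these lines could be sketched as follows, assuming hypothetical helpers `boostLen` and `boostLog` and using `testing.Benchmark` so that it runs as a plain program. Absolute numbers will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"testing"
)

// sink keeps the results live so the compiler cannot discard the loop bodies.
var sink float64

// boostLen and boostLog are hypothetical stand-ins for the two variants
// discussed in the thread; they are not Zoekt functions.
func boostLen(n uint) float64 { return float64(bits.Len(n)) }
func boostLog(n uint) float64 { return math.Log2(float64(n)) }

func main() {
	lenRes := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += boostLen(uint(i) + 1) // +1 keeps the input positive
		}
	})
	logRes := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += boostLog(uint(i) + 1)
		}
	})
	fmt.Println("BoostLen:", lenRes)
	fmt.Println("BoostLog:", logRes)
}
```

A sub-nanosecond result for the bits.Len variant is plausible because it typically compiles down to a single instruction, so the gap mostly reflects the cost of a floating-point logarithm.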

stefanhengl · 2023-10-19

I would be more worried about the behaviour of the step-like function than about the performance. At 6 ns/op, wouldn't you have to call this a lot to make a dent in a search? I think we can merge this, but we have to watch out for cases of good files being ranked lower because they haven't crossed the next 2x threshold to get a higher boost.

jtibshirani · 2023-10-19

It also makes sense to me to use math.Log -- it seems really minor compared to the overall cost of assessing a match. And it makes things a bit more robust / easier to reason about.

jtibshirani · 2023-10-19

When I was working on ranking, I actually didn't see good examples where the repetition factor helped! Its definition is a bit narrow: the number of matches in the document that share the top score. Also, am I reading the code correctly that it doesn't apply to chunk matches (which is what we use in Sourcegraph search)?

Maybe we could just remove it to simplify and think about adding it back in a more solid way.
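For readers unfamiliar with that definition, here is a minimal sketch of "the number of matches that share the top score". It is an illustration only: `countTopScored` is a hypothetical helper operating on a plain slice of scores, not Zoekt's actual implementation or data structures.

```go
package main

import "fmt"

// countTopScored returns how many entries in scores equal the maximum score.
// It returns 0 for an empty slice.
func countTopScored(scores []float64) int {
	if len(scores) == 0 {
		return 0
	}
	top := scores[0]
	count := 0
	for _, s := range scores {
		switch {
		case s > top:
			// Found a new maximum: restart the count.
			top, count = s, 1
		case s == top:
			count++
		}
	}
	return count
}

func main() {
	// Two matches share the top score of 9, so the repetition count is 2.
	fmt.Println(countTopScored([]float64{9, 4, 9, 1}))
}
```

Under this definition, a document with many lower-scored matches and a single best match gets no extra credit, which is part of why the factor is described as narrow.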

stefanhengl · 2023-10-19T16:11:18Z

Here is the context why we added the repetition boost.

keegancsmith · 2023-10-19T16:12:43Z

Agreed, I only ever saw it annoy me, which is why I ended up making it use log. I quite like the idea of removing it, especially since we should really also be taking into account the document size (and maybe an estimate of how often a keyword appears in the corpus).

keegancsmith · 2023-10-20T04:44:53Z

Here is the context why we added the repetition boost.

@stefanhengl Sorry, I missed this comment. Let's chat about this later.

keegancsmith requested a review from a team October 18, 2023 19:09

keegancsmith force-pushed the k/repetition-boost branch from 8820c2c to 6a372ee Compare October 19, 2023 06:44

stefanhengl reviewed Oct 19, 2023

stefanhengl self-requested a review October 19, 2023 14:41

stefanhengl approved these changes Oct 19, 2023

jtibshirani reviewed Oct 19, 2023

keegancsmith closed this Oct 19, 2023

keegancsmith deleted the k/repetition-boost branch October 19, 2023 16:12
