dampen repetition-boost with log2 #658

Closed
wants to merge 1 commit into from

keegancsmith
Member

I sometimes notice very poor quality documents getting boosted on common terms because they contain lots of results. This factor feels like it should work more as a tie-breaker than as something that overrides all other factors.

Note: This work was guided by experimentation locally. Still need to set up a more robust way to track the effects of ranking changes.

Test Plan: searching for "class user" only had code in the top results.

@keegancsmith keegancsmith requested a review from a team October 18, 2023 19:09
I sometimes notice very poor quality documents getting boosted on common
terms because they contain lots of results. This factor feels like it
should work more as a tie-breaker than as something that overrides all
other factors.

Additionally, this only adds the score if it will be non-zero. I've
noticed the majority of search results have 0 here, so this removes the
noise in the debug output.

Test Plan: searching for "class user" only had code in the top results.
// Prefer docs with several top-scored matches. We use log_2 (bits.Len) to
// prevent the repetitions overriding other factors. In this way it acts
// more like a tie break.
fileMatch.addScore("repetition-boost", scoreRepetitionFactor*float64(bits.Len(repetitions)), opts.DebugScore)
Member

Curious why you pick bits.Len instead of math.Log2? Zoekt already uses math.Log for scoring traction. Doesn't bits.Len lead to a step-like behavior, boosting 7 repetitions just as much as 4?

math.Log (or Log2) would give an almost linear, or at least strictly monotonic, behavior for small repetition counts and still dampen the boost for large numbers of repetitions.
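
To make the step-versus-smooth difference concrete, here is a minimal sketch comparing the two candidates. The helper names `boostLen` and `boostLog2` are hypothetical, the `+ 1` inside the logarithm is an assumption made only so that zero repetitions map to zero, and the `scoreRepetitionFactor` multiplier from the diff is left out.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// boostLen mirrors the bits.Len variant from the diff.
// For n > 0, bits.Len(n) equals floor(log2(n)) + 1, so it moves in whole steps.
func boostLen(repetitions uint) float64 {
	return float64(bits.Len(repetitions))
}

// boostLog2 is a hypothetical math.Log2 alternative.
// The +1 is an assumption so that 0 repetitions yields 0 instead of -Inf.
func boostLog2(repetitions uint) float64 {
	return math.Log2(float64(repetitions) + 1)
}

func main() {
	for n := uint(1); n <= 8; n++ {
		fmt.Printf("repetitions=%d  bits.Len=%.0f  log2(n+1)=%.3f\n",
			n, boostLen(n), boostLog2(n))
	}
}
```

With `bits.Len`, 4 through 7 repetitions all produce 3 and the value only jumps to 4 at 8 repetitions, whereas the logarithm increases with every additional repetition.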

Member Author

Because I worried too much about perf. You are right, I should probably just call log.

Member Author

@stefanhengl What do you reckon? There will be a lot of calls to this. The other use of log happens at index time, not per file match.

BenchmarkBoostLen-32            1000000000               0.2085 ns/op
BenchmarkBoostLog-32            191207702                6.110 ns/op
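
The benchmark source is not shown in this thread. A comparison along these lines could be sketched as follows; the function names, the `sink` variable, and the loop inputs are all assumptions, and this is not claimed to be the code that produced the numbers above.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"testing"
)

// sink keeps the results live so the compiler cannot drop the loop bodies.
var sink float64

// lenBoost and logBoost are hypothetical stand-ins for the two scoring variants.
func lenBoost(i int) float64 { return float64(bits.Len(uint(i))) }
func logBoost(i int) float64 { return math.Log2(float64(i) + 1) }

func benchLen(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink += lenBoost(i)
	}
}

func benchLog(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink += logBoost(i)
	}
}

func main() {
	// testing.Init registers the testing flags when running outside "go test".
	testing.Init()
	fmt.Println("len:", testing.Benchmark(benchLen))
	fmt.Println("log:", testing.Benchmark(benchLog))
}
```

A figure as low as 0.2 ns/op may indicate that the `bits.Len` call was inlined or partly optimized away, so the measured gap may not reflect the cost inside real scoring code.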

Member

I would be more worried about the behaviour of the step-like function than about the performance. With 6ns/op you have to call this a lot to make a dent on a search? I think we can merge this, but we have to watch out for cases of good files being ranked lower because we haven't crossed the next 2x threshold to get a higher boost.

Member

It also makes sense to me to use math.Log -- it seems really minor compared to the overall cost of assessing a match. And it makes things a bit more robust / easier to reason about.

@stefanhengl stefanhengl self-requested a review October 19, 2023 14:41
Member

@jtibshirani jtibshirani left a comment

When I was working on ranking, I actually didn't see good examples where the repetition factor helped! Its definition is a bit narrow: the number of matches in the document that share the top score. Also, am I reading the code correctly that it doesn't apply to chunk matches (which is what we use in Sourcegraph search)?

Maybe we could just remove it to simplify and think about adding it back in a more solid way.
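
For readers unfamiliar with the definition mentioned above, here is a minimal sketch of "the number of matches in the document that share the top score". The function name `countTopScored` and the plain `[]float64` input are hypothetical and do not reflect Zoekt's actual types.

```go
package main

import "fmt"

// countTopScored returns how many entries in scores are equal to the
// highest score. It returns 0 for an empty slice.
func countTopScored(scores []float64) int {
	if len(scores) == 0 {
		return 0
	}
	best := scores[0]
	count := 0
	for _, s := range scores {
		switch {
		case s > best:
			// New maximum: restart the count.
			best = s
			count = 1
		case s == best:
			count++
		}
	}
	return count
}

func main() {
	// Three matches share the top score of 5.
	fmt.Println(countTopScored([]float64{5, 3, 5, 1, 5}))
}
```

Under this definition a file with many lower-scored matches and a single best match still counts as one repetition, which illustrates why the definition is described as narrow.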

@stefanhengl
Member

Here is the context why we added the repetition boost.

@keegancsmith
Member Author

Agreed, I only ever saw it annoy me, which is why I ended up making it use log. I quite like the idea of removing it, especially since we should really also be taking into account the document size (and maybe an estimate of how often a keyword appears in the corpus).

@keegancsmith keegancsmith deleted the k/repetition-boost branch October 19, 2023 16:12
@keegancsmith
Member Author

Here is the context why we added the repetition boost.

@stefanhengl sorry I missed this comment. Let's chat about this later.
