score: boost exported go ident and downrank _test.go #675

Merged: 2 commits, Oct 26, 2023

Conversation

keegancsmith
Member

Right now our symbol analyser doesn't tell us if a symbol is exported. We add a Go-specific tweak here to boost exported identifiers. Ideally this would be encoded in the symbol information.

Additionally, we already downrank _test.go files via doc-order. But for symbol matches, the boosting outweighs doc-order significantly. I found the extra downranking quite useful when experimenting.
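A minimal sketch of the two tweaks described above, not the code from this PR. The function name `adjustSymbolScore` and both weights are hypothetical; only `go/token.IsExported` and the `_test.go` suffix convention come from Go itself.

```go
package main

import (
	"fmt"
	"go/token"
	"strings"
)

// Hypothetical weights for illustration; the values used in the PR are not
// shown in this thread.
const (
	exportedBoost   = 1.5 // multiplier for exported Go identifiers
	testFilePenalty = 0.5 // multiplier for symbols found in _test.go files
)

// adjustSymbolScore applies the Go-specific tweaks to a symbol match score.
// Files that are not Go sources are returned unchanged.
func adjustSymbolScore(score float64, fileName, symbol string) float64 {
	if !strings.HasSuffix(fileName, ".go") {
		return score
	}
	// token.IsExported reports whether the name starts with an upper-case letter.
	if token.IsExported(symbol) {
		score *= exportedBoost
	}
	if strings.HasSuffix(fileName, "_test.go") {
		score *= testFilePenalty
	}
	return score
}

func main() {
	fmt.Println(adjustSymbolScore(10, "index/score.go", "NewSearcher"))
	fmt.Println(adjustSymbolScore(10, "index/score.go", "newSearcher"))
	fmt.Println(adjustSymbolScore(10, "index/score_test.go", "TestScore"))
}
```

Using multipliers keeps the tweak relative to whatever the base symbol score is; an additive boost would work too, and which one fits better depends on how the rest of the scoring is combined.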

Test Plan: lots of manual testing on the keyword branch

keegancsmith requested a review from a team October 26, 2023 08:53
keegancsmith merged commit b5a5fdc into main Oct 26, 2023
8 checks passed
keegancsmith deleted the k/go-boost branch October 26, 2023 11:20
Member

jtibshirani left a comment

It feels surprising/tricky to be detecting this here within symbol scoring. If we think that test files should be down-ranked more aggressively, it seems like we should just increase the weight of "doc order" in the overall ranking. This could really help across languages too.

Since we don't reliably have code intel "file ranks" (PageRank-inspired ranking), I've been thinking we should come up with an alternative for "file importance" that's super similar to "doc order" and give it a big weight, just like we did for file ranks. In addition to checks for vendored/generated/test files, I found the "many symbols" signal to be super useful, and highly correlated with PageRank!
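To make the idea concrete, here is a rough sketch of what such a "file importance" score could look like. Everything in it is an assumption for illustration: the `fileSignals` type, the path heuristics, the penalty values, and the log-scaled symbol bonus are hypothetical and not taken from zoekt.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// fileSignals holds hypothetical per-file inputs for an importance score.
type fileSignals struct {
	Path        string
	Generated   bool // assumed to be detected elsewhere
	SymbolCount int
}

// isVendored is a simplistic path check, for illustration only.
func isVendored(path string) bool {
	return strings.HasPrefix(path, "vendor/") || strings.Contains(path, "/vendor/")
}

// isTest only knows the Go naming convention; other languages would need
// their own rules.
func isTest(path string) bool {
	return strings.HasSuffix(path, "_test.go")
}

// fileImportance combines the signals into one value that does not depend
// on where the file sits in a shard. All constants are illustrative.
func fileImportance(f fileSignals) float64 {
	score := 1.0
	if isVendored(f.Path) {
		score *= 0.25
	}
	if f.Generated {
		score *= 0.25
	}
	if isTest(f.Path) {
		score *= 0.5
	}
	// "Many symbols" signal: log-scaled so huge files do not dominate.
	score *= 1 + math.Log1p(float64(f.SymbolCount))/10
	return score
}

func main() {
	fmt.Println(fileImportance(fileSignals{Path: "api/types.go", SymbolCount: 120}))
	fmt.Println(fileImportance(fileSignals{Path: "api/types_test.go", SymbolCount: 120}))
}
```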

@keegancsmith
Member Author

> It feels surprising/ tricky to be detecting this here within symbol scoring. If we think that test files should be down-ranked more aggressively, it seems like we should just increase the contribution of "doc order" in the overall ranking, by increasing the weight. This could really help across languages too.

My issue with doc-order is that it depends entirely on things like the size of the shard. This makes it quite hard to reason about when comparing across shards. Doc-order is really useful within a shard to ensure we search more important documents first (in case we hit limits). But otherwise I think we do need some sort of score value that we mix in.

I agree this feels wrong to do at scoring time, and I think as we evolve this it will end up at indexing time. One advantage of doing it at scoring time right now is that we don't need to reindex, so it is great for experimentation.

> I found the "many symbols" signal to be super useful, and highly correlated with PageRank!

Does more symbols mean more important, or the other way around? For example, generated files have a tendency to have lots of symbols. But I think that with proper detection of generated files etc. this is likely a good signal. E.g. in our codebases the types.go or api.go files are generally good documents.

@stefanhengl
Member

Should we consider introducing something like a document category to replace or augment doc order? So instead of using the position of the document, we would use its category, which would make it comparable across shards.
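As a sketch of what a document category could look like, assuming a small fixed set of categories with one weight each: the category names, the weights, and the `result` type below are all hypothetical. The point is only that a category means the same thing in every shard, so scores from different shards can be compared directly.

```go
package main

import (
	"fmt"
	"sort"
)

// category is a hypothetical coarse class assigned to a document at
// indexing time. Unlike a position within a shard, it is shard-independent.
type category int

const (
	catRegular category = iota // ordinary source file
	catTest
	catGenerated
	catVendored
)

// categoryWeight maps each category to an illustrative score multiplier.
var categoryWeight = map[category]float64{
	catRegular:   1.0,
	catTest:      0.5,
	catGenerated: 0.3,
	catVendored:  0.2,
}

// result is a simplified search hit coming back from some shard.
type result struct {
	Shard      string
	File       string
	Cat        category
	MatchScore float64
}

// finalScore mixes the match score with the category weight.
func finalScore(r result) float64 {
	return r.MatchScore * categoryWeight[r.Cat]
}

// rank sorts hits from any number of shards by final score, highest first.
func rank(results []result) {
	sort.SliceStable(results, func(i, j int) bool {
		return finalScore(results[i]) > finalScore(results[j])
	})
}

func main() {
	results := []result{
		{Shard: "shard-a", File: "search_test.go", Cat: catTest, MatchScore: 10},
		{Shard: "shard-b", File: "search.go", Cat: catRegular, MatchScore: 8},
	}
	rank(results)
	for _, r := range results {
		fmt.Println(r.Shard, r.File, finalScore(r))
	}
}
```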

@jtibshirani
Member

I filed https://github.com/sourcegraph/sourcegraph/issues/57950; we could continue the conversation there!

3 participants