Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Searcher: improve how matches are built #59527

Merged
merged 4 commits into from
Jan 12, 2024
Merged

Searcher: improve how matches are built #59527

merged 4 commits into from
Jan 12, 2024

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Jan 11, 2024

This PR improves how searcher creates matches, making it more consistent with
how it's done in Zoekt.

Changes:

  • Pull chunking logic out of structural search code and into its own file
    chunk.go
  • Remove overlapping ranges (this is what Zoekt does when chunk matches are
    enabled)
  • Optimize the column calculation using the same strategy from Zoekt (zoekt#711)

Test plan

Adapted existing unit tests and added some new cases

@@ -242,6 +241,10 @@ func locsToRanges(buf []byte, locs [][]int) []protocol.Range {
prevStart := 0
prevStartLine := 0

c := columnHelper{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed some other opportunities for optimizations around newline handling. I've run out of steam for searcher improvements though, so just stopped here :)

func mergeMatches(matches [][]int, limit int) [][]int {
if len(matches) == 0 {
Copy link
Member Author

@jtibshirani jtibshirani Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This revisits the decision I made in a previous PR but is still in line with our discussion. I now think it's really nice to match the Zoekt behavior so we can share optimizations, mental models, etc.

c.lastOffset = offset
c.lastRuneCount = runeCount

return runeCount
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Zoekt, this line reads return runeCount + 1. However, that caused searcher tests to fail! It seems that searcher considers the column value start at 0, whereas Zoekt starts it at 1.

I plan to push a commit to update this to be consistent with Zoekt, and adapt all the tests. Since this discrepancy has been around for a while, I'm guessing we don't have frontend logic relying on the column value. Question for reviewer: could you double-check me on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?

I double checked, looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it consistent with Zoekt sounds good to me. IIRC, this was all just trying to preserve historical behavior.

We do use column in the web client. However, we could probably get rid of it in the API because both column and line can be calculated on the fly if we we have the chunk content (and the line number it starts on). The downside to dropping it from the API is that Zoekt can calculate lines more efficiently because it stores the offsets of newlines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. I looked into this further, and the logic here is actually correct:

Definitely surprising, but no bug to fix at the moment.

@jtibshirani jtibshirani marked this pull request as ready for review January 11, 2024 21:12
@jtibshirani jtibshirani requested review from camdencheek and a team January 11, 2024 21:12
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

lastChunk := &chunks[len(chunks)-1] // pointer for mutability
if lastChunk.cover.End.Line+interChunkLines >= rr.Start.Line {
// The current range overlaps with the current chunk, so merge them
lastChunk.ranges = ranges[i-len(lastChunk.ranges) : i+1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is kinda tricky to get right. Why not just use append? If I am not mistaken append won't allocate, it will reuse the same underlying array as ranges. I guess the tricky thing there is on another level in that you worry about reusing the slice.

But either way I like that you re-use the underlying array. We don't do this in zoekt but should to avoid all the extra allocations. Did the benchmarks in searcher make you go down this path?

Copy link
Member Author

@jtibshirani jtibshirani Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a straight refactor, I didn't write any new logic here. I just moved these methods from search_structural to chunk as a clean-up, since they were shared between core searcher logic and structural search. Here's the commit: https://github.com/sourcegraph/sourcegraph/pull/59527/commits/d4f7087913c9413d61187a0a3d20b34858d8a03f.

I think there's a lot of room for improvement here, but I'd like to scope this PR and not touch this logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: in the future, I'll pull these moves out into their own PR since it makes the commit history clearer

End: protocol.Location{
Offset: lastLineEnd,
Line: inputRange.End.Line,
Column: int32(utf8.RuneCount(buf[lastLineStart:lastLineEnd])),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, this doesn't have the hidden n^2 because you only calculate this column once per line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my understanding as well. Also to clarify, this is pre-existing logic that I didn't touch, it's just moved: https://github.com/sourcegraph/sourcegraph/commit/d4f7087913c9413d61187a0a3d20b34858d8a03f.

c.lastOffset = offset
c.lastRuneCount = runeCount

return runeCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?

I double checked, looks good.

@jtibshirani jtibshirani merged commit b5937ed into main Jan 12, 2024
15 checks passed
@jtibshirani jtibshirani deleted the jtibs/matches branch January 12, 2024 18:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants