-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Searcher: improve how matches are built #59527
Conversation
@@ -242,6 +241,10 @@ func locsToRanges(buf []byte, locs [][]int) []protocol.Range { | |||
prevStart := 0 | |||
prevStartLine := 0 | |||
|
|||
c := columnHelper{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed some other opportunities for optimizations around newline handling. I've run out of steam for searcher improvements though, so just stopped here :)
func mergeMatches(matches [][]int, limit int) [][]int { | ||
if len(matches) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This revisits the decision I made in a previous PR but is still in line with our discussion. I now think it's really nice to match the Zoekt behavior so we can share optimizations, mental models, etc.
c.lastOffset = offset | ||
c.lastRuneCount = runeCount | ||
|
||
return runeCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Zoekt, this line reads return runeCount + 1
. However, that caused searcher tests to fail! It seems that searcher considers the column value start at 0, whereas Zoekt starts it at 1.
I plan to push a commit to update this to be consistent with Zoekt, and adapt all the tests. Since this discrepancy has been around for a while, I'm guessing we don't have frontend logic relying on the column value. Question for reviewer: could you double-check me on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?
I double checked, looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it consistent with Zoekt sounds good to me. IIRC, this was all just trying to preserve historical behavior.
We do use column in the web client. However, we could probably get rid of it in the API because both column and line can be calculated on the fly if we we have the chunk content (and the line number it starts on). The downside to dropping it from the API is that Zoekt can calculate lines more efficiently because it stores the offsets of newlines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context. I looked into this further, and the logic here is actually correct:
- The GraphQL API converts the column to
character
, which is 0-indexed - When frontend processes Zoekt matches, it subtracts 1 from the column
Definitely surprising, but no bug to fix at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
lastChunk := &chunks[len(chunks)-1] // pointer for mutability | ||
if lastChunk.cover.End.Line+interChunkLines >= rr.Start.Line { | ||
// The current range overlaps with the current chunk, so merge them | ||
lastChunk.ranges = ranges[i-len(lastChunk.ranges) : i+1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic is kinda tricky to get right. Why not just use append? If I am not mistaken append won't allocate, it will reuse the same underlying array as ranges. I guess the tricky thing there is on another level in that you worry about reusing the slice.
But either way I like that you re-use the underlying array. We don't do this in zoekt but should to avoid all the extra allocations. Did the benchmarks in searcher make you go down this path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a straight refactor, I didn't write any new logic here. I just moved these methods from search_structural
to chunk
as a clean-up, since they were shared between core searcher logic and structural search. Here's the commit: https://github.com/sourcegraph/sourcegraph/pull/59527/commits/d4f7087913c9413d61187a0a3d20b34858d8a03f.
I think there's a lot of room for improvement here, but I'd like to scope this PR and not touch this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: in the future, I'll pull these moves out into their own PR since it makes the commit history clearer
End: protocol.Location{ | ||
Offset: lastLineEnd, | ||
Line: inputRange.End.Line, | ||
Column: int32(utf8.RuneCount(buf[lastLineStart:lastLineEnd])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, this doesn't have the hidden n^2 because you only calculate this column once per line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my understanding as well. Also to clarify, this is pre-existing logic that I didn't touch, it's just moved: https://github.com/sourcegraph/sourcegraph/commit/d4f7087913c9413d61187a0a3d20b34858d8a03f.
c.lastOffset = offset | ||
c.lastRuneCount = runeCount | ||
|
||
return runeCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?
I double checked, looks good.
This PR improves how searcher creates matches, making it more consistent with
how it's done in Zoekt.
Changes:
chunk.go
enabled)
Test plan
Adapted existing unit tests and added some new cases