Searcher: improve how matches are built #59527

jtibshirani · 2024-01-11T20:39:31Z

This PR improves how searcher creates matches, making it more consistent with
how it's done in Zoekt.

Changes:

- Pull chunking logic out of structural search code and into its own file, chunk.go
- Remove overlapping ranges (this is what Zoekt does when chunk matches are enabled)
- Optimize the column calculation using the same strategy from Zoekt (zoekt#711)

Test plan

Adapted existing unit tests and added some new cases

jtibshirani · 2024-01-11T20:43:23Z

cmd/searcher/internal/search/search_regex.go

@@ -242,6 +241,10 @@ func locsToRanges(buf []byte, locs [][]int) []protocol.Range {
 	prevStart := 0
 	prevStartLine := 0

+	c := columnHelper{


I noticed some other opportunities for optimizations around newline handling. I've run out of steam for searcher improvements though, so just stopped here :)

jtibshirani · 2024-01-11T20:44:40Z

cmd/searcher/internal/search/matchtree.go

 func mergeMatches(matches [][]int, limit int) [][]int {
+	if len(matches) == 0 {


This revisits the decision I made in a previous PR but is still in line with our discussion. I now think it's really nice to match the Zoekt behavior so we can share optimizations, mental models, etc.
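As a rough illustration of what removing overlapping ranges could look like, here is a minimal sketch. It is not the PR's mergeMatches: the name dropOverlaps is hypothetical, and it assumes matches are [start, end) byte offsets sorted by start, with the earlier match winning when two overlap.

```go
package main

import "fmt"

// dropOverlaps is a hypothetical helper, not the PR's mergeMatches.
// Assumptions: each match is a [start, end) pair of byte offsets, and
// matches are sorted by start. A match is kept only if it begins at or
// after the end of the last match that was kept.
func dropOverlaps(matches [][]int) [][]int {
	if len(matches) == 0 {
		return matches
	}
	kept := make([][]int, 0, len(matches))
	kept = append(kept, matches[0])
	for _, m := range matches[1:] {
		prev := kept[len(kept)-1]
		if m[0] < prev[1] {
			continue // starts inside the previously kept match, so drop it
		}
		kept = append(kept, m)
	}
	return kept
}

func main() {
	in := [][]int{{0, 5}, {3, 8}, {5, 9}, {9, 12}}
	fmt.Println(dropOverlaps(in)) // prints [[0 5] [5 9] [9 12]]
}
```

Adjacent matches such as {0, 5} and {5, 9} are kept here because the ranges are treated as half-open; whether searcher treats touching ranges the same way is not stated in this thread.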

jtibshirani · 2024-01-11T20:47:29Z

cmd/searcher/internal/search/chunk.go

+	c.lastOffset = offset
+	c.lastRuneCount = runeCount
+
+	return runeCount


In Zoekt, this line reads return runeCount + 1. However, that caused searcher tests to fail! It seems that searcher considers the column value to start at 0, whereas Zoekt starts it at 1.

I plan to push a commit to update this to be consistent with Zoekt, and adapt all the tests. Since this discrepancy has been around for a while, I'm guessing we don't have frontend logic relying on the column value. Question for reviewer: could you double-check me on this?
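For readers following along, the caching idea behind the column calculation can be sketched roughly as below. This is an illustration under assumptions, not the code in chunk.go: only lastOffset, lastRuneCount, and the final return runeCount appear in the diff above; the data and lastLineOffset fields and the get signature are assumed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// columnHelper is an illustrative sketch. The fields data and lastLineOffset
// and the method signature are assumptions; only lastOffset and lastRuneCount
// are visible in the diff.
type columnHelper struct {
	data []byte

	lastLineOffset int
	lastOffset     int
	lastRuneCount  int
}

// get returns the 0-based column, counted in runes, of the byte at offset.
// lineOffset is the byte offset where the line containing offset starts.
func (c *columnHelper) get(lineOffset, offset int) int {
	var runeCount int
	if lineOffset == c.lastLineOffset && offset >= c.lastOffset {
		// Same line as the previous call and further right:
		// only count the runes since the previous offset.
		runeCount = c.lastRuneCount + utf8.RuneCount(c.data[c.lastOffset:offset])
	} else {
		// Different line (or an earlier offset): count from the line start.
		runeCount = utf8.RuneCount(c.data[lineOffset:offset])
	}

	c.lastLineOffset = lineOffset
	c.lastOffset = offset
	c.lastRuneCount = runeCount

	return runeCount // a 1-based variant would return runeCount + 1
}

func main() {
	c := columnHelper{data: []byte("héllo wörld\nfoo")}
	fmt.Println(c.get(0, 7))   // 'w' on the first line: prints 6
	fmt.Println(c.get(0, 10))  // 'r' on the first line: prints 8
	fmt.Println(c.get(14, 15)) // first 'o' on the second line: prints 1
}
```

When offsets arrive in increasing order within a line, each byte of the line is decoded at most once, which is the point of caching the previous result.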

cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?

I double checked, looks good.

Making it consistent with Zoekt sounds good to me. IIRC, this was all just trying to preserve historical behavior.

We do use column in the web client. However, we could probably get rid of it in the API because both column and line can be calculated on the fly if we have the chunk content (and the line number it starts on). The downside to dropping it from the API is that Zoekt can calculate lines more efficiently because it stores the offsets of newlines.
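To make the "calculated on the fly" point concrete, here is a minimal sketch of deriving line and column from chunk content. The helper name locate and its parameters are hypothetical; it assumes the column is a 0-based rune count and that offset is a byte offset into the chunk.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// locate is a hypothetical helper. Given the chunk content, the line number
// of the chunk's first line, and a byte offset into the chunk, it returns
// the line number and the 0-based rune column of that offset.
func locate(content []byte, startLine, offset int) (line, column int) {
	before := content[:offset]
	// Every newline before the offset moves us down one line.
	line = startLine + bytes.Count(before, []byte{'\n'})
	// The current line starts just after the last newline (or at 0 if none).
	lineStart := bytes.LastIndexByte(before, '\n') + 1
	column = utf8.RuneCount(before[lineStart:])
	return line, column
}

func main() {
	content := []byte("foo\nbär baz\n")
	line, column := locate(content, 10, 9) // byte 9 is the 'b' of "baz"
	fmt.Println(line, column)              // prints 11 4
}
```

Each call rescans the content before the offset, which illustrates the downside mentioned above: a consumer without stored newline offsets does more work per match.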

Thanks for the context. I looked into this further, and the logic here is actually correct:

- The GraphQL API converts the column to character, which is 0-indexed
- When the frontend processes Zoekt matches, it subtracts 1 from the column

Definitely surprising, but no bug to fix at the moment.

keegancsmith

LGTM!

keegancsmith · 2024-01-12T13:16:40Z

cmd/searcher/internal/search/chunk.go

+		lastChunk := &chunks[len(chunks)-1] // pointer for mutability
+		if lastChunk.cover.End.Line+interChunkLines >= rr.Start.Line {
+			// The current range overlaps with the current chunk, so merge them
+			lastChunk.ranges = ranges[i-len(lastChunk.ranges) : i+1]


This logic is kinda tricky to get right. Why not just use append? If I am not mistaken, append won't allocate; it will reuse the same underlying array as ranges. I guess the tricky part there is at another level: you have to worry about reusing the slice.

But either way I like that you reuse the underlying array. We don't do this in Zoekt, but we should, to avoid all the extra allocations. Did the benchmarks in searcher make you go down this path?

This is a straight refactor, I didn't write any new logic here. I just moved these methods from search_structural to chunk as a clean-up, since they were shared between core searcher logic and structural search. Here's the commit: https://github.com/sourcegraph/sourcegraph/pull/59527/commits/d4f7087913c9413d61187a0a3d20b34858d8a03f.

I think there's a lot of room for improvement here, but I'd like to scope this PR and not touch this logic.

Note to self: in the future, I'll pull these moves out into their own PR, since that makes the commit history clearer.
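For context on the reslicing being discussed, here is a minimal sketch of grouping ranges into chunks without copying them. The types lineRange and chunk and the function chunkRanges are simplified, hypothetical stand-ins for the protocol types in chunk.go, and the sketch assumes ranges are sorted by start line.

```go
package main

import "fmt"

// lineRange and chunk are simplified, hypothetical stand-ins for the
// protocol types used in chunk.go.
type lineRange struct{ startLine, endLine int }

type chunk struct {
	coverEnd int         // last line covered by this chunk
	ranges   []lineRange // window into the caller's slice, never a copy
}

// chunkRanges groups ranges (assumed sorted by startLine) into chunks,
// merging a range into the previous chunk when it starts within
// interChunkLines of that chunk's last covered line.
func chunkRanges(ranges []lineRange, interChunkLines int) []chunk {
	var chunks []chunk
	for i, rr := range ranges {
		if len(chunks) > 0 {
			last := &chunks[len(chunks)-1] // pointer for mutability
			if last.coverEnd+interChunkLines >= rr.startLine {
				// Widen the window over the shared backing array by one element.
				last.ranges = ranges[i-len(last.ranges) : i+1]
				if rr.endLine > last.coverEnd {
					last.coverEnd = rr.endLine
				}
				continue
			}
		}
		chunks = append(chunks, chunk{coverEnd: rr.endLine, ranges: ranges[i : i+1]})
	}
	return chunks
}

func main() {
	ranges := []lineRange{{1, 1}, {2, 3}, {10, 10}, {11, 12}}
	chunks := chunkRanges(ranges, 1)
	fmt.Println(len(chunks), len(chunks[0].ranges)) // prints 2 2
}
```

In this sketch a chunk's window always ends at the element just before the current range, so append(last.ranges, rr) would write rr into the slot it already occupies and would not allocate. The explicit reslice just makes that aliasing visible.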

keegancsmith · 2024-01-12T13:18:40Z

cmd/searcher/internal/search/chunk.go

+		End: protocol.Location{
+			Offset: lastLineEnd,
+			Line:   inputRange.End.Line,
+			Column: int32(utf8.RuneCount(buf[lastLineStart:lastLineEnd])),


Just to confirm, this doesn't have the hidden n^2 because you only calculate this column once per line?

That's my understanding as well. Also to clarify, this is pre-existing logic that I didn't touch, it's just moved: https://github.com/sourcegraph/sourcegraph/commit/d4f7087913c9413d61187a0a3d20b34858d8a03f.

keegancsmith · 2024-01-12T13:20:38Z

cmd/searcher/internal/search/chunk.go

+	c.lastOffset = offset
+	c.lastRuneCount = runeCount
+
+	return runeCount


cc @camdencheek who would know if column is used since I believe he included it in the chunkmatch API. If we don't actually use this value, can't we just remove it?

I double checked, looks good.

jtibshirani added 4 commits January 11, 2024 12:28

Pull chunk methods out to new file

d4f7087

Remove overlapping matches

ea71153

Optimize column calculation

fbd16fa

Run bazel configure

08fe553

cla-bot bot added the cla-signed label Jan 11, 2024

jtibshirani mentioned this pull request Jan 11, 2024

Searcher: optimize AND/ OR patterns #59038

Closed

jtibshirani marked this pull request as ready for review January 11, 2024 21:12

jtibshirani requested review from camdencheek and a team January 11, 2024 21:12

keegancsmith approved these changes Jan 12, 2024

jtibshirani merged commit b5937ed into main Jan 12, 2024
15 checks passed

jtibshirani deleted the jtibs/matches branch January 12, 2024 18:48
