
chunkmatches: reuse last calculated column when filling #711

Merged
3 commits merged into main from k/chunk-matches-perf on Jan 10, 2024

Conversation

@keegancsmith (Member) commented on Jan 9, 2024

This change uses the fact that candidate matches should be increasing in byte offset to avoid recounting runes on a line. Before this change, if a line had many matches, we would call utf8.RuneCount for each match, which is an O(nm) algorithm where n is the line length and m is the number of matches. After this change the complexity is O(n).
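
To illustrate the idea, here is a minimal sketch of such a helper (the package, names, and signatures are illustrative, not the exact code added in this PR): it remembers the last offset it resolved on a line, so a later match on the same line only counts the runes between the two offsets.

package sketch

import "unicode/utf8"

// columnHelper caches the last (lineOffset, offset) pair it resolved so that
// later lookups on the same line only count the runes since the previous
// offset instead of re-counting from the start of the line.
type columnHelper struct {
	data []byte

	// state from the previous call to get
	lastLineOffset int
	lastOffset     int
	lastRuneCount  int
}

// get returns the 1-based column (in runes) of offset on the line starting at
// lineOffset. Offsets are expected to be non-decreasing within a line; if they
// are not, we fall back to counting from the start of the line.
func (c *columnHelper) get(lineOffset, offset int) int {
	var runes int
	if lineOffset == c.lastLineOffset && offset >= c.lastOffset {
		// Same line, moving forward: only count the new bytes.
		runes = c.lastRuneCount + utf8.RuneCount(c.data[c.lastOffset:offset])
	} else {
		// New line (or out-of-order offset): count from the line start.
		runes = utf8.RuneCount(c.data[lineOffset:offset])
	}
	c.lastLineOffset = lineOffset
	c.lastOffset = offset
	c.lastRuneCount = runes
	return runes + 1
}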

I came across this while investigating slow performance: searching the string "dev" on s2 took 2s when the match limit was 100k instead of 10k; with 10k it took 0.04s. It turns out that with the larger limit we ended up searching a file where the word dev appeared many times on one line. Running a profiler against the service showed 96% of CPU time in utf8.RuneCount.

This commit adds a benchmark for the helper introduced to reuse RuneCounts. Unsurprisingly the difference is massive between O(nm) and O(n) :)

name             old time/op  new time/op  delta
ColumnHelper-32   299ms ± 2%     0ms ± 2%  -99.97%  (p=0.000 n=10+10)

See details in a comment below for how I obtained the profiles and the information from them.

Test Plan: Added tests and benchmarks.

This doesn't change the logic; it just moves it into a struct so I can make it smarter. Additionally, we add a benchmark and test in this commit. The next commit contains the perf improvement, taking the column calculation from O(nm) to O(n):

  benchstat before.txt after.txt
  name             old time/op  new time/op  delta
  ColumnHelper-32   299ms ± 2%     0ms ± 2%  -99.97%  (p=0.000 n=10+10)
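
For reference, a benchmark along these lines exercises the worst case of many matches packed onto one long line (a sketch against the hypothetical columnHelper above, not the PR's actual BenchmarkColumnHelper):

package sketch

import (
	"bytes"
	"testing"
)

// BenchmarkColumnHelperSketch looks up columns for thousands of matches on a
// single long line. With a naive per-match utf8.RuneCount this is O(nm); with
// the cached counter above it is O(n).
func BenchmarkColumnHelperSketch(b *testing.B) {
	line := bytes.Repeat([]byte("dev "), 10000) // one long line, many matches

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch := columnHelper{data: line}
		for offset := 0; offset < len(line); offset += 4 {
			_ = ch.get(0, offset)
		}
	}
}
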
@keegancsmith (Member, Author) commented:

Here are my journal notes for the pprof part of this investigation. It contains useful information for future performance debugging.

[2024-01-09 Tue 10:51] Gonna try grab a pprof during the search. It is only 2 seconds, so unsure how useful it will be. Maybe can try adding a loop.

I can just visit https://sourcegraph.sourcegraph.com/-/debug/proxies/indexed-search-0/debug/pprof/profile

I then used devtools to "Copy as cURL" the profile request, and did the same for the search request. For both, I copied the command into a bash function and added the --silent --show-error flags for my sanity. The script looked like this (curl flags removed):

#!/usr/bin/env bash

set -e

function fetch_profile {
    echo "profile start"
    curl 'https://sourcegraph.sourcegraph.com/-/debug/proxies/indexed-search-0/debug/pprof/profile' > /tmp/cpu.pprof
    echo "profile done"
}

function search {
    echo "search start"
    curl 'https://sourcegraph.sourcegraph.com/search/stream?q=context%3Aglobal%20repo%3Agithub.com%2Fsourcegraph%2Fsourcegraph%20%20content%3A%22dev%22%20&v=V3&t=newStandardRC1&sm=0&display=1500&cm=t&trace=1&feat=search-debug' > /dev/null
    echo "search done"
}

# Start the profile fetch in the background, then keep issuing searches while
# that background job is still running so the profile captures the search load.
fetch_profile &

while jobs %%; do
    search
done

Next thing I needed was the zoekt binary used. I did this by getting the version of s2 at https://sourcegraph.sourcegraph.com/__version and then getting the binary from the docker container:

docker pull sourcegraph/indexed-searcher:257084_2024-01-09_5.2-9efa6c7e2efb
docker create sourcegraph/indexed-searcher:257084_2024-01-09_5.2-9efa6c7e2efb
docker cp 74ad840ae6a8f51ddec7ff4a382660ca8a62bb4d47d7730f355d71f9b68cde15:/usr/local/bin/zoekt-webserver /tmp/
docker rm -v 74ad840ae6a8f51ddec7ff4a382660ca8a62bb4d47d7730f355d71f9b68cde15
go tool pprof -http 127.0.0.1:6062 zoekt-webserver cpu.pprof

Then, so I could use the source code listing:

go tool pprof -trim_path external/com_github_sourcegraph_zoekt/ -source_path ~/src/github.com/sourcegraph/zoekt zoekt-webserver cpu.pprof

Turns out we spend 96.50% in utf8.RuneCount inside fillContentChunkMatches. This is an issue only with chunk matches, not the original format. We calculate the column, and it has a hidden O(n^2) algorithm in it! If you have a long line, we basically do an O(n) operation on that line per match, where n is the line length.

Column: uint32(utf8.RuneCount(data[startLineOffset:startOffset]) + 1),

(pprof) list fillContentChunkMatches
Total: 25.15s
ROUTINE ======================== github.com/sourcegraph/zoekt.(*contentProvider).fillContentChunkMatches in contentprovider.go
      10ms     24.36s (flat, cum) 96.86% of Total
         .          .    291:func (p *contentProvider) fillContentChunkMatches(ms []*candidateMatch, numContextLines int) []ChunkMatch {
         .       20ms    292:	newlines := p.newlines()
         .       30ms    293:	chunks := chunkCandidates(ms, newlines, numContextLines)
         .          .    294:	data := p.data(false)
         .          .    295:	chunkMatches := make([]ChunkMatch, 0, len(chunks))
         .          .    296:	for _, chunk := range chunks {
         .       20ms    297:		ranges := make([]Range, 0, len(chunk.candidates))
         .          .    298:		var symbolInfo []*Symbol
         .          .    299:		for i, cm := range chunk.candidates {
         .          .    300:			startOffset := cm.byteOffset
         .          .    301:			endOffset := cm.byteOffset + cm.byteMatchSz
      10ms       20ms    302:			startLine, startLineOffset, _ := newlines.atOffset(startOffset)
         .          .    303:			endLine, endLineOffset, _ := newlines.atOffset(endOffset)
         .          .    304:
         .          .    305:			ranges = append(ranges, Range{
         .          .    306:				Start: Location{
         .          .    307:					ByteOffset: startOffset,
         .          .    308:					LineNumber: uint32(startLine),
         .     12.42s    309:					Column:     uint32(utf8.RuneCount(data[startLineOffset:startOffset]) + 1),
         .          .    310:				},
         .          .    311:				End: Location{
         .          .    312:					ByteOffset: endOffset,
         .          .    313:					LineNumber: uint32(endLine),
         .     11.85s    314:					Column:     uint32(utf8.RuneCount(data[endLineOffset:endOffset]) + 1),
         .          .    315:				},
         .          .    316:			})
         .          .    317:
         .          .    318:			if cm.symbol {
         .          .    319:				if symbolInfo == nil {

@camdencheek (Member) left a comment:
Wow! Great find!

I think that equivalent logic exists in searcher. We should probably implement this same thing there as well.

// columnHelper is a helper struct which caches the number of runes last
// counted. If we naively use utf8.RuneCount for each match on a line, this
// leads to an O(nm) algorithm where m is the number of matches and n is the
// length of the line. Aassuming we our candidates are increasing in offset

Suggested change:
// length of the line. Aassuming we our candidates are increasing in offset
// length of the line. Assuming we our candidates are increasing in offset

Just to check: are we always sure we can assume our candidates are increasing in offset? I can't remember if this is always true.

Oh, reading the implementation, I guess we just fall back to the less performant version.

@jtibshirani (Member) commented Jan 9, 2024:

I believe this invariant is true, because in gatherMatches we always make sure to sort by byteOffset.

Maybe we could update the comments to make it clear this invariant is assumed, and treat the unsorted case as an error rather than being expected? That way if we ever introduce a bug here, we don't silently fall back to an O(n^2) algorithm... much harder to track down than a clear error in testing.

General thought: if invariants are too tricky to reason about, sometimes I just explicitly add a (re)sort! I believe Go's default sort is very fast when the input is already sorted. This bounds the worst case nicely.

@keegancsmith (Member, Author) replied:

I also checked the invariants. We additionally rely on matches not overlapping, which matters since we look up the end column too.

The sorted invariant is quite important for other bits of code like chunkCandidates. So what I did was add a sorted check which loudly complains and then sorts if the invariant is broken.

Initially I pretended I was a Haskell programmer and added a special type which guaranteed this, but TBH it felt quite overengineered. Happy to try it out if there is interest, but for now I'm gonna merge with the extra perf-invariant documentation and sort check.
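
A minimal sketch of that kind of guard, assuming a candidateMatch with a byteOffset field (the real check and its logging live in zoekt; this only shows the shape):

package sketch

import (
	"log"
	"sort"
)

// candidateMatch is a stand-in for zoekt's internal type; only the field the
// check needs is shown.
type candidateMatch struct {
	byteOffset uint32
}

// sortIfNeeded enforces the sorted-by-byteOffset invariant that the column
// caching (and chunkCandidates) rely on. If the invariant is broken we complain
// loudly and repair it rather than silently degrading to O(nm) behaviour.
func sortIfNeeded(ms []*candidateMatch) {
	less := func(i, j int) bool { return ms[i].byteOffset < ms[j].byteOffset }
	if sort.SliceIsSorted(ms, less) {
		return
	}
	log.Printf("BUG: candidate matches are not sorted by byteOffset; sorting them")
	sort.Slice(ms, less)
}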

@jtibshirani (Member) left a comment:
Great to fix this!

@jtibshirani (Member) commented:

I think that equivalent logic exists in searcher. We should probably implement this same thing there as well.

Assigning this to myself! Said differently: please don't work on this concurrently, as it will cause a lot of conflicts with my in-progress searcher optimizations (https://github.com/sourcegraph/sourcegraph/issues/59038) :)

@keegancsmith merged commit 7487a0d into main on Jan 10, 2024
8 checks passed
@keegancsmith deleted the k/chunk-matches-perf branch on January 10, 2024 at 09:45
jtibshirani added a commit to sourcegraph/sourcegraph-public-snapshot that referenced this pull request Jan 12, 2024
This PR improves how searcher creates matches, making it more consistent with
how it's done in Zoekt.

Changes:
* Pull chunking logic out of structural search code and into its own file
`chunk.go`
* Remove overlapping ranges (this is what Zoekt does when chunk matches are
enabled)
* Optimize the column calculation using the same strategy from Zoekt ([zoekt#711](sourcegraph/zoekt#711))