chunkmatches: reuse last calculated column when filling #711
Conversation
This doesn't change the logic, it just moves it into a struct so I can make it smarter. Additionally we add a benchmark and test in this commit. The next commit will contain the perf improvement.
O(nm) to O(n):

```
$ benchstat before.txt after.txt
name             old time/op  new time/op  delta
ColumnHelper-32  299ms ± 2%   0ms ± 2%     -99.97%  (p=0.000 n=10+10)
```
Here are my journal notes for the pprof part of this investigation. They contain useful information for future performance debugging.

[2024-01-09 Tue 10:51] Gonna try grab a pprof during the search. It is only 2 seconds, so unsure how useful it will be. Maybe can try adding a loop. I can just visit https://sourcegraph.sourcegraph.com/-/debug/proxies/indexed-search-0/debug/pprof/profile I then used devtools to
Next thing I needed was the zoekt binary used. I did this by getting the version of s2 at https://sourcegraph.sourcegraph.com/__version and then getting the binary from the docker container:
Then, so I could use source code listing:
Turns out we spend 96.50% in Line 309 in e92f6c7
Wow! Great find!
I think that equivalent logic exists in searcher. We should probably implement this same thing there as well.
```go
// columnHelper is a helper struct which caches the number of runes last
// counted. If we naively use utf8.RuneCount for each match on a line, this
// leads to an O(nm) algorithm where m is the number of matches and n is the
// length of the line. Aassuming we our candidates are increasing in offset
```
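The caching idea can be sketched roughly as follows. This is a minimal illustration of the technique, not the actual zoekt code: the struct shape, field names, and fallback behavior here are assumptions.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// columnHelper caches the byte offset and rune count from the previous
// lookup. For offsets that only ever increase, each call counts just the
// runes between the last offset and the new one, so m lookups on an
// n-byte line cost O(n) total instead of O(nm).
type columnHelper struct {
	data       []byte
	lastOffset int // byte offset of the previous lookup
	lastRunes  int // rune count up to lastOffset
}

// get returns the 1-based column (in runes) of a byte offset on the line.
func (c *columnHelper) get(offset int) int {
	if offset < c.lastOffset {
		// Invariant broken (offsets went backwards): fall back to a
		// full recount from the start of the line.
		c.lastOffset = 0
		c.lastRunes = 0
	}
	c.lastRunes += utf8.RuneCount(c.data[c.lastOffset:offset])
	c.lastOffset = offset
	return c.lastRunes + 1
}

func main() {
	// "héllo " is 6 runes but 7 bytes (é takes 2 bytes), so the second
	// word starts at byte offset 7 but rune column 7 as well.
	ch := columnHelper{data: []byte("héllo héllo héllo")}
	fmt.Println(ch.get(0)) // 1
	fmt.Println(ch.get(7)) // 7
}
```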
```diff
- // length of the line. Aassuming we our candidates are increasing in offset
+ // length of the line. Assuming we our candidates are increasing in offset
```
Just to check: are we always sure we can assume our candidates are increasing in offset? I can't remember if this is always true.
Oh, reading the implementation, I guess we just fall back to the less performant version.
I believe this invariant is true, because in gatherMatches we always make sure to sort by byteOffset.
Maybe we could update the comments to make it clear this invariant is assumed, and treat the unsorted case as an error rather than being expected? That way if we ever introduce a bug here, we don't silently fall back to an O(n^2) algorithm... much harder to track down than a clear error in testing.
General thought: if invariants are too tricky to reason about, sometimes I just explicitly add a (re)sort! I believe Go's default sort is very fast when the input is already sorted. This bounds the worst case nicely.
I also checked the invariants. We additionally have the invariant that matches don't overlap, which is also important since we look up the end column.

The sorted invariant is actually quite important for other bits of code like chunkCandidates. So what I did was add a sorted check which loudly complains and then sorts if the invariant is broken.

Initially I pretended I was a Haskell programmer and added a special type which guaranteed this, but TBH it felt quite overengineered. Happy to try it out if there is interest, but for now gonna merge with extra perf invariant documentation and the sort check.
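A loud sorted check along those lines might look something like this. This is a sketch under assumptions: the real candidate type, field names, and logging in zoekt differ.

```go
package main

import (
	"fmt"
	"sort"
)

// candidateMatch is a stand-in for zoekt's real candidate type; only the
// byte offset matters for this sketch.
type candidateMatch struct {
	byteOffset uint32
}

// sortIfNeeded enforces the increasing-byteOffset invariant. It loudly
// complains and re-sorts only when the invariant is broken, so the common
// already-sorted case stays a cheap O(n) scan and a broken invariant
// never silently degrades downstream code to O(n^2).
func sortIfNeeded(cands []*candidateMatch) {
	sorted := sort.SliceIsSorted(cands, func(i, j int) bool {
		return cands[i].byteOffset < cands[j].byteOffset
	})
	if !sorted {
		fmt.Println("warn: candidate matches were not sorted by byteOffset")
		sort.Slice(cands, func(i, j int) bool {
			return cands[i].byteOffset < cands[j].byteOffset
		})
	}
}

func main() {
	cands := []*candidateMatch{{byteOffset: 9}, {byteOffset: 2}, {byteOffset: 5}}
	sortIfNeeded(cands)
	for _, c := range cands {
		fmt.Println(c.byteOffset)
	}
}
```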
Great to fix this!
Assigning this to myself! Said differently: please don't work on this concurrently, as it will cause a lot of conflicts with my in-progress searcher optimizations (https://github.com/sourcegraph/sourcegraph/issues/59038) :)
This PR improves how searcher creates matches, making it more consistent with how it's done in Zoekt. Changes:

* Pull chunking logic out of structural search code and into its own file `chunk.go`
* Remove overlapping ranges (this is what Zoekt does when chunk matches are enabled)
* Optimize the column calculation using the same strategy from Zoekt ([zoekt#711](sourcegraph/zoekt#711))
This change uses the fact that candidate matches should be increasing in byte offset to avoid recounting runes on a line. Before this change, if you had many matches on the same line we would call `utf8.RuneCount` for each match, which is an O(nm) algorithm where n is your line length and m is the number of matches. After this change the complexity is O(n).

I came across this while investigating slow performance for searching the string "dev" on s2: it took 2s when the match limit was 100k instead of 10k. With 10k it would take 0.04s. It turns out that with the larger limit we ended up searching a file where the word "dev" appeared many times on one line. Running a profiler against the service showed 96% of CPU time in `utf8.RuneCount`.

This commit adds a benchmark for the helper introduced to reuse rune counts. Unsurprisingly, the difference between O(nm) and O(n) is massive :)

See details in a comment below for how I obtained the profiles and the information from them.

Test Plan: Added tests and benchmarks.
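One self-contained way to sanity-check a helper like this is to compare the single-pass cached computation against the naive per-match recount. The function names below (`naiveColumn`, `cachedColumns`) are hypothetical, illustrating the test idea rather than the actual zoekt tests.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// naiveColumn recounts the whole line prefix for every match:
// O(n) per call, O(nm) across m matches on the line.
func naiveColumn(data []byte, offset int) int {
	return utf8.RuneCount(data[:offset]) + 1
}

// cachedColumns computes columns for offsets sorted in increasing order in
// a single pass, O(n) total, by counting only the runes between
// consecutive offsets.
func cachedColumns(data []byte, offsets []int) []int {
	cols := make([]int, len(offsets))
	lastOffset, lastRunes := 0, 0
	for i, off := range offsets {
		lastRunes += utf8.RuneCount(data[lastOffset:off])
		lastOffset = off
		cols[i] = lastRunes + 1
	}
	return cols
}

func main() {
	// Mimic the pathological case: the same word repeated on one line.
	line := []byte(strings.Repeat("dev ", 5))
	offsets := []int{0, 4, 8, 12, 16} // start of each "dev"
	cols := cachedColumns(line, offsets)
	for i, off := range offsets {
		// The single-pass result must agree with the naive recount.
		if cols[i] != naiveColumn(line, off) {
			panic("mismatch")
		}
	}
	fmt.Println(cols) // [1 5 9 13 17]
}
```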