
shards: respect scheduler and use smarter synchronization for List #750

Merged: 1 commit merged into main from k/concurrency on Mar 26, 2024

Conversation

keegancsmith (Member)

Previously List would never call proc.Yield, which broke the co-operative scheduler in the case of slow List calls. Additionally, we used a naive concurrency scheme (a large buffered channel), which showed up in the profiler as 30% of CPU under List being spent on chan_-related operations.

This commit follows how Search used to respect proc.Yield; see sched.go in 90ed7bf. We did not copy Search's implementation because it supports streaming and is therefore more complicated than we need; List is still batch only.

We needed to use errgroup to ensure we drain all channels in the case of an error. Previously this was unnecessary because the channels had a buffer size of len(shards), which guaranteed nothing would ever block. Now channels are never larger than the number of workers (<= GOMAXPROCS).

Test Plan: go test covers the no-error cases. For the error case we tested manually by running zoekt-webserver and injecting a random context cancellation. We observed the error being reported and no List goroutines left running.

Stacked on #749
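
For reference, here is a minimal, self-contained sketch of the worker-pool pattern described above: an errgroup plus a feeder channel bounded by the worker count, with every blocking send guarded so the channels drain on error. The names (shard, listOne, listAll) are placeholders rather than zoekt's actual API, and the co-operative scheduler yield is only indicated in a comment.

package main

import (
	"context"
	"fmt"
	"runtime"

	"golang.org/x/sync/errgroup"
)

// shard and listOne stand in for zoekt.Searcher and listOneShard.
type shard string

func listOne(ctx context.Context, s shard) (string, error) {
	// In the real code, this is roughly where the co-operative scheduler
	// would be respected (via proc.Yield) before doing the per-shard work.
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "listed " + string(s), nil
}

func listAll(ctx context.Context, shards []shard) ([]string, error) {
	g, ctx := errgroup.WithContext(ctx)

	// Bound work by number of CPUs.
	workers := min(runtime.GOMAXPROCS(0), len(shards))

	feeder := make(chan shard, workers)
	results := make(chan string, workers)

	// Feed shards to the workers. Because the buffer holds only `workers`
	// items (not len(shards)), sends can block, so we watch ctx.Done to
	// avoid leaking this goroutine after an error.
	g.Go(func() error {
		defer close(feeder)
		for _, s := range shards {
			select {
			case feeder <- s:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return nil
	})

	// Worker pool: each worker drains feeder until it is closed or the
	// errgroup's context is cancelled by the first error.
	for range workers {
		g.Go(func() error {
			for s := range feeder {
				r, err := listOne(ctx, s)
				if err != nil {
					return err
				}
				select {
				case results <- r:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			return nil
		})
	}

	// Close results once all senders have finished so the collector
	// below terminates.
	go func() {
		_ = g.Wait()
		close(results)
	}()

	var all []string
	for r := range results {
		all = append(all, r)
	}
	return all, g.Wait()
}

func main() {
	out, err := listAll(context.Background(), []shard{"a", "b", "c"})
	fmt.Println(out, err)
}

The important property is that every blocking send has a cancellation escape hatch, so when one worker returns an error the whole group unwinds instead of blocking on a partially drained channel.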

@keegancsmith keegancsmith requested a review from a team March 26, 2024 14:35
@stefanhengl (Member)

Previously List would never call proc.Yield, which broke the co-operative scheduler in the case of slow List calls
😮

Base automatically changed from k/list to main March 26, 2024 15:06
@stefanhengl (Member)

Additionally, we used a naive concurrency scheme (a large buffered channel), which showed up in the profiler as 30% of CPU under List being spent on chan_-related operations.

I see that the new implementation has much better concurrency, but shouldn't we still see similar chan-related operations?

Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency scheme (a large buffered channel), which showed
up in the profiler as 30% of CPU under List being spent on chan_-related
operations.

This commit follows how Search used to respect proc.Yield; see sched.go
in 90ed7bf. We did not copy Search's implementation because it supports
streaming and is therefore more complicated than we need; List is still
batch only.

We needed to use errgroup to ensure we drain all channels in the case of
an error. Previously this was unnecessary because the channels had a
buffer size of len(shards), which guaranteed nothing would ever block.
Now channels are never larger than the number of workers (<=
GOMAXPROCS).

Test Plan: go test covers the no-error cases. For the error case we
tested manually by running zoekt-webserver and injecting a random
context cancellation. We observed the error being reported and no List
goroutines left running.

Co-authored-by: William Bezuidenhout <[email protected]>
@keegancsmith (Member, Author)

Additionally, we used a naive concurrency scheme (a large buffered channel), which showed up in the profiler as 30% of CPU under List being spent on chan_-related operations.

I see that the new implementation has much better concurrency, but shouldn't we still see similar chan-related operations?

I'm not sure if Go has poorer performance with very large buffered channels vs smaller ones. The fact that we don't see chan_ operations appear for normal search in the flamegraphs makes me think that is the case, but this is totally unconfirmed.

A more likely explanation is that now that we do selectRepoSet (like normal search does), we send much less work over the channels, so the smaller buffered channels themselves aren't actually what helps.

However, the most important part of this change is the proc.Yield. And while I was introducing that I thought we might as well use a similar pattern for concurrency control.
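
To make the buffer-size comparison concrete, here is a toy illustration of the two sizing strategies; the shard count and the result type are made up for the example, only the capacities mirror the discussion above.

package main

import (
	"fmt"
	"runtime"
)

// result is a placeholder for a per-shard List result.
type result struct{}

func main() {
	// Hypothetical instance with a few thousand shards.
	shards := make([]string, 5000)

	// Old approach: one buffer slot per shard, so a send can never block
	// and nothing has to be drained when an error occurs.
	perShard := make(chan result, len(shards))

	// New approach: the buffer is bounded by the worker count, so an
	// errgroup is needed to guarantee the channels are drained on error.
	workers := min(runtime.GOMAXPROCS(0), len(shards))
	perWorker := make(chan result, workers)

	fmt.Println("old capacity:", cap(perShard), "new capacity:", cap(perWorker))
}

Whether the smaller buffer itself saves CPU is, as noted above, unconfirmed; the sketch only shows the structural difference between the two approaches.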

@keegancsmith (Member, Author)

keegancsmith commented Mar 26, 2024

Manual testing results:

zoekt on  k/concurrency [$!] ❄️ zoekt-env took 37s git diff
diff --git a/shards/shards.go b/shards/shards.go
index aa384371..1f3011fa 100644
--- a/shards/shards.go
+++ b/shards/shards.go
@@ -951,7 +951,7 @@ func (ss *shardedSearcher) List(ctx context.Context, r query.Q, opts *zoekt.List
        g, ctx := errgroup.WithContext(ctx)
 
        // Bound work by number of CPUs.
-       workers := min(runtime.GOMAXPROCS(0), len(shards))
+       workers := min(runtime.GOMAXPROCS(0), len(shards), 2)
 
        var (
                feeder = make(chan zoekt.Searcher, workers)
@@ -976,7 +976,12 @@ func (ss *shardedSearcher) List(ctx context.Context, r query.Q, opts *zoekt.List
        // send results down all. If an error is encountered we cancel the errgroup.
        for range workers {
                g.Go(func() error {
+                       i := 0
                        for s := range feeder {
+                               i++
+                               if i > 1 {
+                                       return fmt.Errorf("boom")
+                               }
                                result, err := listOneShard(ctx, s, r, opts)
                                if err != nil {
                                        return err

zoekt on  k/concurrency [$!] ❄️ zoekt-env go run ./cmd/zoekt-webserver -pprof
2024/03/26 17:25:07 loading 5 shard(s): github.com%2Fgolang%2Fgo_v16.00000.zoekt, github.com%2Fsourcegraph%2Fcody_v16.00000.zoekt, github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt, github.com%2Fsourcegraph%2Fsymf-private_v16.00000.zoekt, github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt

[screenshot attached]

@keegancsmith keegancsmith merged commit 931974d into main Mar 26, 2024
7 of 8 checks passed
@keegancsmith keegancsmith deleted the k/concurrency branch March 26, 2024 15:27
keegancsmith added a commit that referenced this pull request Mar 27, 2024
Revert "shards: respect scheduler and use smarter synchronization for List (#750)"

This reverts commit 931974d.

After monitoring production, we are seeing increased latencies in List
calls. Additionally, we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now pending further investigation.

Test Plan: go test
keegancsmith added a commit that referenced this pull request Mar 27, 2024
Revert "shards: respect scheduler and use smarter synchronization for List (#750)" (#752)

This reverts commit 931974d.

After monitoring production, we are seeing increased latencies in List
calls. Additionally, we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now pending further investigation.

Test Plan: go test
keegancsmith added a commit that referenced this pull request Mar 27, 2024
shards: respect scheduler and use smarter synchronization for List (#750)

Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency scheme (a large buffered channel), which showed
up in the profiler as 30% of CPU under List being spent on chan_-related
operations.

This commit follows how Search used to respect proc.Yield; see sched.go
in 90ed7bf. We did not copy Search's implementation because it supports
streaming and is therefore more complicated than we need; List is still
batch only.

We needed to use errgroup to ensure we drain all channels in the case of
an error. Previously this was unnecessary because the channels had a
buffer size of len(shards), which guaranteed nothing would ever block.
Now channels are never larger than the number of workers (<=
GOMAXPROCS).

Test Plan: go test covers the no-error cases. For the error case we
tested manually by running zoekt-webserver and injecting a random
context cancellation. We observed the error being reported and no List
goroutines left running.

Co-authored-by: William Bezuidenhout <[email protected]>
(cherry picked from commit 931974d)
keegancsmith added a commit that referenced this pull request Mar 27, 2024
Revert "shards: respect scheduler and use smarter synchronization for List (#750)" (#752)

This reverts commit 931974d.

After monitoring production, we are seeing increased latencies in List
calls. Additionally, we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now pending further investigation.

Test Plan: go test

(cherry picked from commit 8cf8887)