
Benchmark and perf fixes #636

Merged: 12 commits into main, Jul 5, 2024

Conversation

@vigoo (Contributor) commented Jul 4, 2024

Resolves #624

Reviewed, fixed and analysed all the benchmarks and made changes based on the results in both the services and the test framework.

In detail:

Compilation cache service was using the wrong wasmtime flags
Whether to include backtrace information in the compiled WASMs was not explicitly configured, so it was taken from an environment variable, which was set differently for the compilation service and the worker executor. Because of this, the precompiled images were incompatible with the executor, so it could never use them.
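
As an illustration (a minimal sketch, not the actual Golem configuration code): wasmtime's backtrace-details flag defaults to reading the WASMTIME_BACKTRACE_DETAILS environment variable, so pinning it explicitly makes the compilation service and the worker executor produce and expect compatible artifacts.

```rust
use wasmtime::{Config, Engine, WasmBacktraceDetails};

fn engine_with_pinned_backtrace_details() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // The default is WasmBacktraceDetails::Environment, i.e. whatever the
    // WASMTIME_BACKTRACE_DETAILS environment variable says in the current
    // process; setting it explicitly removes that per-service variability.
    config.wasm_backtrace_details(WasmBacktraceDetails::Enable);
    Engine::new(&config)
}
```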

Retry and timeout in benchmarks
Before switching to reusing gRPC connections, the benchmarks were quickly and randomly failing on connection issues. It was not possible to handle them (with retries, for example) because the test framework just panicked on every error. I refactored the test framework interface to return Results, but kept the "panicking mode" as well, depending on which interface you import (TestDsl or TestDslUnsafe).
This way the benchmarks can retry and be more reliable, and closer to real-life scenarios.
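
A sketch of how such a split can look, assuming the async_trait crate and simplified method signatures (the real TestDsl / TestDslUnsafe traits in the test framework are richer): the Result-returning trait is the primary interface, and a blanket implementation derives the panicking variant from it.

```rust
use async_trait::async_trait;

// Placeholder types standing in for the real ones in the test framework.
type InvocationResult = Vec<String>;
type TestError = anyhow::Error;

/// Result-returning interface: callers (e.g. benchmarks) can retry or time out.
#[async_trait]
pub trait TestDsl {
    async fn invoke_and_await(
        &self,
        worker: &str,
        function: &str,
        params: Vec<String>,
    ) -> Result<InvocationResult, TestError>;
}

/// Panicking interface, available for everything that implements `TestDsl`,
/// so existing tests can keep the terser style by importing this trait instead.
#[async_trait]
pub trait TestDslUnsafe {
    async fn invoke_and_await(
        &self,
        worker: &str,
        function: &str,
        params: Vec<String>,
    ) -> InvocationResult;
}

#[async_trait]
impl<T: TestDsl + Sync> TestDslUnsafe for T {
    async fn invoke_and_await(
        &self,
        worker: &str,
        function: &str,
        params: Vec<String>,
    ) -> InvocationResult {
        TestDsl::invoke_and_await(self, worker, function, params)
            .await
            .expect("invoke_and_await failed")
    }
}
```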

There was also an issue in the worker executor (explained below) that caused infinite invocations: the invoke-and-await never returned. This was not handled in the benchmarks either, so the benchmark looked stalled. Now we have timeouts as well.

Both retries and timeouts are measured by the benchmark and reported alongside the actual latencies.
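
For illustration, a minimal sketch with made-up names (InvocationStats and invoke_with_retry_and_timeout are not from the PR) of wrapping an attempt so retries and timeouts are counted and can be reported next to the latencies:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

// Made-up counters that a benchmark run could report next to the latencies.
#[derive(Default, Debug)]
struct InvocationStats {
    retries: u64,
    timeouts: u64,
}

async fn invoke_with_retry_and_timeout<F, Fut, T, E>(
    stats: &mut InvocationStats,
    max_attempts: usize,
    per_attempt_timeout: Duration,
    mut attempt: F,
) -> Option<T>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    for _ in 0..max_attempts {
        match timeout(per_attempt_timeout, attempt()).await {
            Ok(Ok(result)) => return Some(result),
            Ok(Err(_)) => stats.retries += 1, // e.g. transient connection error: retry
            Err(_) => stats.timeouts += 1,    // stuck invocation: count it and retry
        }
    }
    None
}
```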

Worker executor event channel issue

The worker executor uses a global (per-executor) event channel to broadcast ready invocations to the request handlers that are waiting for them (invoke-and-awaits). It was using tokio::sync::broadcast with a bounded channel configured with a very small capacity, because I misunderstood how it works (I expected it to apply back pressure on the send side). That is not how it works, see https://docs.rs/tokio/latest/tokio/sync/broadcast/index.html#lagging
This caused events to be dropped and invocations continuing to wait for them forever. Now this logic is fixed in general, and also the channel is configured with a high capacity. Also we no longer limit the per-connection parallel connection count in the worker executor.
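
As a self-contained illustration of the linked lagging behaviour (mirroring the tokio docs, not code from this PR): sends never block, and a receiver that falls behind gets a Lagged error while the oldest values are silently dropped.

```rust
use tokio::sync::broadcast;
use tokio::sync::broadcast::error::RecvError;

#[tokio::main]
async fn main() {
    // Capacity 2: the channel only retains the 2 most recent values.
    let (tx, mut rx) = broadcast::channel::<u32>(2);

    // Sending never blocks and never applies back pressure; once the capacity
    // is exceeded, the oldest values are silently overwritten.
    for i in 0..5 {
        tx.send(i).unwrap();
    }

    // The receiver first learns that it lagged behind (3 values were dropped)...
    assert!(matches!(rx.recv().await, Err(RecvError::Lagged(3))));
    // ...and then only sees the most recent values.
    assert_eq!(rx.recv().await.unwrap(), 3);
    assert_eq!(rx.recv().await.unwrap(), 4);
}
```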

Benchmarks were run on debug builds

By default, benchmarks were run on debug builds of the services (both locally and on CI), which does not make sense. Now we run them on release builds.

Fixed warmups in benchmarks

Some of the benchmarks did not warm up as originally planned; fixed those.

Single reused gRPC channel

It turned out that the intended way to do parallel gRPC requests with tonic is to simply clone the connection. Under the hood it maintains an HTTP/2 connection that supports multiplexing and also deals with connection errors and so on.

I have introduced two reusable components, GrpcClient and MultiTargetGrpcClient, which cache such a connection but also detect when the target service starts responding with gRPC UNAVAILABLE. In that case they invalidate the connection and try to reconnect with an exponential backoff.
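
The actual GrpcClient and MultiTargetGrpcClient are more elaborate than this; below is only a minimal sketch of the idea using tonic's Endpoint/Channel (the struct and method names here are made up): clones of one cached Channel serve parallel requests, and the channel is invalidated on UNAVAILABLE so the next caller reconnects with exponential backoff.

```rust
use std::time::Duration;
use tokio::sync::RwLock;
use tonic::transport::{Channel, Endpoint};
use tonic::{Code, Status};

/// A cached gRPC channel to a single target. `Channel` is cheap to clone and
/// multiplexes requests over one HTTP/2 connection, so handing out clones is
/// the intended way to do parallel requests with tonic.
pub struct CachedGrpcChannel {
    endpoint: Endpoint,
    channel: RwLock<Option<Channel>>,
}

impl CachedGrpcChannel {
    pub fn new(endpoint: Endpoint) -> Self {
        Self { endpoint, channel: RwLock::new(None) }
    }

    /// Returns a clone of the cached channel, (re)connecting with exponential
    /// backoff if there is no usable channel at the moment.
    pub async fn get(&self) -> Channel {
        if let Some(channel) = self.channel.read().await.as_ref().cloned() {
            return channel;
        }
        let mut delay = Duration::from_millis(100);
        loop {
            match self.endpoint.connect().await {
                Ok(channel) => {
                    *self.channel.write().await = Some(channel.clone());
                    return channel;
                }
                Err(_) => {
                    tokio::time::sleep(delay).await;
                    delay = (delay * 2).min(Duration::from_secs(5));
                }
            }
        }
    }

    /// Called on every error response: if the target reports UNAVAILABLE, the
    /// cached channel is dropped so the next call reconnects.
    pub async fn invalidate_on(&self, status: &Status) {
        if status.code() == Code::Unavailable {
            *self.channel.write().await = None;
        }
    }
}
```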

Changed every gRPC communication (all inter-service communication and also the test framework) to use this technique.
Benchmarks show that it works and is actually much faster than opening a connection per request.

Note that we already used this in one place, between the worker-service and the worker-executors, but we never handled the UNAVAILABLE error, which I think could lead to infinitely stuck connections (and can be one of the reasons for the known sharding test errors).

Single gRPC channel optional in tests

Unfortunately, in tests I had to make the above optional, because tokio::test creates a separate tokio runtime for each test case, and that is not compatible with reusing the channel: hyperium/tonic#942 (comment)

Caching component metadata in worker-executor

Worker executors were always caching components, but only recently started to always require component metadata as well, and we forgot to cache that. Now it is cached.
One thing that is not cached yet is the case when we want to create a new worker with the latest version: as we don't know whether there were any component updates, we always have to query the latest metadata in those cases. This could be improved, if needed, by the component service broadcasting component updates to the worker executors, but that is out of scope for this change.
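
To illustrate the caching rule described above, a minimal sketch with placeholder types (not the executor's real ones): metadata of a pinned component version is immutable and can be cached indefinitely, while "latest version" lookups still go to the component service.

```rust
use std::collections::HashMap;
use std::future::Future;
use tokio::sync::Mutex;

// Placeholder types; the executor's real identifiers and metadata are richer.
type ComponentId = String;
type ComponentVersion = u64;
#[derive(Clone)]
struct ComponentMetadata;

struct MetadataCache {
    entries: Mutex<HashMap<(ComponentId, ComponentVersion), ComponentMetadata>>,
}

impl MetadataCache {
    /// Metadata of a pinned component version never changes, so it can be
    /// cached indefinitely once fetched.
    async fn get_pinned(
        &self,
        id: &ComponentId,
        version: ComponentVersion,
        fetch: impl Future<Output = ComponentMetadata>,
    ) -> ComponentMetadata {
        let key = (id.clone(), version);
        if let Some(found) = self.entries.lock().await.get(&key) {
            return found.clone();
        }
        let metadata = fetch.await;
        self.entries.lock().await.insert(key, metadata.clone());
        metadata
    }

    // "Latest version" requests are intentionally not cached: without the
    // component service broadcasting updates, the executor cannot know whether
    // a newer version exists, so it always has to ask.
}
```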

Some benchmarked binaries were debug

The non-Golem HTTP server implementation was compiled in debug mode, which caused it to run significantly slower than the Golem version of the same code, which was compiled in release mode.
The components for the RPC test case were also compiled in debug mode.

This is all fixed now and everything is measuring release builds.

RPC benchmark was only running a single worker pair

Somehow the RPC benchmark missed the proper setup and always ran only a single pair of workers, no matter what parameters it received. This made it impossible to differentiate between local and remote RPC calls, for example. Fixed.

Native throughput benchmark using blocking tasks
Although it does not really change the results, the native throughput benchmark server now properly uses spawn_blocking around the CPU-intensive request handlers.
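
A minimal sketch of the pattern (the handler body is a made-up stand-in for the benchmark's real work): the CPU-bound computation moves to tokio's blocking thread pool so it does not stall the async runtime's worker threads.

```rust
use tokio::task::spawn_blocking;

// Made-up CPU-bound work standing in for the benchmark's real handler body.
fn cpu_intensive(input: u64) -> u64 {
    (0..input).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

async fn handle_request(input: u64) -> u64 {
    // Without spawn_blocking the computation would occupy an async worker
    // thread and stall other tasks scheduled on it; with it, the work runs on
    // tokio's blocking thread pool and the handler just awaits the result.
    spawn_blocking(move || cpu_intensive(input))
        .await
        .expect("blocking task panicked")
}
```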

Benchmark code redundancy

All the benchmarks were full of copy-pasted code which is now gone.

Simplified code in worker-service

The worker service's service/worker/default.rs, which got rewritten a bit because of the gRPC client changes, had a lot of unnecessary clone()s and async {} blocks, which I cleaned up.

Primary and secondary benchmark keys

Measuring latency per worker can be useful for debugging benchmark results, but it is just noise for an overall view of how each benchmark performs, so I introduced a way to distinguish primary and secondary benchmark results, and a CLI option to show only the primary ones.
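
For illustration only (the names and shapes below are made up, not the benchmark framework's real types): each recorded result carries a primary flag, and a "primary only" style option simply filters the report.

```rust
// Made-up shapes for illustration; the benchmark framework's real types differ.
struct RecordedMetric {
    key: String,
    primary: bool, // per-benchmark aggregates are primary, per-worker details are not
    avg_latency_ms: f64,
}

/// With a "primary only" style CLI flag, the report hides the noisy secondary
/// (e.g. per-worker) results and shows only the overall view.
fn report(metrics: &[RecordedMetric], primary_only: bool) {
    for metric in metrics.iter().filter(|m| m.primary || !primary_only) {
        println!("{}: {:.2} ms", metric.key, metric.avg_latency_ms);
    }
}
```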

Benchmark on CI

The benchmarks on CI were using the default settings for size, iterations, cluster size, etc., which is incorrect (those numbers were never intended to be applied to each benchmark, they were just an example).

Now the CI job uses the benchmark runner's test matrix feature to run multiple combinations, and also uses the above-mentioned "primary only" mode.

Inline review comment on the changed use crate::components::component_service import in the k8s test components:
@afsalthaj (Contributor) commented Jul 4, 2024

I know this was taken from another benchmark implementation done while we were in Barcelona. However, this k8s file uses pod-level yaml files, which requires the load balancer to be spun up separately (taking time), and it also ends up calling K8sComponentService::new(..) multiple times when a cluster is needed. As far as I can see, all of the k8s implementations across components would have to change.

I know this is not the real concern of this PR, but since there is a change in these files, I thought of giving a heads-up.

@vigoo (Contributor, Author) replied:

Only the gRPC connection setup changed in these implementations, and we are not using the k8s implementations at the moment. So you must be right, but this is out of scope for this change.

@vigoo vigoo marked this pull request as ready for review July 4, 2024 14:23
@afsalthaj (Contributor) commented Jul 5, 2024

@vigoo Curiosity driven question:

I feel this is the most interesting part of the PR. Before I dig too much into how this is fixed:

This caused events to be dropped and invocations continuing to wait for them forever. Now this logic is fixed in general, and also the channel is configured with a high capacity. Also we no longer limit the per-connection parallel connection count in the worker executor.

So what is the general solution if the capacity is reached and we get a lag error? Say even at a higher capacity, is there a chance of these request handlers being slow and finally missing out on events?
Maybe a Tokio question: is there a mechanism in this approach that guarantees we never end up missing any message, regardless of the pressure? Also, could you explain why, even with a small capacity, these receivers (request handlers) are slow and not able to process these ready invocations? Is the main reason too many requests against that worker? Can I say this happens only for invoke-and-await, and not for invoke? In which part of the benchmark did we capture this? Too many requests invoking functions within a worker?

Also, I didn't quite get this: "Also we no longer limit the per-connection parallel connection count in the worker executor." It would be nice if you explained this a bit more :)

-- After thinking about it: did you mean the per-worker parallel connections, and was making it bounded the main culprit that ended up dropping events?

@vigoo (Contributor, Author) commented Jul 5, 2024

@vigoo Curiosity driven question:

This is all about invoke-and-await. It is implemented by doing a general "invoke", which puts the invocation request into the worker's invocation queue and activates the worker if needed. Each worker has its own invocation queue processor, which will eventually perform the invocation and store the result in the worker's memory (and in the oplog), where it can be looked up by the invocation's idempotency key. At the same time, the "await" part of the invocation API waits for this to happen, and that is implemented with the mentioned broadcast channel. (There could be per-worker broadcast channels instead; I did not investigate the performance difference, so for now it remains a per-executor events channel.)

To reproduce the issue we just had to do many parallel invoke-and-await requests in the benchmarks.

There were two ways this could go wrong:

  • the one I mentioned in the PR description: if too many invocations finish at the same time, many results get pushed into the broadcast channel, and even though the receivers are not slow at all (as you ask), nothing in the tokio runtime guarantees that the receivers will be scheduled before the channel fills up if it is small. So events were dropped this way (I proved it with logs once I realised this can happen).
  • there was also a race condition: the code first checked whether the result was already available and only then subscribed to the broadcast channel, so it was possible to miss an event in between.

Both are fixed in the PR by the new logic, which does the following (sketched in code below):

  • First we subscribe, before doing anything else
  • Then we check whether the result is already there; if yes, we immediately drop the subscription and return
  • Otherwise we wait for the event
  • If we get a "lagged" error, we sleep a bit and retry the whole thing, first checking whether the invocation is ready and then starting to receive events again.

That said, it would be better to temporarily drop the subscription in the last step so as not to cause further lag, but even with this fix the issue never came back with any benchmark setup; instead other things started to break (like allocating so much virtual memory that my computer almost died :D).
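
A minimal sketch of this subscribe-first flow, with placeholder types and a single made-up event variant (the executor's real events and result types differ):

```rust
use std::time::Duration;
use tokio::sync::broadcast;
use tokio::sync::broadcast::error::RecvError;

// Placeholder types and a made-up event variant.
type IdempotencyKey = String;
type InvocationResult = Vec<u8>;

#[derive(Clone)]
enum ExecutorEvent {
    InvocationCompleted(IdempotencyKey),
}

async fn wait_for_result(
    events: &broadcast::Sender<ExecutorEvent>,
    key: &IdempotencyKey,
    lookup: impl Fn(&IdempotencyKey) -> Option<InvocationResult>,
) -> InvocationResult {
    loop {
        // 1. Subscribe before anything else, so no event can slip by unseen.
        let mut rx = events.subscribe();

        // 2. If the result is already there, drop the subscription and return.
        if let Some(result) = lookup(key) {
            return result;
        }

        // 3. Otherwise wait for the matching event.
        loop {
            match rx.recv().await {
                Ok(ExecutorEvent::InvocationCompleted(k)) if &k == key => {
                    if let Some(result) = lookup(key) {
                        return result;
                    }
                }
                Ok(_) => continue, // event for another invocation
                Err(RecvError::Lagged(_)) => {
                    // 4. We may have missed our event: back off briefly, then
                    //    restart from the subscribe-then-check sequence.
                    tokio::time::sleep(Duration::from_millis(10)).await;
                    break;
                }
                Err(RecvError::Closed) => panic!("event channel closed"),
            }
        }
    }
}
```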

Also, I didn't quite get this: "Also we no longer limit the per-connection parallel connection count in the worker executor." It would be nice if you explained this a bit more :)

-- After thinking about it: did you mean the per-worker parallel connections, and was making it bounded the main culprit that ended up dropping events?

It's a tonic gRPC server setting that controls the number of concurrent requests on a single HTTP/2 connection. I felt this is an arbitrary thing to set (as there can be multiple connections, etc.), and we did not set it in our other services, so I removed it.
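
If I read this correctly, the setting corresponds to tonic's concurrency_limit_per_connection on the server builder; a sketch of what such a (now removed) limit would look like, with a made-up value:

```rust
use tonic::transport::Server;

// What such a (now removed) per-connection limit looks like; 32 is a made-up value.
fn limited_server_builder() -> Server {
    Server::builder().concurrency_limit_per_connection(32)
}

// Without the call, requests multiplexed over one HTTP/2 connection are not
// capped by the server.
fn plain_server_builder() -> Server {
    Server::builder()
}
```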

@afsalthaj (Contributor) commented:

Okay, making sense.

@afsalthaj (Contributor) commented:

/run-benchmark

@vigoo merged commit ed5e0f3 into main on Jul 5, 2024
14 checks passed
@vigoo deleted the benchmark-and-perf-fixes branch on July 5, 2024 07:15
Linked issue: Investigate and improve benchmarks and server performance