Benchmark and perf fixes #636
Conversation
```
@@ -12,21 +12,26 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use crate::components::component_service::{env_vars, wait_for_startup, ComponentService};
use crate::components::component_service::{
```
I know this was taken from another benchmark implementation done while we were in Barcelona. However, this k8s setup uses pod-level yaml files, which requires the load balancer to be spun up separately (taking time) and also ends up calling K8sComponentService::new(..) multiple times when a cluster is needed. As far as I can see, all of the k8s implementations across components would have to change.
I know this is not the real concern of this PR, but since there is a change in these files, I thought I'd give a heads-up.
Only the gRPC connection setup changed in these implementations, and we are not using the k8s implementations at the moment. So you are probably right, but this is out of scope of this change.
@vigoo Curiosity-driven question: I feel this is the most interesting part of the PR. Before I dig too much into how this is fixed: what is the general solution when the capacity is reached and we get a lag error? Even at a higher capacity, is there a chance that these request handlers are slow and end up missing events? Also, I didn't quite get this: after thinking about it, did you mean that the per-worker parallel connection, and making it bounded, was the main culprit of ending up dropping events?
This is all about invoke-and-await. It is implemented by doing a general "invoke", which puts the invocation request in the worker's invocation queue and activates the worker if needed. Each worker has its own invocation queue processor which will eventually perform the invocation and store the result in the worker's memory (and in the oplog), where it can be looked up by the invocation's idempotency key. At the same time, the "await" version of the invocation API waits for this to happen, and that is implemented with the mentioned broadcast channel. (There could be per-worker broadcast channels instead; I did not investigate the performance difference, so for now it remains a per-executor events channel.) To reproduce the issue we just had to do many parallel invoke-and-await requests in the benchmarks. There were two ways this could have gone wrong:
Both are fixed in the PR by the new logic which does the following:
That said, it would be better to temporarily drop the subscription in the last point to avoid causing further lag, but even with this fix the issue never came back with any benchmark setup, and the benchmarks started to break on other things instead (like allocating so much virtual memory that my computer almost died :D ).
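To make the mechanism above more concrete, here is a minimal sketch of the general pattern; this is not the actual golem code, and all types and names are hypothetical. The handler subscribes to the per-executor broadcast channel first, then checks the stored results, and re-checks them whenever an event (or a lag error) arrives, so it can never end up waiting forever:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{broadcast, RwLock};

// Hypothetical stand-ins for the real golem types.
type IdempotencyKey = String;
type InvocationResult = Vec<u8>;

struct Executor {
    // Completed invocations, looked up by idempotency key.
    results: Arc<RwLock<HashMap<IdempotencyKey, InvocationResult>>>,
    // Per-executor broadcast channel announcing completed invocations.
    events: broadcast::Sender<IdempotencyKey>,
}

impl Executor {
    // "invoke-and-await": enqueue the invocation, then wait until the result
    // shows up, either in the stored results or via a broadcast event.
    async fn invoke_and_await(&self, key: IdempotencyKey) -> InvocationResult {
        // Subscribe before checking stored results, so no completion event
        // can slip through between the check and the subscription.
        let mut rx = self.events.subscribe();

        self.enqueue_invocation(&key).await;

        loop {
            if let Some(result) = self.results.read().await.get(&key) {
                return result.clone();
            }
            match rx.recv().await {
                // Any completion event: re-check the results map.
                Ok(_) => continue,
                // Lagged: events were dropped, but re-checking the map means
                // we do not wait forever for the missed event.
                Err(broadcast::error::RecvError::Lagged(_)) => continue,
                Err(broadcast::error::RecvError::Closed) => {
                    panic!("executor event channel closed")
                }
            }
        }
    }

    async fn enqueue_invocation(&self, _key: &IdempotencyKey) {
        // Put the request into the worker's invocation queue (omitted here).
    }
}
```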
It's a tonic gRPC server setting that controls the number of concurrent requests on one HTTP/2 connection. I felt it was a fairly arbitrary thing to set (as there can be multiple connections etc.), and we did not set it in our other services, so I removed it.
Okay, making sense.
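For context, the setting in question is presumably tonic's per-connection concurrency limit on the server builder (an assumption on my part, based on the description above); roughly:

```rust
use tonic::transport::Server;

fn configure() {
    // With a cap: at most 32 concurrent requests per multiplexed
    // HTTP/2 connection (the kind of setting that was removed).
    let _capped = Server::builder().concurrency_limit_per_connection(32);

    // Without the cap (the default): a single connection's parallelism is
    // only bounded by the client and by overall server resources.
    let _uncapped = Server::builder();
}
```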
/run-benchmark
Resolves #624
Reviewed, fixed and analysed all the benchmarks, and made changes in both the services and the test framework based on the results.
In detail:
Compilation cache service was using wrong wasmtime flags
Whether to include backtrace information in the compiled WASM was not explicitly configured, so it was taken from an environment variable, which was set differently for the compilation service and the worker executor. Because of this, the precompiled images were incompatible with the executor, so it could never use them.
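Assuming the flag in question is wasmtime's `wasm_backtrace_details` (whose default, `WasmBacktraceDetails::Environment`, reads the `WASMTIME_BACKTRACE_DETAILS` environment variable at runtime), the fix amounts to pinning it explicitly so both services compile with identical settings; which value Golem actually pins it to is not stated here:

```rust
use wasmtime::{Config, WasmBacktraceDetails};

// Build the engine configuration identically in the compilation service and
// in the worker executor, so precompiled modules stay compatible.
fn engine_config() -> Config {
    let mut config = Config::new();
    // Explicit instead of WasmBacktraceDetails::Environment, which would
    // silently follow whatever WASMTIME_BACKTRACE_DETAILS happens to be set
    // to in each service's environment.
    config.wasm_backtrace_details(WasmBacktraceDetails::Disable);
    config
}
```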
Retry and timeout in benchmarks
Before switching to reusing gRPC connections, benchmarks were quickly and randomly failing on connection issues. It was not possible to handle them (with retries, for example) because the test framework just panicked on every error. I refactored the test framework interface to return `Result`s, but kept the "panicking mode" as well, depending on which interface you import (`TestDsl` or `TestDslUnsafe`). This way the benchmarks can retry and be more reliable, and more similar to real-life scenarios.
There was also an issue in the worker executor (explained below) that caused infinite invocations: the invoke-and-await never returned. This was not handled in the benchmarks either, so it made the benchmark look stalled. Now we have timeouts as well.
Both retries and timeouts are measured by the benchmark and reported alongside the actual latencies.
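A rough sketch of the pattern this enables in the benchmarks (the function names and error type are placeholders, not the actual TestDsl API): each attempt is bounded by a timeout, failures are retried a limited number of times, and both counters are recorded next to the latencies:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Placeholder for a Result-returning TestDsl call.
async fn invoke_and_await() -> Result<Vec<u8>, String> {
    unimplemented!("call the worker via the test framework")
}

// Retry a fallible invocation a few times and treat a hanging call as a
// timeout instead of stalling the whole benchmark run.
async fn invoke_with_retry(
    max_attempts: usize,
    per_attempt: Duration,
    retries: &mut u64,
    timeouts: &mut u64,
) -> Option<Vec<u8>> {
    for _ in 0..max_attempts {
        match timeout(per_attempt, invoke_and_await()).await {
            Ok(Ok(result)) => return Some(result),
            Ok(Err(_error)) => *retries += 1, // e.g. transient connection issue
            Err(_elapsed) => *timeouts += 1,  // invocation never returned in time
        }
    }
    None
}
```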
Worker executor event channel issue
The worker executor uses a global (per-executor) event channel to broadcast completed invocations to the request handlers waiting for them (invoke-and-awaits). It was using `tokio::sync::broadcast` with a bounded channel configured with a very small capacity, because I misunderstood how it works (I expected it to apply back pressure on the sender side). That is not how it works, see https://docs.rs/tokio/latest/tokio/sync/broadcast/index.html#lagging. This caused events to be dropped and invocations to keep waiting for them forever. This logic is now fixed in general, and the channel is also configured with a high capacity. We also no longer limit the per-connection parallel request count in the worker executor.
Benchmarks were run on debug builds
Benchmarks by default were run on debug builds of the services (both locally and on CI), which does not make sense. Now we run them on release builds.
Fixed warmups in benchmarks
Some of the benchmarks did not warm up as originally planned; fixed those.
Single reused gRPC channel
It turned out that the intended way to do parallel gRPC requests with `tonic` is to simply clone the connection. Under the hood it maintains an HTTP/2 connection that supports multiplexing, and it also deals with connection errors and so on. I have introduced two reusable components, `GrpcClient` and `MultiTargetGrpcClient`, which cache such a connection but also detect when the target service starts to respond with gRPC `UNAVAILABLE`. In that case they invalidate the connection and try to reconnect with an exponential backoff. Changed every gRPC communication (all inter-service communication and also the test framework) to use this technique.
Benchmarks show that it works and that it is actually much faster than opening a connection per request.
Note that we already used this in one place, between the worker service and the worker executors, but we never handled the unavailable error, which I think could lead to infinitely stuck connections (and could be one of the reasons for the known sharding test errors).
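A simplified sketch of the idea (a hypothetical struct, not the actual GrpcClient / MultiTargetGrpcClient code): clone one tonic Channel for parallel requests, and rebuild it with exponential backoff when the target answers with UNAVAILABLE:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};
use tonic::{Code, Status};

// Keep one multiplexed channel, hand out cheap clones for parallel requests,
// and rebuild it when the target starts answering with UNAVAILABLE.
struct CachedChannel {
    endpoint: Endpoint,
    channel: Channel,
}

impl CachedChannel {
    async fn connect(uri: &'static str) -> Result<Self, tonic::transport::Error> {
        let endpoint = Endpoint::from_static(uri);
        let channel = endpoint.connect().await?;
        Ok(Self { endpoint, channel })
    }

    // Cloning a tonic `Channel` is the intended way to share it between
    // concurrent requests; they multiplex over a single HTTP/2 connection.
    fn get(&self) -> Channel {
        self.channel.clone()
    }

    // Called after a failed request; reconnect only for UNAVAILABLE.
    async fn on_error(&mut self, status: &Status) {
        if status.code() == Code::Unavailable {
            let mut delay = Duration::from_millis(100);
            loop {
                match self.endpoint.connect().await {
                    Ok(channel) => {
                        self.channel = channel;
                        break;
                    }
                    Err(_) => {
                        // Exponential backoff, capped at 5 seconds.
                        tokio::time::sleep(delay).await;
                        delay = (delay * 2).min(Duration::from_secs(5));
                    }
                }
            }
        }
    }
}
```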
Single gRPC channel optional in tests
Unfortunately, in tests I had to make the above optional, because `tokio::test` creates a separate tokio runtime for each test case, and that is not compatible with reusing the channel: hyperium/tonic#942 (comment)
Caching component metadata in worker-executor
Worker executors were always caching components, but only recently started to always require component metadata as well, and we forgot to cache those. Now it is cached.
One thing that is not cached yet is the case when we want to create a new worker with the latest version: as we don't know whether there were any component updates, we always have to query the latest metadata in those cases. This could be improved if needed by having the component service broadcast component updates to worker executors, if we decide to, but it's out of scope of this change.
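To illustrate the caching rule described above (hypothetical types, not the worker executor's actual code): metadata for a pinned component version can be cached indefinitely, while a request for "the latest version" still has to hit the component service:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

// Hypothetical stand-ins for the real types.
type ComponentId = String;
type ComponentMetadata = Vec<u8>;

#[derive(Default)]
struct MetadataCache {
    by_version: RwLock<HashMap<(ComponentId, u64), Arc<ComponentMetadata>>>,
}

impl MetadataCache {
    async fn get(&self, id: &ComponentId, version: Option<u64>) -> Arc<ComponentMetadata> {
        match version {
            // A pinned version never changes, so it can be cached forever.
            Some(v) => {
                if let Some(found) = self.by_version.read().await.get(&(id.clone(), v)) {
                    return found.clone();
                }
                let fetched = Arc::new(fetch_metadata(id, Some(v)).await);
                self.by_version
                    .write()
                    .await
                    .insert((id.clone(), v), fetched.clone());
                fetched
            }
            // "Latest version": not cached, because the executor cannot know
            // whether the component was updated in the meantime.
            None => Arc::new(fetch_metadata(id, None).await),
        }
    }
}

async fn fetch_metadata(_id: &ComponentId, _version: Option<u64>) -> ComponentMetadata {
    unimplemented!("query the component service")
}
```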
Some benchmarked binaries were debug
The non-Golem HTTP server implementation was compiled in debug mode, which caused it to run significantly slower than the Golem version of the same code, which was compiled in release mode.
Also, the components for the RPC test case were compiled in debug mode.
This is all fixed now and everything measures release builds.
RPC benchmark was only running a single worker pair
Somehow the RPC benchmark missed the proper setup and always ran only a single pair of workers, no matter what parameters it received. This made it impossible to differentiate between local and remote RPC calls, for example. Fixed.
Native throughput benchmark using blocking tasks
Although it does not really change the results, the native throughput benchmark server now properly uses `spawn_blocking` around the CPU-intensive request handlers.
Benchmark code redundancy
All the benchmarks were full of copy-pasted code, which is now gone.
Simplified code in worker-service
The worker service's `service/worker/default.rs`, which got rewritten a bit because of the gRPC client changes, had a lot of unnecessary `clone()`s and `async {}` blocks, which I cleaned up.
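An illustration of the kind of cleanup meant here (stand-in types, not the actual worker-service code): redundant clones and async blocks wrapping a single await can simply be dropped:

```rust
// Stand-in client type, just to show the shape of the cleanup.
#[derive(Clone)]
struct Client;

impl Client {
    async fn get_worker(&self, _id: String) -> Result<String, ()> {
        Ok("worker".to_string())
    }
}

// Before: an unnecessary clone and an async block around a single await.
async fn lookup_before(client: &Client, id: String) -> Result<String, ()> {
    let client = client.clone();
    async move { client.get_worker(id).await }.await
}

// After: call and await directly.
async fn lookup_after(client: &Client, id: String) -> Result<String, ()> {
    client.get_worker(id).await
}
```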
Primary and secondary benchmark keys
Measuring latency per worker can be useful for debugging benchmark results, but it is just noise for an overall view of how each benchmark performs, so I introduced a way to distinguish primary and secondary benchmark results, and a CLI option to show only the primary ones.
Benchmark on CI
The benchmarks on CI were using the default settings for size, iterations, cluster size, etc., which was incorrect (those numbers were never intended to be applied to every benchmark; they were just an example).
Now the CI job uses the benchmark runner's test matrix feature to run multiple combinations, and it also uses the above-mentioned "primary only" mode.