Background

Asyncio-based load generation mode was introduced into Rally/OSB as part of this proposal: An actorless load generator #852. It allows managing multiple clients per process based on async event loops. Previously, Rally could create only one client per process, which imposed a hard limit on the number of supported clients; with too many clients, the whole system would simply hang.
Rally/OSB ultimately did not change the actor model to an actorless one, but the following two async features were implemented as part of this issue:
- Add an asyncio-based load generator (#916, #935): increases the number of clients on a single-core instance
- Allow to use significantly more clients (#944): increases the number of clients on a multi-core instance
Problem Statement
OSB determines the number of workers based on the number of available cores and doesn't support auto-scaling workers during testing:

- opensearch-benchmark/osbenchmark/worker_coordinator/worker_coordinator.py, lines 509 to 511 (in e73664a)
- opensearch-benchmark/osbenchmark/worker_coordinator/worker_coordinator.py, line 471 (in e73664a)
As shown above, the number of available cores can be overridden in the system config.
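For illustration, a minimal sketch of that override; the config file location (`~/.benchmark/benchmark.ini`) and the `[system]` section are assumptions based on where worker_coordinator.py reads the value:

```ini
# Assumed location: ~/.benchmark/benchmark.ini
# Overstating the core count forces OSB to spawn more worker processes
# than there are physical vCPUs.
[system]
available.cores = 1000
```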
I tested on a 96-core instance (r5.24xlarge):

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Networking Performance (Gbps) | EBS Bandwidth (Mbps) |
|---------------|------|--------------|------------------------|-------------------------------|----------------------|
| r5.24xlarge   | 96   | 768          | EBS-Only               | 25                            | 19,000               |
After overriding this value, I observed almost a 10x throughput increase for search operations (I believe this applies to indexing operations as well):

- available.cores: 96 cores (default), target-throughput: 100, 200, 300
- available.cores: 1000 cores, target-throughput: 500, 1000, 1500, 2000, 2500, 3000
Proposed Solution

OSB should be able to dynamically increase the number of workers to create enough clients that the target-throughput in the test procedure can be achieved.
Alternative Solution
Provide an option in --workload-params to specify how many workers should be created per core, e.g., an integer value workers-per-core.

This is a short-term solution that helps operators better leverage asyncio-based load generation to increase throughput. Instead of manually changing the system config for each test run, a workload parameter gives more flexibility until the long-term solution is implemented and released.
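A hypothetical invocation under this proposal; the workload name and hosts are stand-ins, and workers-per-core is the proposed parameter, not an existing one (only --workload-params exists today):

```sh
# Hypothetical: workers-per-core does not exist yet. On a 96-vCPU
# r5.24xlarge, this would request 96 * 10 = 960 worker processes.
opensearch-benchmark execute-test \
  --workload=nyc_taxis \
  --target-hosts=https://my-cluster.example.com:9200 \
  --workload-params="workers-per-core:10"
```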
Additional context
How many clients do we need to achieve a certain target throughput?
The number of clients needed to achieve a certain target throughput, given a single request latency, depends on various factors like network conditions, server capacity, and application specifics. To calculate this, we can use Little's Law, which states:
Throughput = Number of Requests / Response Time
Assuming we want to achieve a target throughput X and have a single-request latency of Y, rearranging the formula gives:

Number of Requests = X × Y
As OSB runs each request synchronously for metric collection:
Number of Clients = Number of Requests
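As a worked example of this sizing rule, here is a minimal sketch; the numbers are illustrative, not measurements from this issue:

```python
import math

def clients_needed(target_throughput: float, latency_s: float) -> int:
    """Little's Law: clients = target throughput (ops/s) * per-request latency (s).

    Each client issues one request at a time, so the number of in-flight
    requests equals the number of clients.
    """
    return math.ceil(target_throughput * latency_s)

# For example, sustaining 3000 ops/s at 50 ms per request needs:
print(clients_needed(3000, 0.050))  # 150 clients
```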
How does expanding the number of workers lead to a boost in throughput?
If we start more processes (workers) than there are available vCPUs, some processes must share a vCPU, which means the scheduler periodically switches each one out in a context switch. If the processes are CPU bound, having more processes than vCPUs will slow down overall task processing.

However, if a process is not CPU bound but, e.g., disk bound (waiting on disk reads/writes) or network bound (waiting for another server to answer), the scheduler switches it out to make room for another process, since it has to wait anyway. This aligns with our situation, where the requests are sent to OpenSearch (Serverless) and most of the CPU time is spent waiting.
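To make that worker model concrete, here is a minimal sketch (not OSB's actual implementation): each worker process runs its own asyncio event loop multiplexing several I/O-bound clients, so oversubscribing the vCPUs can still raise throughput when the work is mostly waiting:

```python
import asyncio
import multiprocessing

CLIENTS_PER_WORKER = 8
REQUESTS_PER_CLIENT = 5

async def client(worker_id: int, client_id: int) -> None:
    # Stand-in for an OpenSearch request: each "request" is pure waiting,
    # during which the OS can schedule another worker process on this vCPU.
    for _ in range(REQUESTS_PER_CLIENT):
        await asyncio.sleep(0.05)

async def run_worker_clients(worker_id: int) -> None:
    # One event loop per worker process, multiplexing many clients.
    await asyncio.gather(*(client(worker_id, c) for c in range(CLIENTS_PER_WORKER)))

def worker(worker_id: int) -> None:
    asyncio.run(run_worker_clients(worker_id))

if __name__ == "__main__":
    # Deliberately start more workers than cores; harmless here because the
    # workload is network bound, not CPU bound.
    num_workers = multiprocessing.cpu_count() * 4
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```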
This is useful data. OSB runs in async mode, so the limitation is not related to synchronous operation. However, the asyncio implementation apparently does not extract the best performance from a large number of clients, and operating-system context switching is necessary to get the best out of each core.
The available.cores parameter is a misnomer; renaming it workers-per-core would be more appropriate. With additional data and instance-type information, it may be possible to infer appropriate values for such a parameter.
Worked with @qiaoxux briefly last week to recreate the setup and try to reproduce the limitations noted in this issue. However, I was able to achieve the expected target throughput with the same EC2 instance type (r5.24xlarge).

To better reproduce the issue, we will work closely together to ensure we have identical setups. As a side note, our recent tests have shown that we can achieve up to 15,000 ops/s with the same term query when running the workload on an m5d.8xlarge EC2 instance against a 20-node cluster with c5.4xlarge data nodes.