Background

Asyncio-based load generation mode was introduced into Rally/OSB as part of this proposal: An actorless load generator #852. It allows managing multiple clients per process based on async event loops. Previously, Rally could create only one client per process, which imposed a hard limit on the number of supported clients; with too many clients, the whole system would simply hang.
Rally/OSB ultimately did not change the actor model to an actorless one, but the following two async features were implemented as part of this issue:
- Add an asyncio-based load generator (#916, #935): increases the number of clients on a single-core instance
- Allow to use significantly more clients (#944): increases the number of clients on a multi-core instance
Problem Statement
OSB determines the number of workers based on the number of available cores and doesn't support auto-scaling workers during testing:

- opensearch-benchmark/osbenchmark/worker_coordinator/worker_coordinator.py, lines 509 to 511 (in e73664a)
- opensearch-benchmark/osbenchmark/worker_coordinator/worker_coordinator.py, line 471 (in e73664a)
As shown above, the number of available cores can be overridden in the system config.
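For illustration, a minimal sketch of that override; the config file location (`~/.benchmark/benchmark.ini`) and the `[system]` section are assumptions based on where worker_coordinator.py reads the value:

```ini
# Assumed location: ~/.benchmark/benchmark.ini
# Overstating the core count forces OSB to spawn more worker processes
# than there are physical vCPUs.
[system]
available.cores = 1000
```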
I tested on a 96-core instance (r5.24xlarge):

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Networking Performance (Gbps) | EBS Bandwidth (Mbps) |
|---------------|------|--------------|------------------------|-------------------------------|----------------------|
| r5.24xlarge   | 96   | 768          | EBS-Only               | 25                            | 19,000               |
After overriding this value, I observed almost a 10x throughput increase for search operations (I believe this applies to indexing operations as well):

- available.cores: 96 cores (default), target-throughput: 100, 200, 300
- available.cores: 1000 cores, target-throughput: 500, 1000, 1500, 2000, 2500, 3000
Proposed Solution

OSB should be able to dynamically increase the number of workers to create enough clients that the target-throughput in the test procedure can be achieved.
Alternative Solution
Provide an option in --workload-params to specify how many workers should be created per core, e.g., an integer value workers-per-core.

This is a short-term solution that helps operators better leverage asyncio-based load generation to increase throughput. Instead of manually changing the system config for each test run, a workload parameter gives more flexibility until the long-term solution is implemented and released.
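A hypothetical invocation under this proposal; the workload name and hosts are stand-ins, and workers-per-core is the proposed parameter, not an existing one (only --workload-params exists today):

```sh
# Hypothetical: workers-per-core does not exist yet. On a 96-vCPU
# r5.24xlarge, this would request 96 * 10 = 960 worker processes.
opensearch-benchmark execute-test \
  --workload=nyc_taxis \
  --target-hosts=https://my-cluster.example.com:9200 \
  --workload-params="workers-per-core:10"
```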
Additional context
How many clients do we need to achieve a certain target throughput?
The number of clients needed to achieve a certain target throughput, given a single request latency, depends on various factors like network conditions, server capacity, and application specifics. To calculate this, we can use Little's Law, which states:
Throughput = Number of Requests / Response Time
Assuming we want to achieve a target throughput X and have a single-request latency of Y, rearranging the formula gives:

Number of Requests = X × Y
As OSB runs each request synchronously for metric collection:
Number of Clients = Number of Requests
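As a worked example of this sizing rule, here is a minimal sketch; the numbers are illustrative, not measurements from this issue:

```python
import math

def clients_needed(target_throughput: float, latency_s: float) -> int:
    """Little's Law: clients = target throughput (ops/s) * per-request latency (s).

    Each client issues one request at a time, so the number of in-flight
    requests equals the number of clients.
    """
    return math.ceil(target_throughput * latency_s)

# For example, sustaining 3000 ops/s at 50 ms per request needs:
print(clients_needed(3000, 0.050))  # 150 clients
```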
How does expanding the number of workers lead to a boost in throughput?
If we start more processes (workers) than there are available vCPUs, some processes must share a vCPU, which means the scheduler periodically switches each one out in a context switch. If the processes are CPU bound, having more processes than vCPUs will slow down overall task processing.

However, if a process is not CPU bound but, e.g., disk bound (waiting on disk reads/writes) or network bound (waiting for another server to answer), the scheduler switches it out to make room for another process, since it has to wait anyway. This aligns with our situation, where the requests are sent to OpenSearch (Serverless) and most of the CPU time is spent waiting.
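To make that worker model concrete, here is a minimal sketch (not OSB's actual implementation): each worker process runs its own asyncio event loop multiplexing several I/O-bound clients, so oversubscribing the vCPUs can still raise throughput when the work is mostly waiting:

```python
import asyncio
import multiprocessing

CLIENTS_PER_WORKER = 8
REQUESTS_PER_CLIENT = 5

async def client(worker_id: int, client_id: int) -> None:
    # Stand-in for an OpenSearch request: each "request" is pure waiting,
    # during which the OS can schedule another worker process on this vCPU.
    for _ in range(REQUESTS_PER_CLIENT):
        await asyncio.sleep(0.05)

async def run_worker_clients(worker_id: int) -> None:
    # One event loop per worker process, multiplexing many clients.
    await asyncio.gather(*(client(worker_id, c) for c in range(CLIENTS_PER_WORKER)))

def worker(worker_id: int) -> None:
    asyncio.run(run_worker_clients(worker_id))

if __name__ == "__main__":
    # Deliberately start more workers than cores; harmless here because the
    # workload is network bound, not CPU bound.
    num_workers = multiprocessing.cpu_count() * 4
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```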
This is useful data. OSB runs in async mode, so the limitation is not related to synchronous operation. However, the asyncio implementation apparently does not extract the best performance from a large number of clients, and operating-system context switching is necessary to get the best out of each core.
The available.cores parameter is a misnomer; renaming it workers-per-core would be more appropriate. With additional data and instance-type information, it may be possible to infer appropriate values for such a parameter.
Worked with @qiaoxux briefly last week to recreate the setup and try to reproduce the limitations noted in this issue. However, I was able to achieve the expected target throughput with the same EC2 instance type (r5.24xlarge).

To better reproduce the issue, we will work closely together to ensure we have identical setups. As a side note, our recent tests have shown that we can achieve up to 15,000 ops/s with the same term query when running the workload on an m5d.8xlarge EC2 instance against a 20-node cluster with c5.4xlarge data nodes.