
[FEATURE] Ability to dynamically spin up more workers to increase benchmarking throughput for indexing and search #417

Open
qiaoxux opened this issue Nov 20, 2023 · 2 comments


qiaoxux commented Nov 20, 2023

Background

Asyncio-based load generation mode was introduced into Rally/OSB as part of this proposal: An actorless load generator #852, which allows managing multiple clients per process via async event loops. Previously, Rally could only create one client per process, which put a hard limit on the upper bound of supported clients and caused the whole system to simply hang when there were too many clients.

Rally/OSB ultimately did not switch from the actor model to an actorless one, but the following two async features were implemented as part of that issue:

  1. Add an asyncio-based load generator (#916, #935) - increases the number of clients on a single-core instance
  2. Allow use of significantly more clients (#944) - increases the number of clients on a multi-core instance

Problem Statement

OSB determines the number of workers based on the number of available cores and doesn't support auto-scaling workers during a test:

def num_cores(cfg):
    return int(cfg.opts("system", "available.cores", mandatory=False,
                        default_value=multiprocessing.cpu_count()))

self.children = [self._create_task_executor() for _ in range(num_cores(self.cfg))]

As shown above, the number of available cores can be overridden in the system config.
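For illustration, an override might look like this, assuming OSB's INI-style configuration file (the `[system]` section and `available.cores` key come from the `cfg.opts` call above; the file path is an assumption):

```ini
; ~/.benchmark/benchmark.ini (path is an assumption)
[system]
; Pretend far more cores are available than the hardware provides,
; so OSB spawns correspondingly more workers.
available.cores = 1000
```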

I tested on a 96-core instance (r5.24xlarge)

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Networking Performance (Gbps) | EBS Bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- |
| r5.24xlarge | 96 | 768 | EBS-Only | 25 | 19,000 |

After overriding this value, I observed an almost 10x throughput increase for search operations (I believe this applies to indexing operations as well):

  • available.cores: 96 cores (default). target-throughput: 100, 200, 300
    [screenshot: throughput results, captured 2023-11-09]

  • available.cores: 1000 cores. target-throughput: 500, 1000, 1500, 2000, 2500, 3000
    [screenshot: throughput results, captured 2023-11-15]

Proposed Solution

OSB should be able to dynamically increase the number of workers to create enough clients for the target-throughput in the test procedure to be achieved.

Alternative Solution

Provide an option in --workload-params to specify how many workers should be created per core, e.g. an integer value workers-per-core.

This is a short-term solution that helps operators better leverage asyncio-based load generation to increase throughput. Instead of manually changing the system config for each test run, a workload parameter offers more flexibility until the long-term solution is implemented and released.
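A minimal sketch of how such a parameter might behave (hypothetical: `num_workers`, the dict-style `cfg`, and the `workers-per-core` key are illustrative, not actual OSB code):

```python
import multiprocessing

def num_workers(cfg):
    # Hypothetical sketch: "workers-per-core" is the parameter name from
    # this proposal, not an existing OSB workload param.
    cores = multiprocessing.cpu_count()
    # A default of 1 preserves today's behavior (one worker per core).
    workers_per_core = int(cfg.get("workers-per-core", 1))
    return cores * workers_per_core
```

With the default, this matches the current core-count behavior; setting `workers-per-core` to, say, 4 would quadruple the worker count without touching the system config.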

Additional context

How many clients do we need to achieve a certain target throughput?

The number of clients needed to achieve a certain target throughput, given the latency of a single request, depends on various factors such as network conditions, server capacity, and application specifics. To estimate it, we can use Little's Law, which (rearranged) states:

Throughput = Number of Requests / Response Time

Here "Number of Requests" means the number of requests in flight at any given moment.


Assuming we want to achieve a target throughput X and have a single-request latency of Y, rearranging the formula gives:

Number of Requests = X × Y

As OSB runs each client's requests synchronously for metric collection (one in-flight request per client):

Number of Clients = Number of Requests
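As a quick sanity check, a minimal sketch of this calculation (the function name and units are my own, not OSB API):

```python
import math

def clients_needed(target_throughput, mean_latency_s):
    # Little's Law: in-flight requests = throughput x latency.
    # With one synchronous in-flight request per OSB client,
    # this is also the number of clients required.
    return math.ceil(target_throughput * mean_latency_s)

# e.g. a 3000 ops/s target at 50 ms mean latency needs 150 clients
print(clients_needed(3000, 0.050))  # -> 150
```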

How does expanding the number of workers lead to a boost in throughput?
If we start more processes (workers) than there are available vCPUs, some processes have to share a vCPU, which means a process is periodically switched out by the scheduler. If the processes are CPU-bound, having more processes than vCPUs slows down overall task processing.

However, if the processes are not CPU-bound but, for example, disk-bound (waiting on disk reads/writes) or network-bound (waiting for another server to answer), the scheduler switches them out to make room for other processes, since they need to wait anyway. This aligns with our situation, where requests are sent to OpenSearch (serverless) and most of the CPU time is spent waiting.
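A self-contained toy sketch of why waiting-dominated work scales this way (illustrative code, not OSB's implementation): many I/O-bound "requests" overlap on a single event loop, so total wall-clock time stays close to a single request's latency.

```python
import asyncio
import time

async def fake_request(latency_s):
    # Simulates an I/O-bound request: while "waiting", the event loop
    # (and the CPU) is free to service other clients.
    await asyncio.sleep(latency_s)

async def run_clients(n_clients, latency_s):
    await asyncio.gather(*(fake_request(latency_s) for _ in range(n_clients)))

start = time.perf_counter()
asyncio.run(run_clients(100, 0.05))
elapsed = time.perf_counter() - start
# 100 overlapping 50 ms waits complete in roughly 0.05 s, not 5 s,
# because almost no CPU time is consumed while waiting.
print(f"elapsed: {elapsed:.2f}s")
```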

@qiaoxux qiaoxux added the enhancement New feature or request label Nov 20, 2023
@gkamat gkamat removed the untriaged label Nov 21, 2023

gkamat commented Nov 21, 2023

This is useful data. OSB runs in async mode, and the limitation is not related to synchronous operation. However, the asyncio implementation apparently does not extract the best performance from a large number of clients, and operating-system-level context switching is necessary to get the best out of each core.

The available.cores parameter is a misnomer. Renaming it workers-per-core would be more appropriate. With additional data and instance type information, it may be possible to infer appropriate values for such a parameter.


IanHoang commented Oct 2, 2024

Worked with @qiaoxux briefly last week to recreate the setup and try to reproduce the limitations noted in this issue. However, I was able to achieve the expected target throughput with the same EC2 instance (r5.24xlarge).
[screenshot: throughput results]

To better reproduce the issue, we will work closely together to ensure we have identical setups. As a side note, our recent tests have shown that we are able to achieve up to 15,000 ops/s with the same term query when running the workload with a m5d.8xlarge EC2 instance against a 20 node cluster with c5.4xlarge data nodes.
