Is your feature request related to a problem? Please describe
Finding the maximum search throughput for an OpenSearch cluster is an important benchmarking scenario in vector search. Currently the only way to figure out the maximum search throughput is to perform multiple benchmark runs with different search client settings (e.g. search_clients = 3, search_clients = 5, ...). Waiting for a run to conclude, changing the config, and rerunning OSB with the associated startup time is tedious. Ideally the maximum throughput could be found automatically.
Describe the solution you'd like
Users can provide a list search_clients_list that is passed to the search operation. A task is generated for each client setting in the list, and these tasks are executed sequentially in the typical fashion.
For instance, a user might specify "search_clients_list": [1, 5, 10, 12] in their parameters.
Then OSB will schedule search tasks with 1, 5, 10, and 12 clients, and the final benchmark results will report each of these search tasks separately, so the throughput at each client count can be compared from a single run.
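As a concrete illustration, the workload parameters might look something like the sketch below. This assumes the parameter-file style of the existing vectorsearch workload; every field except search_clients_list is illustrative and may not match the actual workload.

```json
{
  "target_index_name": "target_index",
  "target_field_name": "target_field",
  "query_k": 100,
  "query_data_set_format": "hdf5",
  "query_data_set_path": "/path/to/queries.hdf5",
  "search_clients_list": [1, 5, 10, 12]
}
```

With this single parameters file, OSB would run the same search operation four times, once per entry in the list.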
A more permanent solution will use the findings from the Scaling Investigation. A better user experience might be specifying a min_number_clients and max_number_clients, or a find_maximum_throughput option. From my initial investigation, these options are challenging in OSB since the schedule is generated when the settings are loaded (loader.py) and throughput is only calculated at the end of task runs. It's possible scheduling could occur after a join in worker_coordinator.py, but I haven't looked into this thoroughly.
However, even this temporary fix will help cut operation time and make benchmarking more intuitive.
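Within that constraint, the temporary fix is roughly equivalent to expanding the single search task into a fixed sequence of tasks at workload load time. Conceptually, "search_clients_list": [1, 5, 10, 12] would produce the same schedule a user could write by hand today; the sketch below is illustrative only (operation name, iteration counts, and exact keys are assumptions, not the proposed implementation):

```json
"schedule": [
  { "operation": "prod-queries", "warmup-iterations": 100, "iterations": 1000, "clients": 1 },
  { "operation": "prod-queries", "warmup-iterations": 100, "iterations": 1000, "clients": 5 },
  { "operation": "prod-queries", "warmup-iterations": 100, "iterations": 1000, "clients": 10 },
  { "operation": "prod-queries", "warmup-iterations": 100, "iterations": 1000, "clients": 12 }
]
```

The feature would generate this expansion automatically from the single list parameter, avoiding the need to maintain near-duplicate task definitions.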
Describe alternatives you've considered
There is a concept of a parallel schedule in OSB. Theoretically, we could write a parallel schedule containing a list of vector search tasks, each with a different number of clients, and compare the resulting throughputs. However, we would like to isolate as many variables as possible, and having multiple sets of clients query the same cluster at the same time might produce different throughput levels than running each client scenario sequentially.
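For reference, such a parallel schedule might look roughly like the sketch below; the nesting is assumed from the parallel-task concept mentioned above, and the task names, operation name, and exact keys are illustrative rather than taken from an actual workload.

```json
"schedule": [
  {
    "parallel": {
      "tasks": [
        { "name": "search-1-client", "operation": "prod-queries", "clients": 1 },
        { "name": "search-5-clients", "operation": "prod-queries", "clients": 5 }
      ]
    }
  }
]
```

Because both tasks would query the same cluster concurrently, their throughput numbers would not be directly comparable to sequential runs, which is why this alternative was not chosen.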
Additional context
Related issues:
#373
#505
#555
OpenSearch Clients #27