[RFC] Adding capabilities to introduce randomness in workload queries #443

sgup432 · 2024-01-22T18:07:17Z

Is your feature request related to a problem? Please describe.

I originally posted this proposal in the OSB workloads repository (opensearch-project/opensearch-benchmark-workloads#152) and am moving the conversation here, as suggested.

As of now, OpenSearch workloads mostly use static queries, i.e. the values are fixed. For example, nyc_taxis (the recommended search workload) has a set of static queries that we use for benchmarking. This doesn't help in simulating real-world scenarios, where data distribution and access patterns are not predictable. There is also no way to use nyc_taxis to run tests with caches (like the request cache) turned on.

With features like tiered caching coming up (opensearch-project/OpenSearch#10024), we need a way to generate workloads with dynamic queries (randomized values) so that we can also test with caches turned on. For now, we wrote some custom logic on top of nyc_taxis to add randomization.

Describe the solution you'd like

I recommend adding a parameter that introduces randomization into query values (in nyc_taxis, for example). This would be useful:

To benchmark with caches turned on. With this parameter we can generate the desired cache hits/misses.
To add some randomization to the workload so that it is not very predictable.
The parameter could let users generate X repeated queries, with the rest being unique, where X >= 0.

This parameter could be enabled on demand. I believe it would help other workloads too, not just nyc_taxis.
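
To make the "X repeated queries, rest unique" idea concrete, here is a minimal sketch. It is only an illustration: `fresh_range`, `build_ranges`, and the value bounds are hypothetical and not part of OSB or nyc_taxis.

```python
import random


def fresh_range(rng):
    """Hypothetical value generator: a random (gte, lte) pair for a numeric field."""
    low = rng.uniform(0.0, 50.0)
    return (low, low + rng.uniform(1.0, 20.0))


def build_ranges(total, repeated, rng):
    """Return `total` (gte, lte) pairs; `repeated` of them reuse one saved pair.

    Reused pairs can hit a request cache after their first occurrence, while
    freshly generated pairs are expected to miss.
    """
    if not 0 <= repeated <= total:
        raise ValueError("repeated must be between 0 and total")
    saved = fresh_range(rng)
    ranges = [saved] * repeated
    ranges += [fresh_range(rng) for _ in range(total - repeated)]
    rng.shuffle(ranges)  # interleave repeated and unique queries
    return ranges


rng = random.Random(42)  # seeded so a run is reproducible
ranges = build_ranges(total=10, repeated=4, rng=rng)
print(len(set(ranges)), "distinct ranges out of", len(ranges))
```

Raising or lowering `repeated` relative to `total` is what would shift the expected ratio of cache hits to misses.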

peteralfonsi · 2024-01-22T20:54:10Z

I have a working version of this for nyc_taxis only at opensearch-project/opensearch-benchmark-workloads#154. We define new param-source functions in workload.py, which pull randomized values from a common function (get_values). Then we define operations using these param-sources in operations/default.json and add them to the schedule in test_procedures/default.json. We also define "value generator" functions to provide reasonable values depending on what workload/field is being searched.
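
A rough sketch of that shape is below. It is not the code from the PR: the function names, the `total_amount` field, the value bounds, and the `rng` keyword are assumptions made for illustration, and the `(workload, params, **kwargs)` signature follows the function-style param source convention as I understand it.

```python
import random


def total_amount_values(rng):
    """Hypothetical "value generator": a plausible gte/lte pair for a fare field."""
    low = round(rng.uniform(0.0, 50.0), 2)
    return {"gte": low, "lte": round(low + rng.uniform(1.0, 20.0), 2)}


# Assumed mapping from field name to its value generator.
VALUE_GENERATORS = {"total_amount": total_amount_values}


def get_values(field_name, rng):
    """Common entry point: look up the generator for a field and call it."""
    return VALUE_GENERATORS[field_name](rng)


def random_total_amount_range(workload, params, **kwargs):
    """Param source sketch: search parameters with a randomized range clause."""
    rng = kwargs.get("rng", random)  # `rng` is an illustrative extra, for testing
    return {
        "index": params.get("index", "nyc_taxis"),
        "body": {"query": {"range": {"total_amount": get_values("total_amount", rng)}}},
    }


print(random_total_amount_range(None, {}, rng=random.Random(0)))
```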

Doing this generically for all workloads would be messy, though, since we would have to move all the actual query definitions to workload.py and change operations/default.json and test_procedures/default.json for each workload.

Instead, what if we added a new WorkloadProcessor in opensearch-benchmark's loader.py? The workloads in opensearch-benchmark-workloads could stay the same, with one fixed query per operation. There could be some new parameter, like enable-randomization. If it was true, the processor could iterate through the tasks in the workload, and change the param_source in task.operation. The new param_source would be like the original, but query_body would be drawn from something like the get_values function from before. This would probably require changes to the TestProcedure and Task classes to allow changing these fields. We would still have to define the "value generator" functions for each workload but I don't think there's a way around that.
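
The idea could look roughly like the sketch below. `Operation`, `Task`, `QueryRandomizer`, and `process` are hypothetical stand-ins written for this example, not OSB's real classes or hooks, and only a top-level range query is handled.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Operation:
    """Hypothetical stand-in for an operation; not the real OSB class."""
    name: str
    params: dict
    param_source: object = None


@dataclass
class Task:
    """Hypothetical stand-in for a task that wraps one operation."""
    operation: Operation


def find_range_field(body):
    """Return the field name of a top-level range query, or None."""
    range_clause = body.get("query", {}).get("range")
    return next(iter(range_clause), None) if range_clause else None


class QueryRandomizer:
    """Sketch of a processor that swaps in a randomizing param source."""

    def __init__(self, value_generators, enabled, rng=None):
        self.value_generators = value_generators  # field name -> generator function
        self.enabled = enabled  # mirrors a flag like enable-randomization
        self.rng = rng or random.Random()

    def process(self, tasks):
        if not self.enabled:
            return tasks
        for task in tasks:
            op = task.operation
            field_name = find_range_field(op.params.get("body", {}))
            if field_name in self.value_generators:
                op.param_source = self._make_source(op.params, field_name)
        return tasks

    def _make_source(self, original_params, field_name):
        def source():
            params = copy.deepcopy(original_params)  # leave the fixed query intact
            new_values = self.value_generators[field_name](self.rng)
            params["body"]["query"]["range"][field_name] = new_values
            return params
        return source
```

Operations whose field has no registered generator keep their original, fixed query, so the workload files themselves would not need to change.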

I'm not very familiar with the codebase so I'm not sure if this idea would work. Please let me know what you think!

peteralfonsi · 2024-01-24T20:15:33Z

I've implemented the WorkloadProcessor and ParamSource part of this and it seems fairly straightforward. The only issue is passing in the "standard value" functions, which are just a mapping from (index's field name) -> (Python function that gives reasonable gte/lte pairs for that field).

Right now I'm thinking it will look for some standard_values.py file in the workload's mapping_dir, and this will be picked up by the WorkloadFileReader and added to the workload itself, so it'll be available to the WorkloadProcessor. If it doesn't find this file, randomization can't be used for that workload. Alternatively we could extend the register() function in workload.py to allow registering standard value functions, and define them within workload.py itself.

(Edit: it seems like adding a new registry.register_standard_value_source() function in workload.py's register() was the simplest way, let me know if this is bad for some reason)
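
For illustration, registering a standard value function from a workload's register() might look like the sketch below. `SketchRegistry` is a hypothetical stand-in for the registry object, and the `(field_name, source)` signature is assumed from the description above rather than taken from the actual implementation.

```python
import random


class SketchRegistry:
    """Hypothetical stand-in for the registry passed to register()."""

    def __init__(self):
        self.standard_value_sources = {}

    def register_standard_value_source(self, field_name, source):
        # Assumed signature: field name -> function returning a gte/lte pair.
        self.standard_value_sources[field_name] = source


def random_total_amount():
    """Hypothetical standard value function for a fare-like field."""
    low = round(random.uniform(0.0, 50.0), 2)
    return {"gte": low, "lte": round(low + random.uniform(1.0, 20.0), 2)}


def register(registry):
    """What a workload.py might add alongside its other registrations."""
    registry.register_standard_value_source("total_amount", random_total_amount)


registry = SketchRegistry()
register(registry)
# An empty mapping would mean randomization can't be used for this workload.
print(sorted(registry.standard_value_sources))
```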

@gkamat @IanHoang does something like this seem reasonable? I don't fully understand this part of the code so maybe my approach is totally off.

sgup432 added the enhancement New feature or request label Jan 22, 2024

github-actions bot added the untriaged label Jan 22, 2024

gkamat removed the untriaged label Jan 26, 2024

This was referenced Feb 1, 2024

Test PR for github actions peteralfonsi/opensearch-benchmark#5

Open

Adds controllable randomization to range queries in workloads #455

Merged

IanHoang closed this as completed in #455 Feb 7, 2024

sgup432 added this to Performance Roadmap Feb 8, 2024

sgup432 moved this to Done in Performance Roadmap Feb 8, 2024

IanHoang mentioned this issue Oct 29, 2024

[PROPOSAL] Adding capabilities to introduce randomness in workload queries opensearch-project/opensearch-benchmark-workloads#152

Closed
