[RFC] Adding capabilities to introduce randomness in workload queries #443

Closed
sgup432 opened this issue Jan 22, 2024 · 2 comments · Fixed by #455 · May be fixed by peteralfonsi/opensearch-benchmark#5
Labels
enhancement New feature or request

Comments


sgup432 commented Jan 22, 2024

Is your feature request related to a problem? Please describe.

I originally posted this proposal in the OSB workloads repository (opensearch-project/opensearch-benchmark-workloads#152). I'm moving the conversation here as suggested.

As of now, OpenSearch workloads mostly deal with static queries, i.e. the query values are fixed. For example, in nyc_taxis (the recommended search workload), we have a set of static queries that we use for benchmarking. This doesn't help in simulating real-world scenarios, where data distribution and access patterns are not predictable. In addition, there is no way to use nyc_taxis to run tests with caches (like the request cache) turned on.

But considering we have features like Tiered Caching coming up (opensearch-project/OpenSearch#10024), we need the ability to generate workloads with dynamic queries (randomized values) so that we can test with caches turned on as well. We wrote some custom logic on top of nyc_taxis to add this randomization.

Describe the solution you'd like

I recommend adding a parameter that introduces randomization into query values (in nyc_taxis, for example). This would be useful:

To benchmark with caches turned on. Via this parameter we can generate the desired cache hits/misses.
To add some randomization to the workload so that it is not very predictable.
We can have a parameter through which users generate X repeated queries, with the rest being unique, where X >= 0.

This parameter can be enabled on demand. I believe it would also help other workloads, not just nyc_taxis.
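
To make the "X repeated queries, rest unique" idea concrete, here is a minimal sketch in plain Python. It is not OSB code: `build_query_values`, `fresh_range`, `to_range_query`, the `pool_size` knob, and the use of a `total_amount` range are all assumptions made for illustration. Repeated queries are drawn from a small reusable pool (candidates for cache hits), and the rest are distinct by construction (guaranteed misses).

```python
import random


def fresh_range(rng, seen):
    """Draw a (gte, lte) pair that has not been produced before."""
    while True:
        gte = round(rng.uniform(0.0, 100.0), 2)
        lte = round(gte + rng.uniform(1.0, 50.0), 2)
        if (gte, lte) not in seen:
            seen.add((gte, lte))
            return gte, lte


def build_query_values(total, repeated, pool_size=5, seed=42):
    """Return `total` (gte, lte) pairs: `repeated` of them are drawn from a
    small reusable pool, and the remaining ones are distinct from all others."""
    if not 0 <= repeated <= total:
        raise ValueError("repeated must be between 0 and total")
    if repeated > 0 and pool_size < 1:
        raise ValueError("pool_size must be at least 1 when repeated > 0")
    rng = random.Random(seed)
    seen = set()
    # The pool holds the values that may be issued more than once.
    pool = [fresh_range(rng, seen) for _ in range(pool_size)]
    values = [rng.choice(pool) for _ in range(repeated)]
    # Everything else is unique, so it can never be served from a cache.
    values += [fresh_range(rng, seen) for _ in range(total - repeated)]
    rng.shuffle(values)
    return values


def to_range_query(field, gte, lte):
    """Wrap a pair of bounds in a range query body (field name is an assumption)."""
    return {"query": {"range": {field: {"gte": gte, "lte": lte}}}}


queries = [
    to_range_query("total_amount", gte, lte)
    for gte, lte in build_query_values(total=20, repeated=8, pool_size=3)
]
```

Note that the first time a pool value is issued it is still a cache miss, so the achieved hit rate is somewhat below `repeated / total`; a smaller pool narrows that gap.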

@sgup432 sgup432 added the enhancement New feature or request label Jan 22, 2024
@peteralfonsi
Contributor

I have a working version of this for nyc_taxis only at opensearch-project/opensearch-benchmark-workloads#154. We define new param-source functions in workload.py, which pull randomized values from a common function (get_values). Then we define operations using these param-sources in operations/default.json and add them to the schedule in test_procedures/default.json. We also define "value generator" functions to provide reasonable values depending on what workload/field is being searched.
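
A rough sketch of what such a param-source function could look like is below. This is not the code from the linked PR: the `get_values` signature, the `VALUE_GENERATORS` mapping, the `random_fare_range` generator, and the `total_amount` field are assumptions for illustration, and the `(workload, params, **kwargs)` signature is assumed to match function-style param sources.

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_fare_range(rng):
    """Hypothetical "value generator": plausible bounds for a fare-like field."""
    gte = round(rng.uniform(0.0, 50.0), 2)
    lte = round(gte + rng.uniform(1.0, 50.0), 2)
    return {"gte": gte, "lte": lte}


# Assumed mapping from field name to its value generator.
VALUE_GENERATORS = {"total_amount": random_fare_range}


def get_values(field):
    """Common entry point the param sources pull randomized values from."""
    return VALUE_GENERATORS[field](_rng)


def random_range_param_source(workload, params, **kwargs):
    """Build search parameters with a freshly randomized range query."""
    field = params.get("field", "total_amount")
    return {
        "index": params.get("index", "nyc_taxis"),
        "body": {"query": {"range": {field: get_values(field)}}},
    }


# In workload.py this would presumably be hooked up inside register(registry),
# e.g. registry.register_param_source("random-range-source", random_range_param_source);
# the name "random-range-source" is made up for this sketch.
```

An operation in operations/default.json would then name the param source instead of carrying a fixed query body.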

But doing this generically for all workloads would be messy, since we would have to move all the actual query definitions into workload.py and change operations/default.json and test_procedures/default.json for each workload.

Instead, what if we added a new WorkloadProcessor in opensearch-benchmark's loader.py? The workloads in opensearch-benchmark-workloads could stay the same, with one fixed query per operation. There could be a new parameter, such as enable-randomization. If it is true, the processor iterates through the tasks in the workload and changes the param_source in task.operation. The new param_source would be like the original, but its query_body would be drawn from something like the get_values function described above. This would probably require changes to the TestProcedure and Task classes to allow changing these fields. We would still have to define the "value generator" functions for each workload, but I don't think there's a way around that.
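
The processor idea can be sketched as follows. The `Operation`, `Task`, and `Workload` classes here are simplified stand-ins, not the real opensearch-benchmark classes, and `RandomizingProcessor`, its `process` method, and the `value_sources` mapping are hypothetical names. The sketch only handles a top-level range query.

```python
import copy
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

BOUND_KEYS = ("gt", "gte", "lt", "lte")


@dataclass
class Operation:  # stand-in, not the real OSB class
    name: str
    params: Dict[str, Any]
    param_source: Optional[Callable[[], Dict[str, Any]]] = None


@dataclass
class Task:  # stand-in
    operation: Operation


@dataclass
class Workload:  # stand-in
    tasks: List[Task]


class RandomizingProcessor:
    """Swap in a randomizing param source for operations with a range query."""

    def __init__(self, enabled, value_sources, seed=0):
        self.enabled = enabled              # plays the role of enable-randomization
        self.value_sources = value_sources  # field name -> fn(rng) -> bounds dict
        self.rng = random.Random(seed)

    def process(self, workload):
        if not self.enabled:
            return workload
        for task in workload.tasks:
            op = task.operation
            clause = op.params.get("body", {}).get("query", {}).get("range", {})
            fields = [f for f in clause if f in self.value_sources]
            if fields:
                op.param_source = self._make_source(op.params, fields)
        return workload

    def _make_source(self, original_params, fields):
        def source():
            # Copy so the workload's fixed query definition stays untouched.
            params = copy.deepcopy(original_params)
            ranges = params["body"]["query"]["range"]
            for f in fields:
                kept = {k: v for k, v in ranges[f].items() if k not in BOUND_KEYS}
                ranges[f] = {**kept, **self.value_sources[f](self.rng)}
            return params
        return source
```

Operations without a matching field keep their original behaviour, so the workload files themselves would not need to change.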

I'm not very familiar with the codebase so I'm not sure if this idea would work. Please let me know what you think!

@peteralfonsi
Contributor

peteralfonsi commented Jan 24, 2024

I've implemented the WorkloadProcessor and ParamSource parts of this, and it seems fairly straightforward. The only issue is passing in the "standard value" functions, which are just a mapping from (index's field name) -> (Python function that gives reasonable gte/lte pairs for that field).

Right now I'm thinking it will look for some standard_values.py file in the workload's mapping_dir, and this will be picked up by the WorkloadFileReader and added to the workload itself, so it'll be available to the WorkloadProcessor. If it doesn't find this file, randomization can't be used for that workload. Alternatively we could extend the register() function in workload.py to allow registering standard value functions, and define them within workload.py itself.

(Edit: it seems like adding a new registry.register_standard_value_source() function for use in workload.py's register() was the simplest way. Let me know if this is bad for some reason.)
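
For illustration, registering standard value functions from workload.py might look roughly like this. `StubRegistry` is a stand-in for the registry object passed to `register()`, and the `(field_name, source_fn)` argument list is an assumption, since the thread does not spell out the signature. The two field names and their value ranges are also just plausible guesses for nyc_taxis.

```python
import random


class StubRegistry:
    """Stand-in for the registry passed to register(); not the real OSB class."""

    def __init__(self):
        self.standard_value_sources = {}

    def register_standard_value_source(self, field_name, source_fn):
        # Assumed signature: one callable per index field name.
        if not callable(source_fn):
            raise TypeError("source_fn must be callable")
        self.standard_value_sources[field_name] = source_fn


def random_total_amount():
    """Reasonable gte/lte pair for a fare-like field (ranges are guesses)."""
    gte = round(random.uniform(0.0, 50.0), 2)
    lte = round(gte + random.uniform(1.0, 50.0), 2)
    return {"gte": gte, "lte": lte}


def random_trip_distance():
    """Reasonable gte/lte pair for a distance-like field (ranges are guesses)."""
    gte = round(random.uniform(0.0, 10.0), 1)
    lte = round(gte + random.uniform(0.5, 20.0), 1)
    return {"gte": gte, "lte": lte}


def register(registry):
    """What workload.py's register() could add alongside its other registrations."""
    registry.register_standard_value_source("total_amount", random_total_amount)
    registry.register_standard_value_source("trip_distance", random_trip_distance)


registry = StubRegistry()
register(registry)
```

A workload that registers nothing would simply leave the mapping empty, which matches the idea that randomization can't be used for workloads without standard value functions.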

@gkamat @IanHoang does something like this seem reasonable? I don't fully understand this part of the code so maybe my approach is totally off.

Projects
Status: Done
3 participants