[RFC] Adding capabilities to introduce randomness in workload queries #443
Comments
I have a working version of this for nyc_taxis only at opensearch-project/opensearch-benchmark-workloads#154. We define new param-source functions in workload.py, which pull randomized values from a common function. But doing this generically for all workloads would be messy, since we would have to move all the actual query definitions into workload.py and change operations/default.json and test_procedures/default.json for each workload. Instead, what if we added a new WorkloadProcessor in opensearch-benchmark's loader.py? The workloads in opensearch-benchmark-workloads could stay the same, with one fixed query per operation, and there could be some new parameter to opt into the randomization. I'm not very familiar with the codebase, so I'm not sure if this idea would work. Please let me know what you think!
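Roughly, this is the shape I have in mind. It's only a sketch: the class and method names (`WorkloadProcessor`, `on_after_load_workload`, `workload.test_procedures`, `task.operation.params`) are my assumptions about what loader.py exposes, not verified against the actual code.

```python
# Hypothetical sketch of a randomizing workload processor. All names that
# touch loader.py internals are assumptions, not the real API.
class QueryRandomizerWorkloadProcessor:
    """Rewrites fixed range bounds in each operation so repeated runs do not
    issue byte-identical queries."""

    def __init__(self, randomization_enabled=False, standard_value_sources=None):
        self.randomization_enabled = randomization_enabled
        # Mapping of field name -> function returning a plausible gte/lte pair.
        self.standard_value_sources = standard_value_sources or {}

    def on_after_load_workload(self, workload, **kwargs):
        if not self.randomization_enabled:
            return workload
        for test_procedure in workload.test_procedures:
            for task in test_procedure.schedule:
                self._randomize(task.operation.params)
        return workload

    def _randomize(self, params):
        # Replace any range clause on a field we have a standard value source for.
        body = params.get("body", {})
        for field, source in self.standard_value_sources.items():
            range_clause = body.get("query", {}).get("range", {}).get(field)
            if range_clause is not None:
                range_clause.update(source())
```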
I've implemented the WorkloadProcessor and ParamSource parts of this and it seems fairly straightforward. The only issue is passing in the "standard value" functions, which are just a mapping from (index's field name) -> (Python function that gives reasonable gte/lte pairs for that field). Right now I'm thinking it will look for a standard_values.py file in the workload's mapping_dir; this would be picked up by the WorkloadFileReader and added to the workload itself, so it would be available to the WorkloadProcessor. If that file isn't found, randomization can't be used for that workload. Alternatively, we could extend the register() function in workload.py to allow registering standard value functions and define them within workload.py itself. (Edit: adding a new registry.register_standard_value_source() function called from workload.py's register() seems like the simplest way; let me know if this is bad for some reason.) @gkamat @IanHoang does something like this seem reasonable? I don't fully understand this part of the code, so maybe my approach is totally off.
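For illustration, a workload.py could then look something like the snippet below. The register_standard_value_source() hook does not exist yet; its name and its (index, field, function) signature are the proposal being discussed here, and the field name and value ranges are stand-ins.

```python
# Hypothetical workload.py snippet for nyc_taxis using the proposed hook.
import random


def random_total_amount():
    """Return a plausible gte/lte pair for nyc_taxis' total_amount field."""
    gte = random.uniform(0, 40)
    return {"gte": round(gte, 2), "lte": round(gte + random.uniform(1, 20), 2)}


def register(registry):
    # Existing param sources would still be registered here as before.
    registry.register_standard_value_source("nyc_taxis", "total_amount",
                                            random_total_amount)
```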
Is your feature request related to a problem? Please describe.
I had added a proposal in the OSB workloads repository (opensearch-project/opensearch-benchmark-workloads#152); shifting the conversation here as suggested.
As of now, OpenSearch workloads mostly deal with static queries, i.e. the values are fixed. For example, nyc_taxis (a recommended search workload) has a set of static queries that we use to run benchmarks. This doesn't help in simulating real-world scenarios where data distribution and access patterns are not predictable, and there is no way to use nyc_taxis to run tests with caches (like the request cache) turned on. A rough illustration of the difference between a static and a randomized query follows below.
With features like tiered caching coming up (opensearch-project/OpenSearch#10024), we need the ability to generate workloads with dynamic queries (randomized values) so that we can test with caches turned on as well. We wrote some custom logic on top of nyc_taxis to create this randomization.
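Illustrative only: a fixed range query in the style of nyc_taxis next to a randomized variant. The field name and bounds are stand-ins, not copied from the published workload.

```python
import random

# Static query: every repetition sends identical values, so cache behavior
# cannot be controlled.
STATIC_QUERY = {
    "query": {
        "range": {
            "total_amount": {"gte": 5, "lte": 15}
        }
    }
}


def randomized_query():
    # Randomized variant: each invocation produces different bounds.
    gte = random.uniform(0, 50)
    return {
        "query": {
            "range": {
                "total_amount": {"gte": gte, "lte": gte + random.uniform(1, 20)}
            }
        }
    }
```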
Describe the solution you'd like
I recommend adding a parameter through which we can introduce randomization into query values, in nyc_taxis for example. This would be useful:
To perform benchmarks with caches turned on. Via this parameter we can generate the desired ratio of cache hits and misses.
To add some randomization to the workload so that it is not so predictable.
We could have a parameter via which users generate X repeated queries and the rest unique queries, where X >= 0. A minimal sketch of how this repeat/unique split could work is shown below.
This parameter could be enabled on demand if needed. I believe it would help other workloads as well, not just nyc_taxis.
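A minimal sketch of the repeat/unique split such a parameter could control. The knob name (repeated_fraction) and the value-pool approach are illustrative assumptions, not part of any existing OSB option.

```python
import random


class ValueSelector:
    def __init__(self, repeated_fraction=0.5, pool_size=100):
        self.repeated_fraction = repeated_fraction
        self.pool_size = pool_size
        self.pool = []  # previously issued bounds, eligible for reuse

    def next_bounds(self, fresh_value_fn):
        # With probability repeated_fraction, replay an earlier value so the
        # request cache can serve a hit; otherwise generate a new one (a miss).
        if self.pool and random.random() < self.repeated_fraction:
            return random.choice(self.pool)
        value = fresh_value_fn()
        if len(self.pool) < self.pool_size:
            self.pool.append(value)
        return value
```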