-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-Processing Functionality for create-workload #403
Multi-Processing Functionality for create-workload #403
Conversation
client, index, out_path, start_doc, end_doc, total_docs, progress_message_suffix="" | ||
): | ||
""" | ||
Extract documents in the range of start_doc and end_doc and write to induvidual files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: induvidual files
--> individual files
Extract documents in the range of start_doc and end_doc and write to induvidual files | ||
|
||
:param client: OpenSearch client used to extract data | ||
:param index: Name of index to dump |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help newcomers understand better, we can redefine this as:
name of OpenSearch index to extract documents from
Thanks for tackling this @AkshathRaghav! We should still offer users the option to go with the original route and not perform concurrent dumping. Adding a parameter like
Some additional notes:
|
@AkshathRaghav please also elaborate the title so it clear which component of OSB this is addressing. Thanks |
Some benchmarks:
All these methods run in under a second, so instead I've been comparing them with respect to extracting+dumping speed which is overall faster in the concurrent runs. @gkamat I am working on making some parameters more configurable right now! |
@gkamat I've made the requested changes. Currently working on changing unit tests.
|
Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
…e for documents. Needs documentation update, and configurability Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
…ensearch-project#407) Signed-off-by: vivek palakkat <[email protected]> Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
b6ea5ca
to
a4bbb2d
Compare
Signed-off-by: AkshathRaghav <[email protected]>
@gkamat @IanHoang Hi! I think I've done most of what is required for this PR. Please let me know if everything is ok. I tried adding the commit from the conflicting file, but security check still shows it. Not sure what to do on that front. |
Signed-off-by: AkshathRaghav <[email protected]>
The checks aren't running since there are conflicts in |
help="Batch size to use for dumping documents (default: false)", | ||
) | ||
create_workload_parser.add_argument( | ||
"--custom_dump_query", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep it consistent, --custom_dump_query
--> --custom-dump-query
help="Number of threads for dumping documents from indices (default: false)", | ||
) | ||
create_workload_parser.add_argument( | ||
"--bsize", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it's worth changing this to --batch-size
to make it easier to understand on first glance and to keep it consistent with every other parameter that spells out the option. One thing we can do in the future is add an abbreviated option for specific parameters. For example, users can either use --batch-size
or -b
.
"--threads", | ||
type=positive_number, | ||
default=8, | ||
help="Number of threads for dumping documents from indices (default: false)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, we should elaborate the help statement:
Number of threads used to extract documents from indices and dump to workload. --threads parameter should be used with --concurrent flag. By default, --threads is disabled. If no argument is provided to --threads parameter, number of threads defaults to 8.
Description
Adding multi-process functionality that works concurrently.
Issues Resolved
#375
Testing
REQUEST FOR HELP:
Below, is the sample output for the create_workloads with multi-processing. The render_progress is not working as expected, I have added a lock() to not overload and print for every thread thats running, but it doesn't seem to be working. Please help if you can!
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.