Multi-Processing Functionality for create-workload #403

AkshathRaghav
Contributor

Description

Adds multi-processing functionality to create-workload so that documents are extracted concurrently.

Issues Resolved

#375

Testing

  • Tested that create_workloads runs to completion and works properly.

REQUEST FOR HELP:
Below is the sample output of create_workloads with multi-processing. render_progress is not working as expected: I added a lock() so that progress is not printed for every running thread, but it doesn't seem to be working. Please help if you can!

╭─aksha@Akshaths-PC ~/Workbench/opensearch-benchmark ‹AkshathRaghav/multi_process_index_extraction●› 
╰─$ opensearch-benchmark create-workload \
--workload=flights \
--target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights \
--number-of-docs opensearch_dashboards_sample_data_flights:2500 \
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'" \


   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents for index [opensearch_dashboards_sampl...     2/1000 docs [0.2% done]                       Extracting documents for index [opensearch_dashboards_sampl...     3/1000 docs [0.3% done]                       Extracting documents for index 
[opensearch_dashboards_sa...     251/1000 docs [25.1% done]                       Extracting documents for index [opensearch_dashboards_sa...     252/1000 docs [25.2% done]                       Extracting documents for index [opensearch_dashboards_sa...     627/1000 docs [62.7% done]                       Extracting documents for index [opensearch_dashboards_sa...     253/1000 docs [25.3% done]                       Extracting documents for index [opensearch_dashboards_sa...     501/1000 docs [50.1% done]                       Extracting documents for index [opensearch_dashboards_sa...     875/1000 docs [87.5% done]                       Extracting documents for index [opensearch_dashboards_sampl...     2/2500 docs [0.1% done]                       Extracting documents for index [opensearch_dashboards_sa...     626/2500 docs [25.0% done]                       Extracting documents for index [opensearch_dashboards_sampl...     4/2500 docs [0.2% done]                       Extracting documents for index [opensearch_dashboards_s...     1562/2500 docs [62.5% done]                       Extracting documents for index [opensearch_dashboards_s...     1874/2500 docs [75.0% done]                       Extracting documents for index [opensearch_dashboards_sa...     316/2500 docs [12.6% done]                       Extracting documents for index [opensearch_dashboards_s...     1250/2500 docs [50.0% done]                       Extracting documents for index [opensearch_dashboards_sa...     936/2500 docs [37.4% done]                       Extracting documents for index [opensearch_dashboards_s...     2496/2500 docs [99.8% done]
[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 5 seconds)
-------------------------------
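
One possible reading of the output above is that every thread reports its own position (2/1000, 251/1000, 627/1000, ...), so a lock around the print call alone would not stop the line from jumping back and forth. A minimal sketch of one way around that, assuming a single shared counter that is updated and redrawn inside the same lock. SharedProgress, extract_range and run are hypothetical names for illustration, not the functions in corpus.py:

```python
import sys
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedProgress:
    """Hypothetical helper: one counter and one output line for all threads."""

    def __init__(self, total_docs, out=sys.stdout):
        self.total_docs = total_docs
        self.done = 0
        self._lock = threading.Lock()
        self._out = out

    def advance(self, count=1):
        # Update the counter and redraw inside the same critical section,
        # so the printed value never goes backwards.
        with self._lock:
            self.done += count
            percent = 100.0 * self.done / self.total_docs
            self._out.write(f"\r{self.done}/{self.total_docs} docs [{percent:.1f}% done]")
            self._out.flush()


def extract_range(progress, start_doc, end_doc):
    # Stand-in for the real extraction: it only reports progress per document.
    for _ in range(start_doc, end_doc):
        progress.advance()


def run(total_docs, threads, out=sys.stdout):
    progress = SharedProgress(total_docs, out)
    step = -(-total_docs // threads)  # ceiling division
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(extract_range, progress, start, min(start + step, total_docs))
            for start in range(0, total_docs, step)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return progress.done
```

Calling run(1000, 8) redraws a single line that only ever counts upwards, regardless of which thread finishes a document first.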

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

client, index, out_path, start_doc, end_doc, total_docs, progress_message_suffix=""
):
"""
Extract documents in the range of start_doc and end_doc and write to induvidual files
Collaborator

Nit: induvidual files --> individual files

Extract documents in the range of start_doc and end_doc and write to induvidual files

:param client: OpenSearch client used to extract data
:param index: Name of index to dump
Collaborator

To help newcomers understand better, we can redefine this as:
name of OpenSearch index to extract documents from

@IanHoang
Collaborator

IanHoang commented Nov 1, 2023

Thanks for tackling this @AkshathRaghav!

We should still offer users the option to go with the original route and not perform concurrent dumping. Adding a parameter like --concurrent could easily allow users to switch between the two. Also, does this feature work when users want to extract 2+ indices? If so, could you test that out and provide it in the description? It'd be nice to include the following comparisons in the description to highlight the performance benefits:

  • single threaded extraction vs concurrent extraction of a single index
  • single threaded extraction vs concurrent extraction of 2+ indices
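
A minimal sketch of how a --concurrent switch could select between the two routes, assuming argparse is used for the CLI. The parser name and the two dump functions are placeholders for illustration, not the actual OSB code:

```python
import argparse


def dump_sequential(indices):
    # Placeholder for the original single-threaded extraction path.
    return [("sequential", index) for index in indices]


def dump_concurrent(indices):
    # Placeholder for the new concurrent extraction path.
    return [("concurrent", index) for index in indices]


def build_parser():
    parser = argparse.ArgumentParser(prog="create-workload-sketch")
    parser.add_argument("--indices", required=True,
                        help="Comma-separated list of indices to extract")
    parser.add_argument("--concurrent", action="store_true", default=False,
                        help="Extract documents concurrently (default: false)")
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    indices = args.indices.split(",")
    # The original route stays the default; the flag opts in to the new one.
    dump = dump_concurrent if args.concurrent else dump_sequential
    return dump(indices)
```

Because the flag is a store_true option, existing invocations without --concurrent keep the original behaviour.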

Some additional notes:

  1. For best practices, please provide a brief summary of the changes you're proposing within the PR description
  2. Please revise the following lint errors detected by CI Unittests:
osbenchmark/workload_generator/corpus.py:128:12: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)
osbenchmark/workload_generator/corpus.py:33:0: C0411: standard import "from threading import Lock" should be placed before "from osbenchmark.utils import console" (wrong-import-order)
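
A short sketch of what the two fixes usually look like. The logger name and the log message are made up for illustration, since the actual statement at corpus.py:128 is not shown here:

```python
import logging
from threading import Lock  # standard-library imports come first (C0411)

# Project imports such as "from osbenchmark.utils import console" would
# follow here, after the standard-library group.

logger = logging.getLogger("corpus_sketch")
_progress_lock = Lock()


def log_dump(index, doc_count):
    # W1203: pass the values as arguments instead of pre-formatting an
    # f-string, so the message is only built if the record is emitted.
    logger.info("Dumping %s documents from index [%s]", doc_count, index)
```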

@gkamat
Collaborator

gkamat commented Nov 3, 2023

@AkshathRaghav please also elaborate the title so it is clear which component of OSB this is addressing. Thanks

AkshathRaghav changed the title from "Multi-Processing Functionality" to "Multi-Processing Functionality for create-workload" on Nov 3, 2023
@AkshathRaghav
Contributor Author

Some benchmarks:

  • Single threaded extraction vs concurrent extraction of a single index
    • Concurrent
╰─$ opensearch-benchmark create-workload \                    
--workload=flights \
--target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights \
--number-of-docs=opensearch_dashboards_sample_data_flights:2500 \
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'" \
--concurrent

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents from opensearch_dashboards_sample_data_flights [for test mode]: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4939.46doc/s]
Extracting documents from opensearch_dashboards_sample_data_flights: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 6910.65doc/s]


[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 1 seconds)
-------------------------------
  • Single Threaded
╰─$ opensearch-benchmark create-workload \
--workload=flights \
--target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights \                                            
--number-of-docs=opensearch_dashboards_sample_data_flights:2500 \                                                 
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'"

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents from opensearch_dashboards_sample_data_flights [for test mode]: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4484.50doc/s]
Extracting documents from opensearch_dashboards_sample_data_flights: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 4878.57doc/s]


[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 1 seconds)
-------------------------------
  • Single threaded extraction vs concurrent extraction of 2+ indices
    • Concurrent
╰─$ opensearch-benchmark create-workload \
--workload=flights \
--target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_ecommerce \
--number-of-docs=opensearch_dashboards_sample_data_flights:2500,opensearch_dashboards_sample_data_ecommerce:1500 \
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'" \
--concurrent

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents from opensearch_dashboards_sample_data_flights [for test mode]: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4673.39doc/s]
Extracting documents from opensearch_dashboards_sample_data_flights: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 6522.05doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce [for test mode]: 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 3324.74doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 3821.05doc/s]


[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 2 seconds)
-------------------------------
  • Single Threaded
╰─$ opensearch-benchmark create-workload \
--workload=flights \
--target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_ecommerce \
--number-of-docs=opensearch_dashboards_sample_data_flights:2500,opensearch_dashboards_sample_data_ecommerce:1500 \
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'"  

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents from opensearch_dashboards_sample_data_flights [for test mode]: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4572.21doc/s]
Extracting documents from opensearch_dashboards_sample_data_flights: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 4899.36doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce [for test mode]: 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2700.82doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 2632.83doc/s]


[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 2 seconds)
-------------------------------

Each extraction finishes in under a second, so total run time is not a useful measure. Instead, I compared extraction and dump throughput (doc/s), which is higher in the concurrent runs: for example, 6910.65 doc/s versus 4878.57 doc/s for the single flights index.
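
As a rough way to put a number on that comparison, the doc/s figures from the progress bars above can be turned into speedup ratios. This is a small sketch using only the full (non test mode) extraction lines; the dictionary keys are labels chosen for illustration:

```python
# Throughput (doc/s) copied from the progress bars in the runs above.
RUNS = {
    "flights, single index": {"single": 4878.57, "concurrent": 6910.65},
    "flights, two indices": {"single": 4899.36, "concurrent": 6522.05},
    "ecommerce, two indices": {"single": 2632.83, "concurrent": 3821.05},
}


def speedup(run):
    # Ratio of concurrent throughput to single-threaded throughput.
    return run["concurrent"] / run["single"]


def report(runs):
    return {name: round(speedup(run), 2) for name, run in runs.items()}


if __name__ == "__main__":
    for name, ratio in report(RUNS).items():
        print(f"{name}: {ratio:.2f}x")
```

On these samples the ratios come out at roughly 1.3x to 1.5x, though with only a few thousand documents per index a single run is a noisy measurement.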

@gkamat I am working on making some parameters more configurable right now!

@AkshathRaghav
Contributor Author

@gkamat I've made the requested changes. Currently working on changing unit tests.

╰─$ opensearch-benchmark create-workload --workload=flights --target-host="https://127.0.0.1:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin'" \
--output-path=~/Workbench/Workloads \
--indices=opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_ecommerce \
--number-of-docs=opensearch_dashboards_sample_data_flights:2500,opensearch_dashboards_sample_data_ecommerce:1500 \
--client-options="timeout:300,use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'" \
--concurrent --threads=8 --bsize=50 --custom_dump_query=../test/custom_query.json


   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] Connected to OpenSearch cluster [opensearch-node1] version [2.11.0].

Extracting documents from opensearch_dashboards_sample_data_flights [for test mode]: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 6777.74doc/s]
Extracting documents from opensearch_dashboards_sample_data_flights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 9417.50doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce [for test mode]: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 5065.93doc/s]
Extracting documents from opensearch_dashboards_sample_data_ecommerce: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 5928.33doc/s]


[INFO] Workload flights has been created. Run it with: opensearch-benchmark --workload-path=/home/aksha/Workbench/Workloads/flights

-------------------------------
[INFO] SUCCESS (took 1 seconds)
-------------------------------
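
For readers wondering how --threads and --bsize might interact, here is a sketch of one plausible partitioning scheme: the document range is split into one contiguous slice per thread, and each thread walks its slice in batches. This is an assumption about the general approach, not a description of the code in this PR, and split_ranges and batches are hypothetical helpers:

```python
def split_ranges(total_docs, threads):
    """Split [0, total_docs) into at most `threads` contiguous (start, end) ranges."""
    if total_docs <= 0:
        return []
    step = -(-total_docs // threads)  # ceiling division
    return [(start, min(start + step, total_docs))
            for start in range(0, total_docs, step)]


def batches(start_doc, end_doc, batch_size):
    """Yield (offset, size) pairs that cover [start_doc, end_doc) batch by batch."""
    for offset in range(start_doc, end_doc, batch_size):
        yield offset, min(batch_size, end_doc - offset)
```

With 2500 documents, 8 threads and a batch size of 50, each thread would handle a slice of about 313 documents in seven batches, the last one smaller than the rest.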

AkshathRaghav and others added 8 commits November 8, 2023 16:39
Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
…e for documents. Needs documentation update, and configurability

Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
Signed-off-by: AkshathRaghav <[email protected]>
AkshathRaghav force-pushed the AkshathRaghav/multi_process_index_extraction branch from b6ea5ca to a4bbb2d on November 8, 2023 21:39
Signed-off-by: AkshathRaghav <[email protected]>
@AkshathRaghav
Contributor Author

@gkamat @IanHoang Hi! I think I've done most of what is required for this PR. Please let me know if everything is ok.
I've updated the unit tests so that "make it" and "make test" both pass, and I've checked the linting as well.

I tried adding the commit from the conflicting file, but security check still shows it. Not sure what to do on that front.
Another question I had was that the checks are not running for some reason... I'm not sure where I went wrong. Please let me know what I have to do to solve that!

@IanHoang
Collaborator

IanHoang commented Nov 9, 2023

@gkamat @IanHoang Hi! I think I've done most of what is required for this PR. Please let me know if everything is ok. I've changed unit tests such that make it and make test work. I've checked the linting as well.

I tried adding the commit from the conflicting file, but security check still shows it. Not sure what to do on that front. Another question I had was that the checks are not running for some reason... I'm not sure where I went wrong. Please let me know what I have to do to solve that!

The checks aren't running because there are conflicts in benchmark.py: changes to benchmark.py were merged into mainline this week. Pull the latest changes into your fork's mainline, then rebase the feature branch multi_processing onto it. That brings everything up to date and keeps your changes here on top.

help="Batch size to use for dumping documents (default: false)",
)
create_workload_parser.add_argument(
"--custom_dump_query",
Collaborator

@IanHoang IanHoang Nov 14, 2023

To keep it consistent, --custom_dump_query --> --custom-dump-query
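
For what it's worth, argparse maps hyphens in long option names to underscores in the parsed attribute, so code that reads args.custom_dump_query would not need to change after the rename. A small sketch; the parser name and help text are illustrative only:

```python
import argparse

parser = argparse.ArgumentParser(prog="create-workload-sketch")
parser.add_argument(
    "--custom-dump-query",
    default=None,
    help="Path to a JSON file with a custom query for dumping documents",
)

args = parser.parse_args(["--custom-dump-query", "custom_query.json"])
# argparse derives the attribute name by replacing "-" with "_".
query_path = args.custom_dump_query
```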

help="Number of threads for dumping documents from indices (default: false)",
)
create_workload_parser.add_argument(
"--bsize",
Collaborator

Nit: I think it's worth changing this to --batch-size to make it easier to understand on first glance and to keep it consistent with every other parameter that spells out the option. One thing we can do in the future is add an abbreviated option for specific parameters. For example, users can either use --batch-size or -b.
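
A sketch of how the spelled-out option and an abbreviated alias could be declared together in argparse. The positive_int helper is a stand-in for the project's positive_number type, and the default of 50 is only an assumption for illustration:

```python
import argparse


def positive_int(value):
    # Hypothetical stand-in for the positive_number type used in the diff.
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError(f"must be a positive integer, got {value}")
    return number


parser = argparse.ArgumentParser(prog="create-workload-sketch")
parser.add_argument(
    "-b", "--batch-size",
    type=positive_int,
    default=50,  # assumed default, not taken from the PR
    help="Batch size to use for dumping documents",
)
```

Both spellings populate the same attribute, args.batch_size, so the alias can be added later without touching the code that consumes the value.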

"--threads",
type=positive_number,
default=8,
help="Number of threads for dumping documents from indices (default: false)",
Collaborator

@IanHoang IanHoang Nov 14, 2023

To clarify, we should elaborate the help statement:

Number of threads used to extract documents from indices and dump to workload. --threads parameter should be used with --concurrent flag. By default, --threads is disabled. If no argument is provided to --threads parameter, number of threads defaults to 8. 
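
One way the behaviour described in that help text could be enforced, sketched with argparse. The check and the parse helper are illustrative, not the code in this PR; only the default of 8 is taken from the diff:

```python
import argparse

DEFAULT_THREADS = 8  # default value shown in the diff under review


def build_parser():
    parser = argparse.ArgumentParser(prog="create-workload-sketch")
    parser.add_argument("--concurrent", action="store_true", default=False)
    # default=None lets us tell "not given" apart from an explicit value.
    parser.add_argument("--threads", type=int, default=None)
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.threads is not None and not args.concurrent:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--threads should be used with --concurrent")
    if args.concurrent and args.threads is None:
        args.threads = DEFAULT_THREADS
    return args
```

Leaving the argparse default at None and applying the default of 8 afterwards is what makes it possible to detect --threads being passed without --concurrent.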

IanHoang closed this on Jan 31, 2024