
[ERROR] Cannot execute_test. Error in worker_coordinator ('old') (Azul GC / Zing JDK) #206

Open
rudziankou opened this issue Aug 16, 2022 · 3 comments
Labels: enhancement (New feature or request)

rudziankou commented Aug 16, 2022

Hi folks, I ran the benchmark against an existing OpenSearch 2.2.1 cluster and got the following error:

2022-10-25 17:24:05,778 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Telling worker_coordinator to start benchmark.
2022-10-25 17:24:05,779 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Benchmark is about to start.
2022-10-25 17:24:05,780 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Attaching cluster-level telemetry devices.
2022-10-25 17:24:06,801 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Received a benchmark failure from [ActorAddr-(T|:53454)] and will forward it now.
2022-10-25 17:24:06,699 ActorAddr-(T|:53454)/PID:51383 osbenchmark.telemetry INFO JvmStatsSummary on benchmark start
2022-10-25 17:24:06,783 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor ERROR Error in worker_coordinator
Traceback (most recent call last):

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/actor.py", line 92, in guard
return f(self, msg, sender)

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 265, in receiveMsg_StartBenchmark
self.coordinator.start_benchmark()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 657, in start_benchmark
self.telemetry.on_benchmark_start()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 85, in on_benchmark_start
device.on_benchmark_start()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1381, in on_benchmark_start
self.jvm_stats_per_node = self.jvm_stats()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1440, in jvm_stats
old_gen_collection_time = gc["old"]["collection_time_in_millis"]

KeyError: 'old'

2022-10-25 17:24:06,797 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO A workload preparator has exited.
2022-10-25 17:24:06,805 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator ERROR A benchmark failure has occurred
2022-10-25 17:24:06,806 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator INFO Telling benchmark actor to exit.
2022-10-25 17:24:06,807 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2022-10-25 17:24:06,808 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO Main worker_coordinator received ActorExitRequest and will terminate all load generators.
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53454)] (ignoring).
2022-10-25 17:24:06,809 ActorAddr-(T|:53451)/PID:51382 osbenchmark.actor INFO BuilderActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:53432)])
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53451)] (ignoring).
2022-10-25 17:24:09,812 -not-actor-/PID:51345 osbenchmark.Benchmark INFO Attempting to shutdown internal actor system.
2022-10-25 17:24:09,819 -not-actor-/PID:51362 root INFO ActorSystem Logging Shutdown
2022-10-25 17:24:09,843 -not-actor-/PID:51361 root INFO ---- Actor System shutdown
2022-10-25 17:24:09,846 -not-actor-/PID:51345 osbenchmark.benchmark INFO Actor system is still running. Waiting...
2022-10-25 17:24:10,853 -not-actor-/PID:51345 osbenchmark.benchmark INFO Shutdown completed.
2022-10-25 17:24:10,854 -not-actor-/PID:51345 osbenchmark.benchmark ERROR Cannot run subcommand [execute_test].
Traceback (most recent call last):
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 893, in dispatch_sub_command
execute_test(cfg, args.kill_running_processes)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 661, in execute_test
with_actor_system(test_execution_orchestrator.run, cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 688, in with_actor_system
runnable(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 379, in run
raise e
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 376, in run
pipeline(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in call
self.target(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
return execute_test(cfg, external=True)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 273, in execute_test
raise exceptions.BenchmarkError(result.message, result.cause)
osbenchmark.exceptions.BenchmarkError: Error in worker_coordinator ('old')


Here is the command that I ran:
opensearch-benchmark execute_test --workload nyc_taxis --pipeline benchmark-only --target-hosts "host01:8900" --client-options "verify_certs:false,use_ssl:true,basic_auth_user:admin,basic_auth_password:admin"


It looks like the issue is here; the _nodes/stats output does not match what this code expects:
https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/telemetry.py#L1440-L1443


_nodes/stats output:
"jvm": {
"timestamp": 1666719053364,
"uptime_in_millis": 1712255684,
"mem": {
"heap_used_in_bytes": 13321109504,
"heap_used_percent": 79,
"heap_committed_in_bytes": 16682844160,
"heap_max_in_bytes": 16682844160,
"non_heap_used_in_bytes": 3457548288,
"non_heap_committed_in_bytes": 3457548288,
"pools": {}
},
"threads": {
"count": 67,
"peak_count": 70
},
"gc": {
"collectors": {
"GPGC New": {
"collection_count": 85,
"collection_time_in_millis": 6904
},
"GPGC Old": {
"collection_count": 85,
"collection_time_in_millis": 52220
}
}
},
"buffer_pools": {
"mapped": {
"count": 0,
"used_in_bytes": 0,
"total_capacity_in_bytes": 0
},
"direct": {
"count": 20,
"used_in_bytes": 8463983,
"total_capacity_in_bytes": 8463982
}
},
"classes": {
"current_loaded_count": 20123,
"total_loaded_count": 22803,
"total_unloaded_count": 2680
}
}
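
From the traceback, the telemetry code reads the generational keys directly (gc["old"]["collection_time_in_millis"]), while Zing's GPGC reports its collectors as "GPGC New" and "GPGC Old", so the lookup fails. A minimal sketch of a more tolerant lookup, assuming the collectors dict shown above (the alias table and helper name are hypothetical, not the actual OSB implementation):

# Sketch only: map logical generations to known collector names, including
# Zing's GPGC names from this issue. Not the actual OSB implementation.
GC_ALIASES = {
    "young": ("young", "GPGC New"),
    "old": ("old", "GPGC Old"),
}

def collector_stats(collectors, generation):
    """Return (collection_count, collection_time_in_millis) for a logical
    generation, or (0, 0) if no known collector name is present."""
    for name in GC_ALIASES.get(generation, (generation,)):
        stats = collectors.get(name)
        if stats is not None:
            return (stats.get("collection_count", 0),
                    stats.get("collection_time_in_millis", 0))
    return 0, 0

# Using the Zing output above:
collectors = {
    "GPGC New": {"collection_count": 85, "collection_time_in_millis": 6904},
    "GPGC Old": {"collection_count": 85, "collection_time_in_millis": 52220},
}
print(collector_stats(collectors, "old"))    # -> (85, 52220)
print(collector_stats(collectors, "young"))  # -> (85, 6904)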

@IanHoang @travisbenedict guys, could you please check?

@rudziankou (Author) commented:

I deployed a single-node cluster locally and compared the /_nodes/stats/jvm query outputs:
Existing cluster:
"gc": {
"collectors": {
"GPGC New": {
"collection_count": 6163,
"collection_time_in_millis": 1739134
},
"GPGC Old": {
"collection_count": 1095,
"collection_time_in_millis": 1544272
}
}
}
New single-node local cluster:
"gc": {
  "collectors": {
    "young": {
      "collection_count": 9,
      "collection_time_in_millis": 203
    },
    "old": {
      "collection_count": 0,
      "collection_time_in_millis": 0
    }
  }
}

The existing cluster runs on the Zing JDK, while the local cluster runs on OpenJDK. Some of the GC collector names differ under Zing, which is why Benchmark fails against the existing cluster.
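
For anyone who wants to check which collector names a cluster exposes before running OSB, a quick standalone query works. This is just a sketch; the host, credentials, and disabled certificate verification are assumptions that mirror the command above:

import base64
import json
import ssl
import urllib.request

host = "https://host01:8900"  # hypothetical target, as in the command above
token = base64.b64encode(b"admin:admin").decode()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # equivalent of verify_certs:false

req = urllib.request.Request(host + "/_nodes/stats/jvm",
                             headers={"Authorization": "Basic " + token})
with urllib.request.urlopen(req, context=ctx) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    print(node_id, sorted(node["jvm"]["gc"]["collectors"]))
# OpenJDK (G1GC): ['old', 'young']
# Zing (GPGC):    ['GPGC New', 'GPGC Old']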

@IanHoang (Collaborator) commented:

Thanks for bringing this to our attention. Just to clarify: you're running an OpenSearch 2.2.1 cluster on the Zing JDK and receiving the following error?

old_gen_collection_time = gc["old"]["collection_time_in_millis"]

KeyError: 'old'

However, when you run against another local OpenSearch 2.2.1 cluster on OpenJDK, you do not experience any issues? We have another open issue (#242) that reports the same failure because OSB currently does not support the Shenandoah GC, which has no concept of old, new, or permanent generations.

IanHoang added the enhancement (New feature or request) label on Mar 30, 2023
@IanHoang (Collaborator) commented:

I'm not familiar with the Azul Zing JDK, but at a quick glance it looks like it provides an alternative, pauseless GC compared to OpenJDK's default G1GC. This confirms that the issue is similar to #242 and should be regarded as an enhancement rather than a bug, because OSB currently only supports GCs with old/young generation concepts, such as G1GC and CMS.
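
One possible shape for that enhancement (a sketch only, not a committed design): when the generational "old"/"young" keys are absent, report whatever collectors the node exposes under their own names instead of failing:

def per_collector_metrics(node_stats):
    """Sketch: collect (count, time) per collector regardless of its name,
    so pauseless / non-generational GCs (Zing GPGC, Shenandoah) don't break
    telemetry. The metric naming here is illustrative only."""
    collectors = node_stats["jvm"]["gc"]["collectors"]
    metrics = {}
    for name, stats in collectors.items():
        key = name.lower().replace(" ", "_")  # e.g. "GPGC Old" -> "gpgc_old"
        metrics[key + "_gc_count"] = stats.get("collection_count", 0)
        metrics[key + "_gc_time_millis"] = stats.get("collection_time_in_millis", 0)
    return metrics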

IanHoang changed the title from "[ERROR] Cannot execute_test. Error in worker_coordinator ('old')" to "[ERROR] Cannot execute_test. Error in worker_coordinator ('old') (Azul GC / Zing JDK)" on Apr 4, 2023