
[ERROR] Cannot execute_test. Error in worker_coordinator ('old') (Azul GC / Zing JDK) #206

Open
rudziankou opened this issue Aug 16, 2022 · 3 comments
Labels: enhancement (New feature or request)

rudziankou commented Aug 16, 2022

Hi folks, I ran the benchmark against an existing OpenSearch 2.2.1 cluster and got the following error:

2022-10-25 17:24:05,778 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Telling worker_coordinator to start benchmark.
2022-10-25 17:24:05,779 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Benchmark is about to start.
2022-10-25 17:24:05,780 ActorAddr-(T|:53454)/PID:51383 osbenchmark.worker_coordinator.worker_coordinator INFO Attaching cluster-level telemetry devices.
2022-10-25 17:24:06,801 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO Received a benchmark failure from [ActorAddr-(T|:53454)] and will forward it now.
2022-10-25 17:24:06,699 ActorAddr-(T|:53454)/PID:51383 osbenchmark.telemetry INFO JvmStatsSummary on benchmark start
2022-10-25 17:24:06,783 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor ERROR Error in worker_coordinator
Traceback (most recent call last):

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/actor.py", line 92, in guard
return f(self, msg, sender)

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 265, in receiveMsg_StartBenchmark
self.coordinator.start_benchmark()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 657, in start_benchmark
self.telemetry.on_benchmark_start()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 85, in on_benchmark_start
device.on_benchmark_start()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1381, in on_benchmark_start
self.jvm_stats_per_node = self.jvm_stats()

File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/telemetry.py", line 1440, in jvm_stats
old_gen_collection_time = gc["old"]["collection_time_in_millis"]

KeyError: 'old'

2022-10-25 17:24:06,797 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO A workload preparator has exited.
2022-10-25 17:24:06,805 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator ERROR A benchmark failure has occurred
2022-10-25 17:24:06,806 -not-actor-/PID:51345 osbenchmark.test_execution_orchestrator INFO Telling benchmark actor to exit.
2022-10-25 17:24:06,807 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2022-10-25 17:24:06,808 ActorAddr-(T|:53454)/PID:51383 osbenchmark.actor INFO Main worker_coordinator received ActorExitRequest and will terminate all load generators.
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53454)] (ignoring).
2022-10-25 17:24:06,809 ActorAddr-(T|:53451)/PID:51382 osbenchmark.actor INFO BuilderActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:53432)])
2022-10-25 17:24:06,810 ActorAddr-(T|:53432)/PID:51363 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:53451)] (ignoring).
2022-10-25 17:24:09,812 -not-actor-/PID:51345 osbenchmark.Benchmark INFO Attempting to shutdown internal actor system.
2022-10-25 17:24:09,819 -not-actor-/PID:51362 root INFO ActorSystem Logging Shutdown
2022-10-25 17:24:09,843 -not-actor-/PID:51361 root INFO ---- Actor System shutdown
2022-10-25 17:24:09,846 -not-actor-/PID:51345 osbenchmark.benchmark INFO Actor system is still running. Waiting...
2022-10-25 17:24:10,853 -not-actor-/PID:51345 osbenchmark.benchmark INFO Shutdown completed.
2022-10-25 17:24:10,854 -not-actor-/PID:51345 osbenchmark.benchmark ERROR Cannot run subcommand [execute_test].
Traceback (most recent call last):
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 893, in dispatch_sub_command
execute_test(cfg, args.kill_running_processes)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 661, in execute_test
with_actor_system(test_execution_orchestrator.run, cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/benchmark.py", line 688, in with_actor_system
runnable(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 379, in run
raise e
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 376, in run
pipeline(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 69, in call
self.target(cfg)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 314, in benchmark_only
return execute_test(cfg, external=True)
File "/Users/user1/work/opensearch-benchmark/venv/lib/python3.9/site-packages/osbenchmark/test_execution_orchestrator.py", line 273, in execute_test
raise exceptions.BenchmarkError(result.message, result.cause)
osbenchmark.exceptions.BenchmarkError: Error in worker_coordinator ('old')


Here is the command that I ran:
opensearch-benchmark execute_test --workload nyc_taxis --pipeline benchmark-only --target-hosts "host01:8900" --client-options "verify_certs:false,use_ssl:true,basic_auth_user:admin,basic_auth_password:admin"


It looks like the issue is here; the _nodes/stats output does not match what this code expects:
https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/telemetry.py#L1440-L1443


_nodes/stats output:
"jvm": {
"timestamp": 1666719053364,
"uptime_in_millis": 1712255684,
"mem": {
"heap_used_in_bytes": 13321109504,
"heap_used_percent": 79,
"heap_committed_in_bytes": 16682844160,
"heap_max_in_bytes": 16682844160,
"non_heap_used_in_bytes": 3457548288,
"non_heap_committed_in_bytes": 3457548288,
"pools": {}
},
"threads": {
"count": 67,
"peak_count": 70
},
"gc": {
"collectors": {
"GPGC New": {
"collection_count": 85,
"collection_time_in_millis": 6904
},
"GPGC Old": {
"collection_count": 85,
"collection_time_in_millis": 52220
}
}
},
"buffer_pools": {
"mapped": {
"count": 0,
"used_in_bytes": 0,
"total_capacity_in_bytes": 0
},
"direct": {
"count": 20,
"used_in_bytes": 8463983,
"total_capacity_in_bytes": 8463982
}
},
"classes": {
"current_loaded_count": 20123,
"total_loaded_count": 22803,
"total_unloaded_count": 2680
}
}
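
From the traceback, the telemetry code reads the generational keys directly (gc["old"]["collection_time_in_millis"]), while Zing's GPGC reports its collectors as "GPGC New" and "GPGC Old", so the lookup fails. A minimal sketch of a more tolerant lookup, assuming the collectors dict shown above (the alias table and helper name are hypothetical, not the actual OSB implementation):

# Sketch only: map logical generations to known collector names, including
# Zing's GPGC names from this issue. Not the actual OSB implementation.
GC_ALIASES = {
    "young": ("young", "GPGC New"),
    "old": ("old", "GPGC Old"),
}

def collector_stats(collectors, generation):
    """Return (collection_count, collection_time_in_millis) for a logical
    generation, or (0, 0) if no known collector name is present."""
    for name in GC_ALIASES.get(generation, (generation,)):
        stats = collectors.get(name)
        if stats is not None:
            return (stats.get("collection_count", 0),
                    stats.get("collection_time_in_millis", 0))
    return 0, 0

# Using the Zing output above:
collectors = {
    "GPGC New": {"collection_count": 85, "collection_time_in_millis": 6904},
    "GPGC Old": {"collection_count": 85, "collection_time_in_millis": 52220},
}
print(collector_stats(collectors, "old"))    # -> (85, 52220)
print(collector_stats(collectors, "young"))  # -> (85, 6904)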

@IanHoang @travisbenedict guys, could you please check?

@rudziankou (Author) commented:

I deployed a single-node cluster locally and compared the /_nodes/stats/jvm query outputs:
Existing cluster:
"gc": {
"collectors": {
"GPGC New": {
"collection_count": 6163,
"collection_time_in_millis": 1739134
},
"GPGC Old": {
"collection_count": 1095,
"collection_time_in_millis": 1544272
}
}
}
New single-node local cluster:
"gc": {
  "collectors": {
    "young": {
      "collection_count": 9,
      "collection_time_in_millis": 203
    },
    "old": {
      "collection_count": 0,
      "collection_time_in_millis": 0
    }
  }
}

The existing cluster runs on the Zing JDK, while the local cluster runs on OpenJDK. Some of the GC collector names differ under Zing, which is why Benchmark fails against the existing cluster.
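
For anyone who wants to check which collector names a cluster exposes before running OSB, a quick standalone query works. This is just a sketch; the host, credentials, and disabled certificate verification are assumptions that mirror the command above:

import base64
import json
import ssl
import urllib.request

host = "https://host01:8900"  # hypothetical target, as in the command above
token = base64.b64encode(b"admin:admin").decode()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # equivalent of verify_certs:false

req = urllib.request.Request(host + "/_nodes/stats/jvm",
                             headers={"Authorization": "Basic " + token})
with urllib.request.urlopen(req, context=ctx) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    print(node_id, sorted(node["jvm"]["gc"]["collectors"]))
# OpenJDK (G1GC): ['old', 'young']
# Zing (GPGC):    ['GPGC New', 'GPGC Old']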

@IanHoang (Collaborator) commented:

Thanks for bringing this to our attention. Just to clarify: you're running an OpenSearch 2.2.1 cluster on the Zing JDK and receiving the following error?

old_gen_collection_time = gc["old"]["collection_time_in_millis"]

KeyError: 'old'

However, when you run against another local OpenSearch 2.2.1 cluster on OpenJDK, you do not experience any issues? We have another open issue (#242) that reports the same failure because OSB currently does not support the Shenandoah GC, which has no concept of old, new, or permanent generations.

IanHoang added the enhancement (New feature or request) label on Mar 30, 2023
@IanHoang (Collaborator) commented:

I'm not familiar with the Azul Zing JDK, but at a quick glance it looks like it provides an alternative, pauseless GC compared to OpenJDK's default G1GC. This confirms that the issue is similar to #242 and should be regarded as an enhancement rather than a bug, because OSB currently only supports GCs with old/young generation concepts, such as G1GC and CMS.
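
One possible shape for that enhancement (a sketch only, not a committed design): when the generational "old"/"young" keys are absent, report whatever collectors the node exposes under their own names instead of failing:

def per_collector_metrics(node_stats):
    """Sketch: collect (count, time) per collector regardless of its name,
    so pauseless / non-generational GCs (Zing GPGC, Shenandoah) don't break
    telemetry. The metric naming here is illustrative only."""
    collectors = node_stats["jvm"]["gc"]["collectors"]
    metrics = {}
    for name, stats in collectors.items():
        key = name.lower().replace(" ", "_")  # e.g. "GPGC Old" -> "gpgc_old"
        metrics[key + "_gc_count"] = stats.get("collection_count", 0)
        metrics[key + "_gc_time_millis"] = stats.get("collection_time_in_millis", 0)
    return metrics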

IanHoang changed the title from "[ERROR] Cannot execute_test. Error in worker_coordinator ('old')" to "[ERROR] Cannot execute_test. Error in worker_coordinator ('old') (Azul GC / Zing JDK)" on Apr 4, 2023