Rally hangs indefinitely on OSX when running 'esrally race --revision' #1575
Comments
On macOS, it only reproduces when Rally must invoke a build; otherwise, the race succeeds:
uname -a
Darwin grape.lan 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64 x86_64
The behavior is consistent with the rally.ini configuration for the local repository cache; i.e., do not cache artifacts:
[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch
cache = false
During a successful benchmark, a stream of
Starting a run we know will fail (one without a cached ES artifact),
then ~85s later right before the failure point:
The repro and error are the same every time. I don't know yet what any of this means, but it has me thinking that either: 1) the ES build is getting in the way, or 2) there is something wrong in the actor system or how it is used.
You can stub out the build command with something simple (i.e.
We should probably try and put together a minimal reproduction, but TBH I am not super familiar with how this part of the codebase works, and it's going to require quite some effort.
The differences with the socket local and remote addresses might be a red herring. Since Rally uses thespian 3.10.1, I tried 3.10.3 just to see if it would have an impact and it did not. I suspect the issue has more to do with attempting to act on an invalid TCP socket. I can confirm the issue occurs somewhere in here, reliably, every time ES is built from source:
If Rally does not have to initiate an ES build, the benchmark always succeeds.
rally.log
Actor system log
Noting a break here since it takes a moment to download ES source.
The Actor system error occurring right at
I reported the issue upstream: thespianpy/Thespian#70
Right now, if anyone faces the issue, I'd recommend applying the local workaround until we hear more from the Thespian side.
I encountered another way to reproduce the issue after forgetting to use the workaround. The stall can happen directly after downloading elastic/logs track corpora and building file offsets. It reproduces pretty reliably on macOS. https://gist.github.com/inqueue/ddb5c28bb0512ebdc26391bd32e10d91
Rally version (get with esrally --version): Only attempted reproduction with main/master branch: esrally 2.6.1.dev0, but suspect more branches are affected.
Invoked command:
Configuration file (located in ~/.rally/rally.ini):
JVM version: N/A (reproducible without JVM installed)
OS version:
Description of the problem including expected versus actual behavior:
Expected: Rally installs and starts specified revision of Elasticsearch and plugin(s) and then runs the specified benchmark experiment against the cluster
Actual: Rally hangs indefinitely after having successfully built Elasticsearch and any plugin(s) from source
Note that esrally install ... works as expected using the same code paths as esrally race --revision initially does. The only difference between the two is that esrally race --revision starts the Actor subsystem.
Steps to reproduce:
python -m venv .venv && source .venv/bin/activate && make install
esrally race --revision="@2022-09-11" --track=geonames --test-mode --kill-running-processes --target-hosts=127.0.0.1:29200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --runtime-jdk=bundled
or
pytest -s it -k test_sources
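Since the failure mode is an indefinite hang rather than an error, a scripted repro needs a deadline to notice it. The following is a minimal sketch, not part of Rally: `run_with_deadline` is a hypothetical helper name, and the deadline value is an assumption you would pick to be comfortably longer than a normal build plus test-mode race.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd and report whether it exceeded the deadline.

    Returns (hung, returncode). hung is True when the command was still
    running after deadline_s seconds; returncode is None in that case.
    """
    try:
        completed = subprocess.run(cmd, timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        # Processes spawned by that child (for example actor processes)
        # may survive and need separate cleanup.
        return True, None
    return False, completed.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python repro.py esrally race --revision=... (other flags)
    # The 1800 s deadline is an arbitrary assumption, not a value from this issue.
    hung, rc = run_with_deadline(sys.argv[1:], 1800)
    print("hung" if hung else f"exited with {rc}")
```

Passing the esrally race command from the steps above as the arguments would turn a silent stall into a visible "hung" result, which may help when bisecting or testing the workaround.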
Provide logs (if relevant):
py-spy dump:
Setting Thespian debug logs with:
Allowed me to capture this as Rally 'hung':
Full actor debug logs
Thanks to @pquentin, we found a workaround for this by commenting out line 1342 in
.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py
We're not exactly sure what is happening (envoyproxy/envoy#1446 suggests that perhaps we're trying to set an option on a socket that is shut down), but this appears to only affect OSX.
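If the theory from envoyproxy/envoy#1446 is right, a gentler alternative to commenting the line out would be to tolerate the failure rather than skip the call. The sketch below is an illustration only and is not Thespian's code: `set_option_tolerant` is a hypothetical helper, and the thread does not state which socket option the line at 1342 sets, so `SO_KEEPALIVE` in the usage lines is just a stand-in.

```python
import socket


def set_option_tolerant(sock, level, option, value):
    """Try to set a socket option; return False if the OS rejects it.

    Assumption: the problematic call is a setsockopt on a socket that
    may already be shut down or closed, which raises OSError.
    """
    try:
        sock.setsockopt(level, option, value)
    except OSError:
        # The option could not be applied; the caller decides whether
        # that matters for a socket that is going away anyway.
        return False
    return True


# Stand-in option for illustration; works on an open socket...
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied_open = set_option_tolerant(s, socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.close()
# ...and reports failure instead of raising once the socket is closed.
applied_closed = set_option_tolerant(s, socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
```

Whether swallowing the error is safe depends on what the surrounding transport code expects afterwards, which is a question for the upstream Thespian issue.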