cyclonedds kills nodes with THREAD_STATE_ALIVE failed #486
Comments
Could you add steps to replicate this issue?
@yvzksgl could you share a reproducible, self-contained sample as @armaganarsln asked? That would be helpful. CC: @eboasson
Don't worry about the socket buffer configuration in relation to this assertion failure. Cyclone internally tracks what threads are doing so that it can defer releasing memory in some cases, and this is an internal thread of Cyclone trying to update a counter related to that mechanism. I know it is an internal thread because of the name it uses. It really is weird that it complains the thread is not registered as "alive": for internal threads, the control block is allocated and initialized before the thread starts doing things and cleaned up after it has left the thread's main function. My first thought is that this is a bug that we're lucky to have discovered the existence of. I have a suspicion that a self-contained sample might be tricky ... but as a first step towards figuring out what is happening, perhaps you can reproduce it and get stack traces of all the threads? Like, let it core dump, load the core dump in gdb, and dump the backtraces of all threads.
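The debugging steps suggested above can be sketched roughly as follows. This is a generic recipe, not something specific to this issue; `<node-binary>` and `<core-file>` are placeholders you would replace with your own paths:

```
# Allow core dumps in this shell session (assumption: Linux).
ulimit -c unlimited

# Run the node until it hits the assertion and dumps core, then
# load the core in gdb and write out every thread's backtrace.
gdb <node-binary> <core-file> -batch -ex "thread apply all bt full" > backtraces.txt
```

Installing the debug symbols for the ROS 2 / Cyclone DDS packages first (if your distribution provides them) makes the resulting backtraces far more useful.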
Thank you all for your comments. @eboasson I will apply the steps you mentioned. However, the bug rarely occurs; when I encounter the error again, I will try to provide everything you need.
Hello again. I managed to reproduce the issue. There are '??'s in the backtrace where the binaries lack symbols; I tried to solve that, but the solutions I found didn't work, so I decided to share the output as it is. I hope it works. Here is the output:
While trying to reproduce the issue, DDS hit one more assertion failure:
I hope these will be helpful. I can try to provide whatever else you need. Thanks in advance. @eboasson
The dreaded no-debug-symbols villain strikes again ... The second one could plausibly be a consequence of the first one. It does give something to ponder: there are actually multiple threads asserting the same thing at the same time, "tev" and "recvMC". That makes it less likely that it is some quirk in one specific thread, and more likely that the array(s) of thread states have been corrupted. It looks like you don't have many threads in this process, but that doesn't necessarily rule out a connection to a7dcf8691edf284b775c6fb9e926eeff4591e9c9, because you could be creating and deleting threads all the time. Are you, by chance?
Thank you for your support @eboasson. I have functions in my node that use threads. I will try to perform those operations with something other than threads, then repeat the tests I did while trying to kill the node and share the results here. PS: regarding a7dcf8691edf284b775c6fb9e926eeff4591e9c9, could you repost the link? I cannot access it.
Usually GitHub auto-creates a link for a commit hash: eclipse-cyclonedds/cyclonedds@a7dcf86. Related to it is also eclipse-cyclonedds/cyclonedds@b0727e5. Those two removed the old hard-coded limit on the thread count. I am not aware of any problems with that work; it only came to mind because your issue seems to have something to do with these tables of thread states getting messed up. I probably only confused you by mentioning it ... I am sorry about that!
Hi, I eliminated the thread implementation in my node, and that seems to have fixed the problem. However, I'm not sure whether the problem was caused by my thread implementation or by DDS. In my previous comment I accidentally shared an incomplete gdb output; here is the full output of gdb:
I hope it will give some tips.
Bug report
Required Info:
My related sysctl settings:
net.core.rmem_default = 212992
net.core.rmem_max = 2147483647
net.ipv4.ipfrag_time = 3
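For reference, these keys can be adjusted at runtime with `sysctl`. This is a sketch assuming Linux and root privileges, using the 8388608 value mentioned in the README:

```
# Apply immediately (lost on reboot); requires root.
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.rmem_default=8388608

# To persist across reboots, put the same key=value lines in a file
# under /etc/sysctl.d/ and reload with: sysctl --system
```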
While I was running more than 200 nodes on my PC, one of my nodes died with:
/opt/ros/humble/include/dds/ddsi/q_thread.h:228: thread_state_awake_fixed_domain: Assertion 'thrst->state == THREAD_STATE_ALIVE' failed
In the README it says to set net.core.rmem_max=8388608 and net.core.rmem_default=8388608. Could my error be related to these settings? Should I set them to exactly 8388608, or can I set both to 2147483647? Do I have to set max and default to the same value?
Thanks in advance