Hang loading components #182
Comments
I have updated my Iron environment and am still experiencing the same issue: a node loading a component in the bash for loop eventually hangs. I am observing a different issue in Jazzy and will follow up with a more detailed description. Between the Iron and Jazzy releases, the use of the global logger mutex changed in rclpy: Iron takes it on this path, but that is not the case for Jazzy. That change was likely made by an rclpy pull request addressing a similar kind of issue, but in our case a different set of RMW-specific locks is involved.
As mentioned above, in Iron, a node loading a component in the bash for loop will eventually experience a hang. In Jazzy, the component_container node will eventually experience a hang.
Thread 1 wants to acquire a read lock; another thread must already hold the corresponding write lock. We can see from frame 20 that thread 2 has acquired the write lock. Thread 2 is itself blocked and therefore cannot release it. We can see that thread 1 is also blocked in the same area of the code.

rmw_zenoh: rolling (66ec842)
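To make the reader/writer interaction above easier to follow, here is a minimal, hypothetical C++ sketch (not rmw_zenoh source; all names are placeholders): a thread that takes the write lock on a shared cache and then blocks on something else stalls every thread waiting for the read lock. The program intentionally hangs.

```cpp
// Hypothetical illustration only, not rmw_zenoh source. The program
// intentionally hangs to mimic the backtraces above: the "writer" thread
// holds the exclusive lock on a shared cache and then blocks on an
// unrelated resource, so the "reader" thread can never take the shared lock.
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex cache_lock;   // stands in for the graph-cache rwlock
std::mutex other_resource;      // stands in for whatever the writer waits on

int main() {
  // Hold "other_resource" in main so the writer below can never get it.
  std::unique_lock<std::mutex> hold_other(other_resource);

  std::thread writer([] {
    std::unique_lock<std::shared_mutex> w(cache_lock);  // takes the write lock
    std::cout << "writer: holding write lock, waiting on other resource\n";
    std::unique_lock<std::mutex> blocked(other_resource);  // blocks forever
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  std::thread reader([] {
    std::cout << "reader: waiting for read lock\n";
    std::shared_lock<std::shared_mutex> r(cache_lock);  // never succeeds
    std::cout << "reader: acquired read lock\n";        // never printed
  });

  writer.join();  // hangs: this is the shape of the deadlock reported above
  reader.join();
}
```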
I don't think the problem is related to #170 or to Python.
So the root of the problem is that Zenoh control messages (e.g. liveliness events) lead to a user callback being called while Zenoh is holding an internal lock, which blocks any other thread trying to send a message with Zenoh. @Yadunund, in the meantime a workaround could be to make sure no mutex is taken in liveliness callbacks.
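As a rough illustration of the suggested workaround (hypothetical names, not the actual rmw_zenoh callback), the liveliness callback can avoid waiting on any contended mutex by only enqueueing the event; a separate worker thread, which is free to block, then takes the graph mutex and does the real work:

```cpp
// Hypothetical sketch, not rmw_zenoh code: defer graph-cache work out of the
// Zenoh liveliness callback so the callback never blocks on a lock that other
// threads may hold while calling into Zenoh.
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

struct LivelinessEvent { std::string token; bool alive; };

std::mutex queue_mutex;              // guards only the handoff queue
std::condition_variable queue_cv;
std::deque<LivelinessEvent> pending;
bool stop = false;

// Called from inside Zenoh while Zenoh may hold its internal lock:
// only hand the event off, never touch the graph mutex here.
void liveliness_callback(LivelinessEvent ev) {
  {
    std::lock_guard<std::mutex> lk(queue_mutex);
    pending.push_back(std::move(ev));
  }
  queue_cv.notify_one();
}

// Worker thread: free to take the (hypothetical) graph mutex and even to
// call back into Zenoh, because Zenoh's internal lock is not held here.
void graph_worker(std::mutex & graph_mutex) {
  for (;;) {
    std::unique_lock<std::mutex> lk(queue_mutex);
    queue_cv.wait(lk, [] { return stop || !pending.empty(); });
    if (pending.empty()) { return; }
    LivelinessEvent ev = std::move(pending.front());
    pending.pop_front();
    lk.unlock();

    std::lock_guard<std::mutex> g(graph_mutex);  // safe: the callback never waits on this
    std::cout << "graph update: " << ev.token << (ev.alive ? " alive" : " gone") << "\n";
  }
}

int main() {
  std::mutex graph_mutex;  // stands in for the RMW graph-cache mutex
  std::thread worker(graph_worker, std::ref(graph_mutex));

  liveliness_callback({"liveliness/token/node_1", true});   // simulated events
  liveliness_callback({"liveliness/token/node_1", false});

  { std::lock_guard<std::mutex> lk(queue_mutex); stop = true; }
  queue_cv.notify_one();
  worker.join();
}
```

The trade-off is that graph updates become asynchronous with respect to the liveliness event, so code that reads the graph immediately after an event may briefly see stale data.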
As discussed yesterday, it is a bit strange that logging from the …
Actually I was wrong, sorry for the confusion... I was confused with a different frame.
That's good to know. I'll wait for the fix for eclipse-zenoh/zenoh#1092.
We have a fix candidate in a branch: https://github.com/eclipse-zenoh/zenoh/commits/dev/reentrant_declares/

@hwoithe I reproduced the same issue with your scenario, and switching to #200 fixes the issue for me.
I am a bit short on time today, but I was able to quickly take a look and I am still experiencing the same problem under Jazzy. I have not yet retested Iron. A quick debug session:
Thread 1 wants the read lock in frame 5 and has already acquired another lock. Thread 5 wants to acquire that other lock, and thread 5 holds the write lock. I don't know the codebase well enough yet and haven't found whether thread 5 is supposed to drop the write lock and/or the other lock at that point.

Other threads:
Threads 6, 8, and 9 are also waiting in frame 5.

Also, is this referencing the correct issue number?

My environment:
rclcpp: 8ab10aabc0304fe2a26f54a67274cdbc51b65602
Thanks for your testing and quick feedback. We'll analyse your backtraces.
The purpose of the patch is precisely to release that internal lock before user callbacks are called.
It isn't, since this patch is not in a pull request yet. I will update the comment once we have a zenoh PR number for it.
Sounds great! I just wanted to make sure.
There were indeed some oversights in the fix. I tested your scenario several times with the updated version and I still don't experience any deadlock.
This is in Jazzy with rmw_zenoh f697984 (#200). The deadlock happens to a node loading a component in the bash for loop. Debugging the node:
Thread 1 wants to acquire the first lock.
Thread 4 wants to acquire the second lock and has already acquired the first one.
Thread 1 has already acquired the second one.

rmw_zenoh: f697984 (#200)
This is the second scenario in Jazzy with rmw_zenoh f697984 (#200) where a node loading a component in the bash for loop hangs. It gets stuck in `rmw_wait()`.
rcl: 00c29cd6e72caae36e1e749a8c02558ee709a275
I've also tested the scenario several times with #200 + Rolling and I can't reproduce the problem anymore. It could be that something changed between Rolling and Jazzy too? I'm looking into that now.

@JEnoch could we merge the relevant upstream zenoh PRs and update the hash in #200 accordingly so that we can get that in?
I have not yet retested in Iron. In Jazzy, I am still experiencing deadlocks with the graph mutex and a logging mutex, similar to #182 (comment). It seems to be easier to trigger with the variation below. @Yadunund or @clalancette, could you please retest this with the following?

Terminal 1:
Terminal 2:
Terminal 3 (now using the variation mentioned above):
Debugging session:
Thread 1 in frame 7 wants to acquire one of the two mutexes and has already acquired the other. Thread 2 in frame 8 wants to acquire the mutex thread 1 holds and has already acquired the one thread 1 is waiting for.
rmw_zenoh: d3396a5
Ah, I am able to reproduce it on subsequent tries. However, if I add a sleep in between, it succeeds each time. Nonetheless, there is still a chance of deadlock in `graph_cache`. Investigating...
In general, I now run the parallel version to reproduce the graph lock issue and the non-parallel version to reproduce the rmw_wait issue. I am still experiencing a hang on rmw_wait from a node in the for loop, even with #203. With #203:
The node is waiting in frame 5: rmw_zenoh_cpp/src/rmw_zenoh.cpp, line 3277, at 45391cc.
I have not verified whether the appropriate events are being correctly sent or received in the first place.

rmw_zenoh: 45391cc
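As general background on why a blocked wait can hang indefinitely, here is a generic, hypothetical sketch (not the actual `rmw_wait` implementation): if the event fires before the waiter blocks and the wait is not guarded by a predicate, the wake-up is lost and the wait never returns.

```cpp
// Hypothetical lost-wakeup sketch, not rmw_zenoh code. The notifier runs
// before the waiter blocks; because the waiter has no predicate to observe
// that the event already happened, it waits forever.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool event_arrived = false;  // the predicate the buggy waiter ignores

int main() {
  // The "event" (e.g. new data or a graph change) arrives first.
  {
    std::lock_guard<std::mutex> lk(m);
    event_arrived = true;
  }
  cv.notify_all();  // nobody is waiting yet, so this wake-up is lost

  std::this_thread::sleep_for(std::chrono::milliseconds(50));

  std::unique_lock<std::mutex> lk(m);
  std::cout << "waiting...\n";
  cv.wait(lk);  // BUG: no predicate, so this hangs (barring a spurious wake-up)
  // cv.wait(lk, [] { return event_arrived; });  // correct form returns at once
  std::cout << "woke up, event_arrived=" << event_arrived << "\n";  // never printed
}
```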
Thanks @hwoithe. Will keep digging...
@clalancette and I took a closer look at the stack trace from #182 (comment) and discovered that it is indeed an ABBA deadlock resulting from the two threads trying to acquire both the graph mutex and the logging mutex. The problem arises from code that logs while the graph mutex is held, since logging takes the client library's logging mutex.

I'm working on a fix where we rely on our own logging functions in rmw_zenoh_cpp instead.

The deadlock in …
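For anyone following along, the ABBA pattern described here boils down to the following hypothetical C++ sketch (placeholder names, not the actual rmw_zenoh members): each thread takes one of the two mutexes and then waits forever for the one the other thread holds. The program intentionally deadlocks.

```cpp
// Hypothetical ABBA deadlock sketch, not rmw_zenoh source. The program
// intentionally deadlocks: the two threads acquire the same two mutexes
// in opposite orders.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex graph_mutex;    // stands in for the graph-cache mutex
std::mutex logging_mutex;  // stands in for the client library's logging mutex

int main() {
  std::thread a([] {
    std::lock_guard<std::mutex> g(graph_mutex);                   // A: locks graph
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::cout << "thread A: graph held, now logging\n";
    std::lock_guard<std::mutex> l(logging_mutex);                 // A: waits on logging
  });

  std::thread b([] {
    std::lock_guard<std::mutex> l(logging_mutex);                 // B: locks logging
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::cout << "thread B: logging held, now updating graph\n";
    std::lock_guard<std::mutex> g(graph_mutex);                   // B: waits on graph
  });

  a.join();  // never returns: classic ABBA deadlock
  b.join();
}
```

The usual ways out are to enforce a single lock order everywhere, or, as proposed here, to make sure the code path holding one mutex never needs the other, for example by not routing logging through a handler that takes the second mutex.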
Would you suggest that all RMW implementations migrate to their own logging mechanism to avoid potential logger-related deadlocks? rmw_fastrtps has encountered this as well. I can see the value of an RMW-safe replacement for (or fix to) the existing logging macros.
As it stands, that is probably not feasible. The way the RCUTILS macros work is with a global logging callback. Even if we did have some way to switch back and forth between the "console-only" and the "client-library" installed callback (which we currently don't), I don't see a good way we could do it in a thread-safe manner.

That said, we could add a new set of console-only logging APIs to rcutils.

For now, I think the easiest thing to do (and backport) is to fix this locally in rmw_zenoh_cpp.
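A minimal sketch of what a local, console-only logger could look like, assuming the fix simply bypasses the installed logging handler (hypothetical names; this is not the actual rmw_zenoh_cpp or rcutils API):

```cpp
// Hypothetical sketch of a "console-only" logger local to the RMW layer,
// not the actual rmw_zenoh fix. It writes straight to stderr and guards
// only its own mutex, so it can never participate in a lock cycle with the
// client library's logging output handler.
#include <cstdarg>
#include <cstdio>
#include <mutex>

namespace rmw_local_log {  // hypothetical namespace

inline std::mutex & console_mutex() {
  static std::mutex m;  // serializes console writes from this RMW only
  return m;
}

inline void log(const char * severity, const char * fmt, ...) {
  std::lock_guard<std::mutex> lk(console_mutex());
  std::va_list args;
  va_start(args, fmt);
  std::fprintf(stderr, "[%s] [rmw_zenoh_cpp]: ", severity);
  std::vfprintf(stderr, fmt, args);
  std::fputc('\n', stderr);
  va_end(args);
}

}  // namespace rmw_local_log

int main() {
  // Usage inside RMW code paths that may hold the graph mutex:
  rmw_local_log::log("WARN", "liveliness token for node '%s' expired", "talker");
}
```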
I was thinking this as well; in the long term it would be nice to have. Thanks to all for the help on this issue.
I believe that the basic issue here was fixed by #223. With that, I'm going to close this out. If you are still having problems, please feel free to reopen or open a new issue.
I am experiencing a hang while loading many components with rmw_zenoh.
Terminal 1:
Terminal 2:
Terminal 3:
I would expect the for loop above to eventually finish but it does not.
Let's attach to the process and investigate:
We can see from frame 5 in thread 1 that it wants to acquire the `ctrl_lock`:
https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/dispatcher/face.rs#L189
Another thread must be holding that lock:
Frame 24 on thread 9 indicates the thread should have acquired the `ctrl_lock` earlier in the function if it is calling `undeclare_subscription()`:
https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/dispatcher/face.rs#L208
This means that thread 9 is holding the `ctrl_lock` that thread 1 is waiting on. Thread 9 is also waiting: thread 9 is waiting on some resource held by thread 1.
Frame 9 on thread 9 wants to acquire `logging_mutex_` in `rclpy_thread_safe_logging_output_handler()`:
https://github.com/ros2/rclpy/blob/52f2a69325721e1368ac3176adcc348e1f53f93e/rclpy/src/rclpy/logging.cpp#L51
Frame 15 on thread 1 should have acquired `logging_mutex_` right before `rcl_node_init()`:
https://github.com/ros2/rclpy/blob/52f2a69325721e1368ac3176adcc348e1f53f93e/rclpy/src/rclpy/node.cpp#L464
This means that thread 1 is holding the `logging_mutex_` that thread 9 is waiting on.

Hopefully I have read the situation correctly here and, at the very least, others are able to reproduce the problem. I will continue this investigation after the US holiday weekend.
ROS Distribution: Iron
OS: Ubuntu Jammy
rmw_zenohd: commit 1ac6bd2.
rust: rustc 1.78.0 (9b00956e5 2024-04-29)