Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ros::Timer & ros::ServiceServer finalization failure #2283

Open
aurzenligl opened this issue Sep 28, 2022 · 2 comments
Open

ros::Timer & ros::ServiceServer finalization failure #2283

aurzenligl opened this issue Sep 28, 2022 · 2 comments

Comments

@aurzenligl
Copy link

aurzenligl commented Sep 28, 2022

Delivering #2121 broke existing ros::Timer and ros::ServiceServer usages, introducing possibility of using destroyed objects by last callback call. Affected usages in public projects look like this:
https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/include/moveit_servo/collision_check.h
https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/src/collision_check.cpp
One production system I'm aware of was affected as well.

Before #2121 stop/shutdown post-condition ensured callback is not being executed, now it doesn't, allowing destruction of referred-to data and undefined behavior when last callback call continues. Following example illustrates such crash scenario:
https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

When ros::Timer or ros::ServiceServer is created with a callback capturing data by reference, there is no easy way to guarantee that lifetime of callback execution doesn't exceed the lifetime of data it uses. Problem occurs in code bases where objects composing timers or srv-servers (capturing this) are created/destroyed repeatedly while spinners are running.

Phrasing that documentation uses to describe stop/shutdown/dtor behavior is as such:

It's imprecise wrt waiting for currently executing callback, but since all above classes did block in stop/shutdown before #2121, and ros::Subscriber still blocks, I assume proper/expected/consistent behavior is to block.

aurzenligl added a commit to aurzenligl/ros_comm that referenced this issue Sep 28, 2022
Making sure currently executing callback finishes before removing
thread returns from removeByID allows to avoid race condition in
a case when e.g. ros::Timer held in a class and capturing `this`
is stopped just before destruction of the class and its data members:
https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp
aurzenligl added a commit to aurzenligl/ros_comm that referenced this issue Sep 28, 2022
aurzenligl added a commit to aurzenligl/ros_comm that referenced this issue Sep 28, 2022
Rationale for the testcase was the following deadlock scenario:
https://gist.github.com/iwanders/ede48fb649fd47f9b1f9a52c527b463c

Changed testcase presents how the same scenario can be carried
out with blocking removeByID (with exception for self-removal).
The external mutex from the scenario must be unlocked for the
call of ros::Timer::stop, otherwise scenario stays as it is.
External thread returns from removeByID once cb call finishes,
spinner thread returns immediately.
@iwanders
Copy link
Contributor

Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

Yikes, that's definitely not good. Thanks for a minimal example to reproduce this failure.

I assume proper/expected/consistent behavior is to block.
&
doesn't deadlock as long as timer stop is called w/o external mtx locked (from #2284)

I'm not sure I agree with this behaviour, nothing in the documentation states that cancelling the timer will only happen after any current callbacks complete their execution. In my opinion cancelling a timer should be non blocking in all situations. But I think this is a matter of opinion, and different people may have different assumptions around this.

If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play. Requiring the user to unlock its mutex to allow any callback to complete before being able to cancel the timer (as proposed there) may not be possible to do without introducing any thread safety issues on the user side. 🤔

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

fyi @guillaumeautran , @mikepurvis , @jasonimercer for awareness. The timer deadlock fix from #2121 (CORE-18232) introduces memory unsafety in some situations.

@aurzenligl
Copy link
Author

aurzenligl commented Sep 29, 2022

If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play.

True, external mutex use-case can coexist with the old blocking stop behavior, assuming that:

  • ros-comm: does not block removeByID if we're self-removing (allowing timer's cb to stop itself seems a valid use case)
  • user: unlocks external mutex just for the time of timer.stop call
  • user: in function with ASB structure (where: A: above stop, S: timer.stop, B: below stop), AASBB must be legal
    • that might be troublesome

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

It would be great to have the old blocking behavior back and defined in documentation. 👍 Still, self-removal might be a worthwhile addition, previously self-removal would deadlock, so couldn't have been relied on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants