ros::Timer & ros::ServiceServer finalization failure #2283

aurzenligl · 2022-09-28T09:53:36Z

Delivering #2121 broke existing ros::Timer and ros::ServiceServer usages, introducing possibility of using destroyed objects by last callback call. Affected usages in public projects look like this:
https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/include/moveit_servo/collision_check.h
https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/src/collision_check.cpp
One production system I'm aware of was affected as well.

Before #2121 stop/shutdown post-condition ensured callback is not being executed, now it doesn't, allowing destruction of referred-to data and undefined behavior when last callback call continues. Following example illustrates such crash scenario:
https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

When ros::Timer or ros::ServiceServer is created with a callback capturing data by reference, there is no easy way to guarantee that lifetime of callback execution doesn't exceed the lifetime of data it uses. Problem occurs in code bases where objects composing timers or srv-servers (capturing this) are created/destroyed repeatedly while spinners are running.

Phrasing that documentation uses to describe stop/shutdown/dtor behavior is as such:

ros::Timer "(...) no more callbacks will be called"
ros::ServiceServer "(...) service callback will stop being called"
ros::Subscriber "(...) will stop being called"
ros::NodeHandle::createTimer "(...) callback will no longer be called"

It's imprecise wrt waiting for currently executing callback, but since all above classes did block in stop/shutdown before #2121, and ros::Subscriber still blocks, I assume proper/expected/consistent behavior is to block.

The text was updated successfully, but these errors were encountered:

Making sure currently executing callback finishes before removing thread returns from removeByID allows to avoid race condition in a case when e.g. ros::Timer held in a class and capturing `this` is stopped just before destruction of the class and its data members: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

Rationale for the testcase was the following deadlock scenario: https://gist.github.com/iwanders/ede48fb649fd47f9b1f9a52c527b463c Changed testcase presents how the same scenario can be carried out with blocking removeByID (with exception for self-removal). The external mutex from the scenario must be unlocked for the call of ros::Timer::stop, otherwise scenario stays as it is. External thread returns from removeByID once cb call finishes, spinner thread returns immediately.

iwanders · 2022-09-29T15:14:19Z

Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

Yikes, that's definitely not good. Thanks for a minimal example to reproduce this failure.

I assume proper/expected/consistent behavior is to block.
&
doesn't deadlock as long as timer stop is called w/o external mtx locked (from #2284)

I'm not sure I agree with this behaviour, nothing in the documentation states that cancelling the timer will only happen after any current callbacks complete their execution. In my opinion cancelling a timer should be non blocking in all situations. But I think this is a matter of opinion, and different people may have different assumptions around this.

If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play. Requiring the user to unlock its mutex to allow any callback to complete before being able to cancel the timer (as proposed there) may not be possible to do without introducing any thread safety issues on the user side. 🤔

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

fyi @guillaumeautran , @mikepurvis , @jasonimercer for awareness. The timer deadlock fix from #2121 (CORE-18232) introduces memory unsafety in some situations.

aurzenligl · 2022-09-29T15:33:08Z

If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play.

True, external mutex use-case can coexist with the old blocking stop behavior, assuming that:

ros-comm: does not block removeByID if we're self-removing (allowing timer's cb to stop itself seems a valid use case)
user: unlocks external mutex just for the time of timer.stop call
user: in function with ASB structure (where: A: above stop, S: timer.stop, B: below stop), AASBB must be legal
- that might be troublesome

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

It would be great to have the old blocking behavior back and defined in documentation. 👍 Still, self-removal might be a worthwhile addition, previously self-removal would deadlock, so couldn't have been relied on.

aurzenligl added a commit to aurzenligl/ros_comm that referenced this issue Sep 28, 2022

Remove trailing whitespace (ros#2283)

35aab20

aurzenligl added a commit to aurzenligl/ros_comm that referenced this issue Sep 28, 2022

New testcase checks removeByID blocks until cb call finishes (ros#2283)

6f31e62

aurzenligl mentioned this issue Sep 28, 2022

Fix for timer/srv-server finalization issue 2283 #2284

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ros::Timer & ros::ServiceServer finalization failure #2283

ros::Timer & ros::ServiceServer finalization failure #2283

aurzenligl commented Sep 28, 2022 •

edited

Loading

iwanders commented Sep 29, 2022

aurzenligl commented Sep 29, 2022 •

edited

Loading

ros::Timer & ros::ServiceServer finalization failure #2283

ros::Timer & ros::ServiceServer finalization failure #2283

Comments

aurzenligl commented Sep 28, 2022 • edited Loading

iwanders commented Sep 29, 2022

aurzenligl commented Sep 29, 2022 • edited Loading

aurzenligl commented Sep 28, 2022 •

edited

Loading

aurzenligl commented Sep 29, 2022 •

edited

Loading