
Timed out while waiting for action server to acknowledge goal request for compute_path_through_poses (or follow_path) #4278

Closed
DylanDeCoeyer-Quimesis opened this issue Apr 25, 2024 · 5 comments

Comments

@DylanDeCoeyer-Quimesis
Contributor

DylanDeCoeyer-Quimesis commented Apr 25, 2024

Bug report

Required Info:

  • Operating System:
    • Custom Linux image based on Linux for Tegra (L4T) R32.7.4
  • ROS2 Version:
    • Iron
  • Version or commit hash:
    • 2024-02-09
  • DDS implementation:
    • Fast-RTPS

Steps to reproduce issue

The issue occurs randomly when the bt_navigator sends a goal to the compute_path_through_poses action provided by the planner_server node. The BT node used to send the request is ComputePathThroughPosesAction. In my case, the ComputePathThroughPoses action request contains a list of 3 goals.
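For illustration, here is a minimal sketch of the kind of request the ComputePathThroughPosesAction BT node sends, written as a standalone rclcpp_action client. This is not the BT node's actual code; the action name, the "map" frame, the three poses (taken from the logs below), and the Iron message layout with a goals array of PoseStamped are assumptions.

#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
#include <nav2_msgs/action/compute_path_through_poses.hpp>

using ComputePathThroughPoses = nav2_msgs::action::ComputePathThroughPoses;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("path_request_demo");
  auto client = rclcpp_action::create_client<ComputePathThroughPoses>(
    node, "compute_path_through_poses");
  client->wait_for_action_server();

  // A request with a list of 3 goals, as in the report (poses taken from the logs).
  ComputePathThroughPoses::Goal goal;
  goal.goals.resize(3);
  goal.goals[0].pose.position.x = 4.99; goal.goals[0].pose.position.y = 3.79;
  goal.goals[1].pose.position.x = 3.01; goal.goals[1].pose.position.y = 3.79;
  goal.goals[2].pose.position.x = 1.01; goal.goals[2].pose.position.y = 5.79;
  for (auto & pose : goal.goals) {
    pose.header.frame_id = "map";  // frame assumed for this sketch
  }

  // The server should acknowledge this request quickly; waiting for that
  // acknowledgment is the step that times out in this report.
  auto goal_handle_future = client->async_send_goal(goal);
  rclcpp::spin_until_future_complete(node, goal_handle_future);

  rclcpp::shutdown();
  return 0;
}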

Expected behavior

The planner_server receives and acknowledges the request, computes the path, then returns the result to the bt_navigator.

Actual behavior

The planner_server receives the request and computes the path, but the bt_navigator does not seem to receive the acknowledgement and ends up triggering the timeout. Here are the logs, with some additional information.

Apr 25 13:22:17.620047 [bt_navigator-12] [INFO]: Path is obsolete, computing a new one
Apr 25 13:22:17.624897 [planner_server-9] [INFO]: Attempting to find a path from (8.61, 2.92) to (4.99, 3.79).
Apr 25 13:22:17.625304 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (8.612866, 2.919776), in costmap (96, 39)
Apr 25 13:22:17.625636 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (4.992264, 3.793334), in costmap (59, 47)
Apr 25 13:22:17.625954 [planner_server-9] [INFO]: Attempting to find a path from (4.99, 3.79) to (3.01, 3.79).
Apr 25 13:22:17.626311 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (4.992264, 3.793334), in costmap (59, 47)
Apr 25 13:22:17.626693 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (3.015000, 3.793334), in costmap (40, 47)
Apr 25 13:22:17.627004 [planner_server-9] [INFO]: Attempting to find a path from (3.01, 3.79) to (1.01, 5.79).
Apr 25 13:22:17.627244 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (3.015000, 3.793334), in costmap (40, 47)
Apr 25 13:22:17.628079 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (1.015000, 5.793334), in costmap (20, 67)
Apr 25 13:22:17.669789 [bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 59 ms (timeout is 50 ms)
Apr 25 13:22:17.670153 [bt_navigator-12] [WARN]: Time required to tick the tree is 58 ms
Apr 25 13:22:17.702070 [bt_navigator-12] [WARN]: Timed out while waiting for action server to acknowledge goal request for compute_path_through_poses
Apr 25 13:22:17.707512 [bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 50 ms (timeout is 50 ms)

Additional information

I noticed that the issue also happens with the follow_path action, but it never happens when the bt_navigator triggers my custom actions.

Relevant bt_navigator configuration:

bt_navigator:
  ros__parameters:
    default_server_timeout: 100
    bt_loop_duration: 50

When the issue occurs, the CPU load is not especially high (~50%).
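For context, default_server_timeout is the time in milliseconds a BT action node waits for the action server to acknowledge a new goal, and bt_loop_duration is the tree tick period in milliseconds. The fragment below is a simplified sketch of that acknowledgment wait (not the actual nav2_behavior_tree code), showing where the 100 ms applies; client, node, and goal are assumed to be set up as in the earlier sketch.

// Simplified sketch of the goal-acknowledgment wait (not the actual BtActionNode code).
auto goal_handle_future = client->async_send_goal(goal);
auto rc = rclcpp::spin_until_future_complete(
  node, goal_handle_future,
  std::chrono::milliseconds(100));  // default_server_timeout
if (rc != rclcpp::FutureReturnCode::SUCCESS) {
  // This is the reported situation: the planner_server did process the goal,
  // but its acknowledgment did not reach the bt_navigator within the timeout.
  RCLCPP_WARN(node->get_logger(),
    "Timed out while waiting for action server to acknowledge goal request");
}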

@SteveMacenski
Member

I don't have a good answer for you about this. I have not seen it, and it isn't a known cause for any of the flakiness in our system testing (which calls these servers regularly).

I'm curious whether it's related to your BT tick rate; you're running it at 20 Hz, so maybe it's too slow, or you need to increase the timeout? Have you tried this with other DDS vendors like Cyclone and still seen it occurring?
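(For reference, the active RMW can typically be selected by setting the RMW_IMPLEMENTATION environment variable before launching, e.g. RMW_IMPLEMENTATION=rmw_cyclonedds_cpp for Cyclone DDS, provided the corresponding RMW package is installed.)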

@DylanDeCoeyer-Quimesis
Contributor Author

@SteveMacenski thanks for your answer. I forgot to mention that we use a discovery server.

I tried to switch to CycloneDDS, but encountered the following error: ros2/rmw_cyclonedds#458. The proposed solution does not work for me at the moment. I'll keep investigating on Monday.

@DylanDeCoeyer-Quimesis
Contributor Author

Hi, I finally managed to fix the "Failed to find a free participant index for domain 0" error by using the proposed solution.

Our robot has been running with CycloneDDS for the past 2 days, and the action acknowledgment timeout error has not occurred once. This is very encouraging since it used to occur a few times per day with Fast-DDS.

@SteveMacenski Regarding your remark about the BT tick rate, why would it be too slow? I don't see a direct link with the observed timeout. Also, the chosen value of 100 ms for the timeout already seems high to me; a normal action acknowledgement is usually received within a few milliseconds. I don't think increasing it will help.

@SteveMacenski
Member

Interesting. You may want to file a ticket with the Fast DDS RMW (https://github.com/ros2/rmw_fastrtps) explaining that its current behavior has caused you an issue, and link back here. I'm sure they're going to ask you to come up with a more minimal test they can reproduce, so they can dig into it when they have the cycles.

I didn't know if one, the other, both, or neither was the cause of the issue; I was giving you some options of things to test to see what made a behavioral change that I could dig into :-) It seems that the DDS or RMW is a likely cause, so it's worth making folks over there aware of this in case it strikes a chord.

You can safely ignore the tick rate stuff! I think that largely closes this ticket out for Nav2; not much we can do on the RMW front over here, unfortunately.

@alanxuefei
Contributor

alanxuefei commented Nov 24, 2024

We encountered a similar issue, which may be related to the real-time capability or configuration of DDS. The solution is still under investigation. Increasing the timeout (e.g., to 5 seconds) may not solve the issue, as it could still occur, and a 5-second delay may be too long for certain actions (e.g., the controller).

rclcpp/rclcpp_action/src/server.cpp

void
ServerBase::execute_goal_request_received()
{
  ...
  auto response_pair = call_handle_goal_callback(uuid, message);  
  {
    std::lock_guard<std::recursive_mutex> lock(pimpl_->action_server_reentrant_mutex_);
    /**********************
     If the client receives the acknowledgment after a timeout, the client will assume the request has failed. 
     However, the server will still proceed with executing call_goal_accepted_callback, which initiates the worker thread to process the goal.
    **********************/
    ret = rcl_action_send_goal_response(  
      pimpl_->action_server_.get(),
      &request_header,
      response_pair.second.get());
  }

  ...
  if (GoalResponse::ACCEPT_AND_EXECUTE == status || GoalResponse::ACCEPT_AND_DEFER == status) {
    
    ...
    call_goal_accepted_callback(handle, uuid, message);
  }
}
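To make the race above concrete, the following is a minimal, hypothetical demo server (not Nav2 or rclcpp internals; the action name and the artificial delay are assumptions). Because the goal response is only sent after the goal callback returns, any delay longer than the client's acknowledgment timeout means the client gives up while the server still accepts and executes the goal.

// Hypothetical demo server: a handle_goal that takes longer than the
// client's acknowledgment timeout reproduces the described race.
#include <chrono>
#include <memory>
#include <thread>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
#include <nav2_msgs/action/compute_path_through_poses.hpp>

using Action = nav2_msgs::action::ComputePathThroughPoses;
using GoalHandle = rclcpp_action::ServerGoalHandle<Action>;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("slow_ack_server");

  auto server = rclcpp_action::create_server<Action>(
    node, "compute_path_through_poses",
    // handle_goal: the goal response (acknowledgment) is sent only after this returns.
    [](const rclcpp_action::GoalUUID &, std::shared_ptr<const Action::Goal>) {
      std::this_thread::sleep_for(std::chrono::milliseconds(200));  // longer than the client's 100 ms
      return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE;
    },
    // handle_cancel
    [](std::shared_ptr<GoalHandle>) {return rclcpp_action::CancelResponse::ACCEPT;},
    // handle_accepted: still runs even though the client has already timed out.
    [node](std::shared_ptr<GoalHandle> handle) {
      RCLCPP_INFO(node->get_logger(), "Executing a goal the client has already given up on");
      handle->succeed(std::make_shared<Action::Result>());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}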

For comparison, the rclcpp service implementation also runs the user callback (which may spawn a thread) before sending the response, so it may encounter a similar issue.

rclcpp/rclcpp/include/rclcpp/service.hpp

  void
  handle_request(
    std::shared_ptr<rmw_request_id_t> request_header,
    std::shared_ptr<void> request) override
  {
    auto typed_request = std::static_pointer_cast<typename ServiceT::Request>(request);
    auto response = any_callback_.dispatch(this->shared_from_this(), request_header, typed_request);
    if (response) {
      send_response(*request_header, *response);
    }
  }
