ROS2 QoS Reliability Issue

tl;dr: using tools such as RViz over WiFi can cause a robot to stall, even when all topics are subscribed to with the "best_effort" policy, which in practice seems to be functionally equivalent to the "reliable" policy.

Introduction

Mobile robots are very popular in the ROS community, and they necessitate wireless communication for real-time monitoring and debugging. Tools such as RViz are commonly used over a WiFi connection to visualize robot sensor data, path-planning routes, etc. Unlike fixed robots on a wired connection, where bandwidth and latency are fairly constant, the bandwidth and latency of a WiFi link between a mobile robot and a developer's workstation can vary considerably, and often unpredictably, as a function of their relative positions. This shouldn't be an issue, as the developer's view of the robot's perspective on the world is supplemental to the robot's operation, not critical to it. This is what the ROS2 topic reliability policies were designed for: a "reliable" subscription policy for mission-critical connections, possibly internal to the robot, and a "best_effort" subscription policy for non-critical monitoring and debugging.

ROS2 used over WiFi and Cellular for monitoring

The ROS2 documentation on QoS policies states that "ROS 2 can be as reliable as TCP or as best-effort as UDP".

As demonstrated below, however, there is seemingly no functional difference between the two policies. Both behave as a reliable subscription, meaning that the publisher will throttle its publishing rate if the subscriber can't keep up, which is often the case for high-bandwidth topics over a wireless connection.

This means that for mobile robots, where engineers use tools such as RViz to monitor topics over a WiFi link, a degraded link also degrades the internal communication on the robot, even when that communication uses shared memory. As such, the behaviour of the robot as a whole under test, and in the field, could be drastically different.

For us, this often causes our robot to grind to a halt when we're debugging its navigation using RViz and we lose line of sight. This could be alleviated by using ROS2 DDS only for internal communication on the host, with a ROS bridge to external devices, but then the benefits of using DDS are lost.

Better WiFi coverage and higher-bandwidth APs could also alleviate the issue somewhat, but they still offer no guarantees, and in principle link quality should not have an effect at all - such non-determinism is a non-starter in many industrial applications.

Reproduction

For convenience, the reproducible demo uses containers to simulate two remote hosts, but I've also reproduced this issue outside of containers between two real remote hosts. Both the FastDDS and the Cyclone DDS Iceoryx shared-memory implementations are also affected by slow remote subscribers.

Prerequisites

Webcam at /dev/video0 and Podman installed.

Download and build

git clone https://github.com/ciandonovan/ros2_qos_reliability.git
cd ros2_qos_reliability
podman build -t $(basename $PWD) .

Run

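Choose the RMW implementation to test and export it in each terminal used below, since the podman commands pass the variable through from the host environment:

export RMW_IMPLEMENTATION=rmw_fastrtps_cpp

or

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp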

Simulate a constrained WiFi network on the loopback interface.

sudo tc qdisc add dev lo root tbf rate 10mbit burst 10mbit latency 50ms

N.B. use sudo iftop -i lo to monitor loopback traffic in real-time.
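
For a sense of scale, the limit above can be compared with the bandwidth of an uncompressed image stream. The sketch below is plain shell arithmetic, assuming a 640x480 stream at 3 bytes per pixel and 30 frames per second. These figures are illustrative assumptions rather than measurements from this demo, and the function names are made up for the example.

```shell
# Bandwidth of an uncompressed image stream in Mbit/s, rounded down.
# Usage: image_mbit WIDTH HEIGHT BYTES_PER_PIXEL FPS
image_mbit() {
    echo $(( $1 * $2 * $3 * 8 * $4 / 1000000 ))
}

# Whole frames per second that fit through a link of RATE Mbit/s.
# Usage: max_fps RATE_MBIT WIDTH HEIGHT BYTES_PER_PIXEL
max_fps() {
    echo $(( $1 * 1000000 / ($2 * $3 * $4 * 8) ))
}

# Assumed stream: 640x480, 3 bytes per pixel, 30 fps (illustrative values only).
echo "stream needs about $(image_mbit 640 480 3 30) Mbit/s"
echo "a 10 Mbit/s link carries about $(max_fps 10 640 480 3) frame(s) per second"
```

Under these assumptions the stream needs roughly 221 Mbit/s, while the shaped loopback carries about one frame per second before any protocol overhead, so a subscriber on the far side of the link cannot keep up with the publisher.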

Run ROS2 camera publisher in one terminal.

podman run --rm -it --name ros2_publisher --net=host --device=/dev/video0:/dev/video0:rw --env RMW_IMPLEMENTATION ros2_qos_reliability:latest ros2 run v4l2_camera v4l2_camera_node --ros-args --log-level debug

Run ROS2 camera subscriber in another.

podman run --rm -it --name ros2_subscriber --net=host --env RMW_IMPLEMENTATION ros2_qos_reliability:latest ros2 topic echo --qos-reliability best_effort /image_raw

Confirm the publisher and subscriber reliability settings.

podman exec -it --env RMW_IMPLEMENTATION ros2_publisher /ros_entrypoint.sh ros2 topic info --verbose /image_raw
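
The verbose output is fairly long, so it can help to reduce it to the lines of interest. This is a minimal sketch, assuming the output labels each endpoint with "Node name", "Endpoint type" and "Reliability" fields; the exact wording may differ between ROS2 releases, and qos_summary is a hypothetical helper name.

```shell
# Keep only the endpoint identity and reliability lines from the
# `ros2 topic info --verbose` output read on stdin.
# The field names matched here are an assumption about that output format.
qos_summary() {
    grep -E 'Node name|Endpoint type|Reliability'
}

# Example (no -t flag, so the output can be piped cleanly):
# podman exec -i --env RMW_IMPLEMENTATION ros2_publisher /ros_entrypoint.sh \
#     ros2 topic info --verbose /image_raw | qos_summary
```

If the filter prints nothing, the field names have probably changed, and the unfiltered command above remains the reference.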

Delete tc qdisc on the loopback when finished.

sudo tc qdisc del dev lo root
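
Because a leftover qdisc keeps throttling all loopback traffic, it can be convenient to wrap the add and delete steps around a command so the limit is removed even if that command fails. This is a minimal sketch for sh or bash using the same tc invocations as above. with_shaped_lo and the TC variable are hypothetical names introduced here, and TC exists only so the tc command can be swapped out, for example with echo for a dry run.

```shell
# Command used to invoke tc; override with TC=echo for a dry run.
TC="${TC:-sudo tc}"

# Run a command with the loopback interface rate-limited, then remove the limit.
# Usage: with_shaped_lo RATE COMMAND [ARGS...]
with_shaped_lo() {
    shaped_rate=$1
    shift
    # Same token bucket filter as in the manual steps, with a configurable rate.
    $TC qdisc add dev lo root tbf rate "$shaped_rate" burst "$shaped_rate" latency 50ms || return 1
    "$@"
    shaped_status=$?
    # Remove the qdisc whatever the command returned.
    $TC qdisc del dev lo root
    return "$shaped_status"
}

# Example: with_shaped_lo 10mbit sleep 60
```

The sketch does not handle interruption by signals such as Ctrl-C; if the wrapped command is interrupted, check with tc qdisc show dev lo and delete the qdisc manually.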

Comments

The "reliable" subscription behaves as expected: the publisher throttles its publishing rate to the point where all subscribers can receive all published messages given the available bandwidth. This can be seen by observing the publisher's debug messages.

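However, this is also the case for the "best_effort" subscription, where the publisher would be expected to keep its rate and leave the slow subscriber to miss messages.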