nav2 test analysis #38

Open
evshary opened this issue Nov 5, 2024 · 21 comments

Comments

@evshary

evshary commented Nov 5, 2024

| Package | # Failing tests (Rolling) | # Failing tests (dev/1.0.0) | IDs of failing Zenoh tests: (Rolling) - (dev/1.0.0) |
|---|---|---|---|
| costmap_queue | 0 | 0 | () - () |
| dwb_core | 1 | 0 | (6) - (0) |
| dwb_critics | 0 | 0 | () - () |
| dwb_plugins | 0 | 0 | () - () |
| nav_2d_utils | 0 | 0 | () - () |
| nav2_amcl | 0 | 0 | () - () |
| nav2_behavior_tree | 4 | 11 | (7,17,18,28) - (17,18,27,34,38,39,40,44,45,46,47) |
| nav2_behaviors | 0 | 0 | () - () |
| nav2_bringup | 0 | 0 | () - () |
| nav2_bt_navigator | 0 | 0 | () - () |
| nav2_collision_monitor | 1 | 3 | (7) - (7,11,12) |
| nav2_constrained_smoother | 0 | 0 | () - () |
| nav2_controller | 3 | 1 | (6,7,8) - (6) |
| nav2_core | 0 | 0 | () - () |
| nav2_costmap_2d | 14 | 14 | (8,9,10,15,17,19,20,21,22,23,24,25,26,28) - (8,9,10,15,17,18,19,20,21,22,23,24,25,26) |
| nav2_graceful_controller | 1 | 0 | (6) - () |
| nav2_lifecycle_manager | 0 | 1 | () - (10) |
| nav2_loopback_sim | 0 | 0 | () - () |
| nav2_map_server | 5 | 4 | (8,9,10,11,12) - (8,9,10,11) |
| nav2_mppi_controller | 12 | 12 | (6,7,8,9,10,11,12,13,14,15,16,17) - (6,7,8,9,10,11,12,13,14,15,16,17) |
| nav2_navfn_planner | 1 | 1 | (7) - (7) |
| nav2_planner | 1 | 1 | (7) - (7) |
| nav2_regulated_pure_pursuit_controller | 1 | 1 | (6) - (6) |
| nav2_rotation_shim_controller | 1 | 1 | (6) - (6) |
| nav2_rviz_plugins | 0 | 0 | () - () |
| nav2_simple_commander | 0 | 0 | () - () |
| nav2_smac_planner | 11 | 11 | (10,11,12,13,14,15,16,17,18,19,20) - (10,11,12,13,14,15,16,17,18,19,20) |
| nav2_smoother | 2 | 2 | (7,8) - (7,8) |
| nav2_system_tests | 13 | 16 | (12,14,15,17,18,19,20,21,22,23,29,30,31) - (9,12,14,15,18,19,20,21,22,23,29,30,31) |
| nav2_theta_star_planner | 1 | 1 | (1) - (1) |
| nav2_util | 6 | 5 | (7,10,13,15,16,17) - (7,10,13,15,17) |
| nav2_velocity_smoother | 1 | 1 | (6) - (6) |
| nav2_voxel_grid | 0 | 0 | () - () |
| nav2_waypoint_follower | 2 | 2 | (7,8) - (7,8) |
| opennav_docking | 7 | 7 | (9,10,11,12,13,14,16) - (9,10,11,12,13,14,16) |
| opennav_docking_bt | 0 | 2 | () - (7,8) |
| opennav_docking_core | 0 | 0 | () - () |
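
The ID column can be compared programmatically to see which failures are specific to one branch. Below is a minimal sketch, assuming the first parenthesized group is Rolling and the second is dev/1.0.0 (which is what the counts suggest). The function names are hypothetical and only three rows are copied in.

```python
import re

def parse_ids(cell):
    """Split '(a,b) - (c,d)' into two sets of ints: (Rolling, dev/1.0.0)."""
    groups = re.findall(r"\(([^)]*)\)", cell)
    if len(groups) != 2:
        raise ValueError(f"expected two parenthesized groups: {cell!r}")
    return tuple({int(x) for x in g.split(",") if x.strip()} for g in groups)

def compare(cell):
    """Report which test IDs fail on only one branch and which fail on both."""
    rolling, dev = parse_ids(cell)
    return {
        "only_rolling": sorted(rolling - dev),
        "only_dev": sorted(dev - rolling),
        "both": sorted(rolling & dev),
    }

# Three rows copied from the table above (assumed column order: Rolling, dev/1.0.0).
rows = {
    "nav2_behavior_tree": "(7,17,18,28) - (17,18,27,34,38,39,40,44,45,46,47)",
    "nav2_collision_monitor": "(7) - (7,11,12)",
    "nav2_controller": "(6,7,8)-(6)",
}
for package, cell in rows.items():
    print(package, compare(cell))
```

IDs that show up under `only_dev` are probably the most interesting ones to look at first, since they fail only on the dev/1.0.0 branch.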
@alireza-moayyedi

alireza-moayyedi commented Nov 7, 2024

Hello @evshary , this is very helpful! May I ask what tests you are referring to?

I was trying to make a remote connection to my robot over wifi. Similarly, I also tried both ends of the 1.0.0 PR (ros2#276) i.e., Rolling and dev/1.0.0. However, both of them are unstable. With your results I can see clearly what's going wrong.

If I skip the navigation i.e., basically running the ROS2 control for the motors and running the lidar along with its filters everything works fine. I can very smoothly visualize the robot remotely. It also properly updates the odometry if I move it with a joystick. But as soon as I try to run nav2 stack, things go wrong.

@evshary
Author

evshary commented Nov 8, 2024

Hi @alireza-moayyedi
In fact, the table just shows the unit test results in the navigation2 repository.
On our side, navigation2 works well although it does not pass all the tests. Perhaps you could describe in more detail how you run it and what issues you face.
BTW, you might need to ensure the version of nav2 you're using includes the fix here:
ros-navigation/navigation2#4725

@alireza-moayyedi

alireza-moayyedi commented Nov 8, 2024

Hi @evshary,

Well, that's a surprise to be honest. This is the exact use case that I am trying to work out:

On the robot side:

  • Connect to the wifi
  • Run a zenoh router on the robot (ros2 run rmw_zenoh_cpp rmw_zenohd)
  • Bring up the robot hardware
  • Bring up the nav2 stack (include file="$(find-pkg-share nav2_bringup)/launch/bringup_launch.py") on the robot with some launch arg overrides, such as a slightly modified param file wrt the default config of the nav2_bringup pkg (e.g., laser topics, robot footprint) and the input map; nothing special.

On a separate computer:

  • Connect to the same wifi
  • Set the connect: { endpoints: [... inside DEFAULT_RMW_ZENOH_ROUTER_CONFIG.json5 to the robot's ip
  • Run a zenoh router
  • Run rviz2
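
Before suspecting the middleware, it can help to confirm that the separate computer can open a TCP connection to the robot's router at all. Below is a minimal sketch using only the Python standard library. Port 7447 is an assumption here (it is the usual Zenoh default listening port; adjust it if the router config says otherwise), and the IP address is a placeholder.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False

if __name__ == "__main__":
    robot_ip = "192.168.1.10"  # placeholder: replace with the robot's IP
    router_port = 7447         # assumption: default Zenoh router port
    print("router reachable:", can_connect(robot_ip, router_port))
```

A `True` result only shows that the port is open; it says nothing about throughput or about whether large messages get through.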

Expected behavior:

  • The map is loaded on the separate computer's rviz.
  • Initial pose is given in the separate computer's rviz. The robot localizes and the rest of the stack such as costmaps loads properly in the map frame.
  • A goal pose is given inside the separate computer's rviz and the robot drives towards it.

Actual behavior in rolling:

  • The map is loaded on the separate computer's rviz with some delay.
  • Initial pose is given in the separate computer's rviz. The robot localizes and the rest of the stack such as costmaps loads properly in the map frame. However, the behavior is very choppy and timeouts occur continuously.
  • If I move the robot slightly with a joystick it fails to update the frames continuously (laggy and choppy behavior)
  • If I give it a goal pose, it starts driving but in a matter of seconds it fails and aborts goal as it cannot properly update tf frames

Actual behavior in dev/1.0.0:

  • The map is never loaded on the separate computer's rviz. If I echo the topic it just stays empty. I did not try all of the nav2 topics but picked a few and they were empty too.

I am certain that this is an rmw issue because if I connect the separate computer directly to the robot with an ethernet cable and use CycloneDDS with an explicit peers address list and an explicit network interface, then everything works very smoothly and I can easily initialize and control the robot remotely. Of course the downside then is that I have to follow the robot with my laptop in hand.

Regarding the release, I am using the latest apt release:

Package: ros-jazzy-nav2-bringup
Version: 1.3.2-1noble.20241015.123150

@evshary
Author

evshary commented Nov 12, 2024

Hi @alireza-moayyedi

Thank you for the detailed steps. I didn't see anything weird.
I would suggest doing some experiments (with rmw_zenoh) to narrow the issue down.

  1. Running the nav2 with simulation on the same host.
    • I believe this should run without any issues.
  2. Using Ethernet to connect your robot and computer (Just as you did with CycloneDDS).
    • See whether the issue comes from WiFi or not.

For the dev/1.0.0 version, perhaps you could share the logs with us.
I think the fix I mentioned before hasn't been included in the apt binary, but it's more related to the Rviz plugin crash, which is not the same as your description.

@alireza-moayyedi

alireza-moayyedi commented Nov 13, 2024

Hi @evshary,

As suggested, I tried to narrow it down further and here are my findings (everything run with dev/1.0.0):

  1. Running everything (nav2 + rviz) on the same host:
    • This works fine, whether simulation environment or the real hardware. It works as expected (as long as there is no remote connection interfering)
  2. Using direct Ethernet connection:
    • This also works fine, similar to CycloneDDS with Ethernet

So I guess at this point we can conclude something is going wrong with communicating over wifi. Therefore, I tried to dig deeper. First, to omit the possibility of a faulty office wifi, I set up a separate router (2.4 GHz) where only my computer and the robot connected to it. But still the same issues as I reported originally. Here are some logs that might be relevant:

  • Trying to echo /map from the remote computer:
    Screenshot from 2024-11-13 13-30-06
  • Terminal showing error messages from the host zenoh router (on the robot):
    Screenshot from 2024-11-13 14-07-43
  • Terminal showing errors coming from the remote zenoh router (on my computer)
    Screenshot from 2024-11-13 14-07-53

Next, I connected a display to the robot and I tried to see if I could run rviz simultaneously on both the robot as well as the remote computer and check if there was some difference in the behavior. On the robot I managed to get the map loading in the robot's rviz while the remote computer was still not loading it (though not so easily as I will explain later why). Surprisingly, I noticed that after giving the initial pose in the robot's rviz, amcl started to work properly and in the remote rviz I could also see the topics such as costmaps in the map frame (still no map). I drove around a bit and it seemed stable. Here is the remote rviz showing some topics in the map frame after initializing the localization in the robot's rviz:
Screenshot from 2024-11-13 10-56-05

So then I got more suspicious on the map server and started digging deeper into it. Now as I mentioned earlier, it was difficult to get the map showing in the robot's rviz when I was trying to also visualize it simultaneously in the remote's rviz. I noticed some irregular behavior when I tried to run the rviz first on the remote computer and then run the nav2 stack on the robot. For some reason, it caused the map server not to load properly:
Screenshot from 2024-11-13 11-05-48
Which kind of explained why I had to restart the launches so many times to get the simultaneous rviz loads working. Apparently the order of launching things (rviz remote -> rviz robot -> nav2 robot) was affecting the behavior.

So now in order to make it work, I need to first run nav2 on the robot, initialize the localization on the robot's rviz and only then run the rviz on the remote.

This got me wondering whether the /map topic needs some further tuning in the zenoh router's configuration to accommodate the topic's bandwidth. Or maybe this is actually related to the rviz plugin that you mentioned, in which case I should test building nav2 from source including that fix.

Sorry for the long posts, and I much appreciate your patience. Unfortunately I have not yet found anyone around me who has successfully managed to set up the Zenoh rmw in combination with nav2 for establishing a remote connection. Therefore, I have decided to dig deeper into it myself and report it directly to you here.

@evshary
Author

evshary commented Nov 14, 2024

Hi @alireza-moayyedi Thank you for the detailed description. It helps a lot. I will investigate it. Feel free to share with us if there is anything else you find.

@JEnoch

JEnoch commented Nov 14, 2024

@alireza-moayyedi you can try to tune the /map topic when using dev/1.0.0 branch via the downsampling configuration. See here a guideline: https://github.com/ZettaScaleLabs/roscon2024_workshop/blob/main/exercises/ex-7.md

If you don't know the topic type name and hash, you can replace each with * characters in the key_expr.
e.g.: key_expr: "0/map/*/*" (assuming ROS_DOMAIN_ID=0 and no namespace is set).
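
The wildcard pattern above can be produced by a tiny helper. This is only a sketch of the string layout described in this comment (domain ID, topic name, then `*` for the type name and `*` for the hash). The function name is hypothetical, and passing a namespaced topic through unchanged is an assumption that the thread does not confirm.

```python
def topic_key_expr(topic, domain_id=0):
    """Build a wildcard key expression for a ROS 2 topic.

    The last two chunks (type name and type hash) are left as '*',
    as suggested above for when they are not known.
    """
    name = topic.strip("/")
    if not name:
        raise ValueError("topic name must not be empty")
    # Assumption: the topic name is used as-is after the domain ID.
    return f"{domain_id}/{name}/*/*"

print(topic_key_expr("/map"))  # 0/map/*/*
```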

@evshary
Author

evshary commented Nov 15, 2024

@alireza-moayyedi
Some updates on my side:
There is no available physical robot in my hand currently, so I tried to reproduce the issue with the simulation.
My environment:

  • Ubuntu 24.04 + ROS 2 Jazzy
  • nav2: Jazzy built from source code
  • rmw_zenoh: dev/1.0.0
  • Two hosts connected with WiFi

1st host:

ros2 run rmw_zenoh_cpp rmw_zenohd
ros2 launch nav2_bringup rviz_launch.py

2nd host:

ros2 run rmw_zenoh_cpp rmw_zenohd
ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False use_rviz:=False

It seems like everything works well. However, I found there is an issue if we run nav2 simulation first and then rviz2.
It also failed even using CycloneDDS. I'm checking whether this is the issue coming from nav2.

@alireza-moayyedi

Hi @evshary @JEnoch ,
Some updates from my side:

I tested the scenario @evshary described with the difference of using nav2's apt release. I know that it has the rviz plugin bug which is fixed in source but well it was manageable. Similar to your results, everything worked fine and smoothly between the host and the remote. So it got me wondering what is the difference between my robot and the tb3 simulation. I compared the two map files and I noticed that:

  • tb3_sandbox.pgm is only 147.5 kB
  • my .pgm is 7.9 MB (it is the map of our office so it is pretty large but still, I can imagine there would be nav2 applications running on much larger maps)

Then I thought let's swap the maps and see what happens. So I did the following tests:

  • Ran the nav2_bringup tb3_simulation_launch.py on the host (robot) with my large map as the input and again similar problems; map not loading in the remote rviz, nav2 not starting properly etc.
  • Ran nav2 and the rest of the robot software on the robot with tb3_sandbox.pgm as the input. Surprisingly, I noticed that the remote rviz loaded everything properly and the full nav2 stack worked completely fine. Of course the map did not correspond to the robot's surroundings, but I just wanted to test if it works. Gave it a goal pose on the remote rviz and everything worked as desired.

Screenshot from 2024-11-15 11-42-31

So I am guessing some configuration is not properly set to accommodate the 7.9 MB map over the wifi, since it works fine when using an ethernet connection or keeping everything on the host. I took a look at the default router config and max_message_size: 1073741824 seems to be fine. I wonder what should be changed. @JEnoch Thanks for the tip; I will test your suggestion next. But considering my observation, do you suggest anything else to change?
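
To put the sizes above in perspective, here is a small back-of-the-envelope sketch. It assumes the published map message is roughly as large as the .pgm file (about one byte per cell), which is an approximation. The link speeds are hypothetical examples, not measurements from this setup.

```python
MAX_MESSAGE_SIZE = 1_073_741_824   # bytes, max_message_size from the default router config
SMALL_MAP = 147.5e3                # bytes, tb3_sandbox.pgm
LARGE_MAP = 7.9e6                  # bytes, the office map

def transfer_seconds(n_bytes, link_mbit_per_s):
    """Ideal time to push n_bytes through a link, ignoring protocol overhead."""
    return n_bytes * 8 / (link_mbit_per_s * 1e6)

print(f"large/small ratio: {LARGE_MAP / SMALL_MAP:.1f}x")
print(f"share of max_message_size: {LARGE_MAP / MAX_MESSAGE_SIZE:.2%}")
for mbit in (10, 50, 1000):  # hypothetical link speeds in Mbit/s
    print(f"{mbit:>5} Mbit/s: {transfer_seconds(LARGE_MAP, mbit):.2f} s")
```

The large map is well under 1% of max_message_size, so the size limit itself is unlikely to be the cause. What does change is that the message is about 54 times larger than the sandbox map and takes correspondingly longer to get through a slow link.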

@alireza-moayyedi

alireza-moayyedi commented Nov 15, 2024

@JEnoch follow up of my previous message; tried the downsampling but unfortunately it did not help.

@JEnoch

JEnoch commented Nov 15, 2024

Thank you for those detailed tests!
We now need to figure out how rviz gets the map from nav2. My guess is the map_server is in charge, right?
Can you please run this command on your laptop with the robot serving your 7.9 MB map:

RUST_LOG=zenoh=trace ros2 service call /map_server/load_map nav2_msgs/srv/LoadMap "{map_url: /ros/maps/map.yaml}" 2>&1 | tee service_call.log

I guess the map_url has to be changed here.

Also, can you share your 7.9 MB map somewhere so we can test it with the simulation and analyse further?

@alireza-moayyedi

Hello @JEnoch ,

Just to summarize, the problem is not the loading of the map by map server, but rather communicating the map over wifi. In other words:

  • The map server loads the map in any case.
  • The host (robot) has access to the map.
    • map topic can be echoed on the host
    • map topic can be visualized on the host rviz
  • The map is not accessible in the remote
    • map topic cannot be echoed on the remote
    • map topic cannot be visualized on the remote rviz

So to make sure we are not missing anything, I have logged and piped the following as you requested:

  • On the remote:
    • ros2 service call /map_server/load_map
    • ros2 topic echo /map
  • On the host:
    • ros2 service call /map_server/load_map
    • ros2 topic echo /map

I will be sending the outputs as well as our map to you and @evshary by email. Thanks for your efforts!

@evshary Can you by any chance test the tb3 (or any other remote/host setup you have) with our map as the input?

@evshary
Author

evshary commented Nov 19, 2024

OK, I managed to reproduce the issue with simpler ROS 2 examples.
We can use the ping & pong here with a larger payload size (10 MB)
https://github.com/ZettaScaleLabs/ros2-simple-performance

  • Host 1: ros2 run simple_performance pong
  • Host 2: ros2 run simple_performance ping --ros-args -p warmup:=1.0 -p size:=10000000 -p samples:=10 -p rate:=1

However, I can use pure Zenoh examples (both Rust and C) to send 100MB payload with the same configuration and environment.
Therefore, I suppose the issue might come from rmw_zenoh.
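
For reference, the bandwidth this reproduction asks of the link can be estimated directly from the `size` and `rate` parameters. This is a rough sketch that ignores protocol overhead. Whether pong echoes the full payload back is an assumption about the tool, so only the one-way figure is computed.

```python
def required_mbit_per_s(payload_bytes, rate_hz):
    """One-way bandwidth needed to sustain payload_bytes at rate_hz."""
    return payload_bytes * 8 * rate_hz / 1e6

size = 10_000_000  # bytes, the `size` parameter used above
rate = 1           # Hz, the `rate` parameter used above
one_way = required_mbit_per_s(size, rate)
print(f"one way: {one_way:.0f} Mbit/s")
```

If the WiFi link cannot sustain that rate, the sender's queue fills up, so what happens to messages when the queue is full becomes relevant.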

@alireza-moayyedi

Hello @evshary,

Nice! Should this be addressed on a separate repo/issue (I assume https://github.com/ros2/rmw_zenoh/issues)? Or will you look into it yourself?

@evshary
Author

evshary commented Nov 19, 2024

> Nice! Should this be addressed on a separate repo/issue (I assume https://github.com/ros2/rmw_zenoh/issues)? Or will you look into it yourself?

I haven't verified it with the rolling branch yet, but I will keep looking into it for sure. At the least, we hope this can be fixed while upgrading to Zenoh 1.0 in rmw_zenoh.

@evshary
Author

evshary commented Nov 25, 2024

@alireza-moayyedi Now we have a branch to fix the issue. It would be great if you could give it a try and let us know whether it works for you. Thank you!
#43

@alireza-moayyedi

Hi @evshary,

I spent some time with the robot and tested different scenarios. The map now loads but the performance is very poor. To be more precise:

  • Things work fine if I keep everything only at the robot. In other words, if I do not run an external zenoh router and try to use a remote rviz.
  • If I run a remote rviz, then the behavior degrades noticeably
    • Everything becomes slow
    • A lot of nav2 lifecycle nodes fail to activate such as:
      • /behavior_server
      • /bt_navigator
      • /collision_monitor
      • /docking_server
      • /global_costmap/global_costmap
      • /planner_server
      • /velocity_smoother
      • /waypoint_follower
    • The /map also doesn't load easily on the remote rviz. Sometimes it loads after quite a long time, and sometimes it loads if I echo the map on the robot. Somehow, echoing it on the robot triggers it to also load remotely.
    • If I stop everything (even the routers) and then try to launch everything again, it becomes even more unstable.
  • If I run the remote rviz first (before running the nav2 stack on the robot), then a different combination of nodes fail to start such as map server or controller server. Hence no map gets loaded in any case then.

@evshary
Author

evshary commented Nov 26, 2024

Hi @alireza-moayyedi
That's bad news. Would you mind helping me clarify something?

  1. In your environment, did the simulation with the large map work well with the branch?
  2. Did nav2 lifecycle nodes also fail to activate before the patch? Just want to make sure this is not introduced by our fix.
  3. Is there any other difference before and after the fix?
  4. Is there any chance to test the throughput/latency of your WiFi environment? Perhaps using iperf and ping.
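
If iperf is not installed on the robot, a rough stand-in for the throughput check can be written with plain sockets. The sketch below is not a replacement for iperf: it times only the sender side, so buffered data makes the number optimistic. The self-test runs both ends on 127.0.0.1; for a real measurement the sink would run on the robot and the sender on the remote computer. All function names are hypothetical.

```python
import socket
import threading
import time

def _sink(server_sock, result):
    """Accept one connection and count bytes until the sender closes."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["received"] = total

def measure_throughput(host, port, n_bytes, chunk_size=65536):
    """Send n_bytes to host:port and return the sender-side rate in Mbit/s."""
    payload = b"\0" * chunk_size
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < n_bytes:
            n = min(chunk_size, n_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    elapsed = time.perf_counter() - start
    return sent * 8 / elapsed / 1e6

def loopback_demo(n_bytes=8_000_000):
    """Run sender and sink on 127.0.0.1 as a self-test."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    result = {}
    worker = threading.Thread(target=_sink, args=(server, result))
    worker.start()
    mbit = measure_throughput("127.0.0.1", server.getsockname()[1], n_bytes)
    worker.join()
    server.close()
    return mbit, result["received"]

if __name__ == "__main__":
    rate, received = loopback_demo()
    print(f"received {received} bytes at ~{rate:.0f} Mbit/s")
```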

I will see how I can reproduce the issue on my side.

@alireza-moayyedi

alireza-moayyedi commented Nov 27, 2024

Hi @evshary ,

I had limited time to check everything you wanted thoroughly so I need to perform more tests later this week again to double check everything. But with a few tests that I did yesterday I observed that:

  1. I couldn't even get the tb3 simulation up. I remember in our earlier discussions we both tested this and it was working fine. So I switched to the latest commit of the dev/1.0.0 branch and also there it was not working properly. When I get back to the robot later this week, I will checkout one of the earlier commits to investigate this. But I have a feeling that something went wrong with the recent commits (I might be wrong though, just a guess, but perhaps you could check this too)
  2. Previously, some lifecycle nodes failed to activate as well, which I assume was mainly because of map_server. You can see that in my earlier comment (nav2 test analysis #38 (comment)). With small maps (e.g., tb3 simulation), I could load everything, had convenient remote access to the robot and could interact with the nav2 stack. However, for larger maps, I couldn't get the map_server up and running, so I could never proceed to check the rest of the nodes. So I cannot fairly compare the current situation, where the map_server activates with a large map but other nodes fail, with the earlier one, where the map_server itself failed to activate with a large map.
  3. The main difference between before and after is that the map_server now works and publishes a large map which is also accessible remotely. But the overall performance is somewhat degraded, which I think might not come from the fix introduced in #43 (Enable CongestionControl::Block if QoS is reliable) but rather from the latest developments (again, just a hypothesis; I need to double check this).
  4. I need to test it with the robot again to give you some exact numbers so I will update you later. But I could already tell you that overall, the latency is pretty low (pings below 10ms) and on the same setup, I already had tb3 simulation with the sandbox map working with remote access.
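
For readers following along, the title of #43 refers to two congestion-control behaviours: dropping data when the transmit queue is full, or blocking the sender until there is room. The toy model below is not rmw_zenoh or Zenoh code. It is a deterministic sketch with made-up numbers, showing why a large message split into many fragments may not arrive complete under a drop policy on a slow link, while a blocking policy delivers everything at the cost of time.

```python
def send_fragments(n_fragments, offer_per_tick, drain_per_tick,
                   queue_capacity, policy):
    """Toy model of a bounded transmit queue.

    policy "drop":  a fragment offered to a full queue is discarded.
    policy "block": the sender stops and retries on the next tick.
    Returns (delivered, dropped, ticks).
    """
    if policy not in ("drop", "block"):
        raise ValueError(f"unknown policy: {policy!r}")
    queued = delivered = dropped = ticks = 0
    remaining = n_fragments
    while remaining > 0 or queued > 0:
        ticks += 1
        # The sender offers up to offer_per_tick fragments this tick.
        for _ in range(min(offer_per_tick, remaining)):
            if queued < queue_capacity:
                queued += 1
                remaining -= 1
            elif policy == "drop":
                dropped += 1
                remaining -= 1
            else:  # "block": wait until the queue has room
                break
        # The link drains up to drain_per_tick fragments this tick.
        out = min(drain_per_tick, queued)
        queued -= out
        delivered += out
    return delivered, dropped, ticks

# Hypothetical numbers: the sender is four times faster than the link.
for policy in ("drop", "block"):
    print(policy, send_fragments(100, 4, 1, 8, policy))
```

In this model the blocking sender needs more ticks, which would be consistent with the map now arriving but everything feeling slower. That is only an analogy, not a diagnosis of the behaviour reported here.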

@evshary
Author

evshary commented Nov 27, 2024

Hi @alireza-moayyedi

No worries. We always appreciate the early adopters who give us valuable feedback.

For 1, it's interesting. I will try the latest dev/1.0.0 on my side again, but at least the branch adjust_qos works on my side.
For 2 and 3, let's check point 1 first. Indeed, some of the latest commits might have broken something.
For 4, it should then not be the issue of the WiFi environment. My environment is even worse than yours.

Besides, it would be great if you could share the specs of your robot (I mean the computer on your robot). I'm currently using a laptop and an IPC to run the simulation, which are powerful enough. I just want to check whether the issue is related to a resource-limited device.
It's also helpful if you can record a video/gif to let us feel how slow it is 😄

@evshary
Author

evshary commented Nov 28, 2024

> For 1, it's interesting. I will try the latest dev/1.0.0 on my side again

Okay, I've verified it again. It still works on my side with this commit 435186a

Let's focus on the simulation first and see what the difference is between our environment. Then we can move on to your real robot.
