🧑‍🌾 Flaky test demo_nodes_cpp.TestExecutablesTutorial.test_processes_output failing variations from connextdds #693

Crola1702 opened this issue May 22, 2024 · 4 comments

Crola1702 commented May 22, 2024

Bug report

Required Info:

  • Operating System:
    • Linux, Windows and RHEL
  • Installation type:
    • Source
  • Version or commit hash:
    • Rolling
  • DDS implementation:
    • ConnextDDS

Steps to reproduce issue

  1. Run a build in a Linux or Windows nightly job
  2. See the test regression fail

Description

There is a parent test regression, demo_nodes_cpp.TestExecutablesTutorial.test_processes_output, failing in different variations on ConnextDDS.

Failing test regressions:

Log output (test_tutorial_parameter_events_async__rmw_connextdds):
FAIL: test_processes_output (demo_nodes_cpp.TestExecutablesTutorial.test_processes_output)
Test all processes output against expectations.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/build/demo_nodes_cpp/test_parameter_events_async__rmw_connextdds_Debug.py", line 74, in test_processes_output
    proc_output.assertWaitFor(
  File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/launch_testing/lib/python3.12/site-packages/launch_testing/io_handler.py", line 146, in assertWaitFor
    assert success, 'Waiting for output timed out'
           ^^^^^^^
AssertionError: Waiting for output timed out
Log output (test_tutorial_parameter_events__rmw_connextdds):
FAIL: test_processes_output (demo_nodes_cpp.TestExecutablesTutorial)
Test all processes output against expectations.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ci\ws\build\demo_nodes_cpp\test_parameter_events__rmw_connextdds_Release.py", line 74, in test_processes_output
    proc_output.assertWaitFor(
  File "C:\ci\ws\install\Lib\site-packages\launch_testing\io_handler.py", line 146, in assertWaitFor
    assert success, 'Waiting for output timed out'
AssertionError: Waiting for output timed out

Flakiness report (projectroot.test_tutorial_parameter_events_async__rmw_connextdds):

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_linux_debug | 2024-05-22 | 2024-05-09 | 15 | 7 | 46.67 |
| nightly_linux_repeated | 2024-05-19 | 2024-05-08 | 16 | 2 | 12.5 |
| nightly_win_deb | 2024-05-15 | 2024-05-07 | 9 | 5 | 55.56 |

Flakiness report (projectroot.test_tutorial_parameter_events__rmw_connextdds):

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-05-22 | 2024-05-07 | 13 | 13 | 100.0 |
| nightly_win_rel | 2024-05-22 | 2024-05-22 | 14 | 1 | 7.14 |
| nightly_linux-rhel_repeated | 2024-05-21 | 2024-05-12 | 16 | 3 | 18.75 |
| nightly_win_deb | 2024-05-12 | 2024-05-09 | 9 | 2 | 22.22 |

I don't see any specific change that explains why it started failing more often in normal jobs, and not only in repeated ones (package history).

projectroot.test_tutorial_parameter_events_async__rmw_connextdds

  • In linux debug it started happening with the March 31 build 3012
  • In linux repeated it started happening with the April 23 build 3433
  • In windows debug it started happening with the March 29 build 3047

projectroot.test_tutorial_parameter_events__rmw_connextdds:

  • In windows repeated this test started failing when parallel testing was enabled (reference build) on Feb 25 and has been failing almost consistently since.
  • In windows release it happened only today
  • In windows debug it started happening with the Feb 29 build 3018 (just after parallel testing)
  • In rhel repeated it started happening with the March 8 build 1859
@fujitatomoya (Collaborator) commented:

I took a look at this flaky test issue with connextdds.

1st, this is not reproducible in my local dev environment...

2nd, 30 seconds should be long enough to receive the parameter events:

https://github.com/ros2/demos/blob/rolling/demo_nodes_cpp/test/test_executables_tutorial.py.in#L74-L78
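
For context, the check that times out is launch_testing's `assertWaitFor`. The following is only a paraphrased sketch of the referenced lines, not the literal contents of the linked file; `proc_output` is the launch_testing io fixture passed into the active test, and `process_under_test` / `expected_output` are stand-in names, with the 30-second timeout taken from the comment above.

```python
# Sketch of the failing assertion inside the generated launch test:
# block until the expected parameter-event lines appear in the process
# output, or raise "Waiting for output timed out" after the timeout.
proc_output.assertWaitFor(
    expected_output=expected_output,   # the lines the demo should print
    process=process_under_test,        # the demo executable under test
    timeout=30,                        # "30 seconds should be long enough"
)
```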

In both cases, https://ci.ros2.org/view/nightly/job/nightly_linux_debug/3064/testReport/junit/(root)/projectroot/test_tutorial_parameter_events_async__rmw_connextdds/ and https://ci.ros2.org/view/nightly/job/nightly_win_rep/3362/testReport/junit/(root)/projectroot/test_tutorial_parameter_events__rmw_connextdds/, the last 2 events below are missing.

Parameter event:
new parameters:
changed parameters:
foo
deleted parameters:
Parameter event:
new parameters:
changed parameters:
bar
deleted parameters:

3rd, the QoS for parameter events is reliable enough: https://github.com/ros2/rmw/blob/22f59f8931944999864ef3b0d7aa75ab7258f028/rmw/include/rmw/qos_profiles.h#L77-L88
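
The same profile is also exposed in rclpy, so the settings used for /parameter_events are easy to double-check. A quick sketch, assuming the rclpy constant mirrors the rmw profile linked above:

```python
# Print the QoS settings rclpy uses for /parameter_events; they should
# mirror rmw_qos_profile_parameter_events from the linked header
# (reliable reliability, keep-last history, volatile durability).
from rclpy.qos import qos_profile_parameter_events

print('reliability:', qos_profile_parameter_events.reliability)
print('history:', qos_profile_parameter_events.history,
      'depth:', qos_profile_parameter_events.depth)
print('durability:', qos_profile_parameter_events.durability)
```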

After all, I could not find why the last 2 events are missing only with connextdds; probably connextdds already misses those messages.
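
One way to narrow this down outside the test harness would be a standalone listener on /parameter_events, run next to the demo under rmw_connextdds: if the last two events never show up there either, the drop happens below the demo code. This is only a debugging sketch, not part of the demo package:

```python
# Minimal /parameter_events listener for manual debugging: logs every
# ParameterEvent it receives so missing events are easy to spot.
import rclpy
from rcl_interfaces.msg import ParameterEvent
from rclpy.node import Node
from rclpy.qos import qos_profile_parameter_events


class ParameterEventListener(Node):

    def __init__(self):
        super().__init__('parameter_event_listener')
        self._sub = self.create_subscription(
            ParameterEvent, '/parameter_events', self.on_event,
            qos_profile_parameter_events)

    def on_event(self, event):
        self.get_logger().info(
            'event from {}: new={} changed={} deleted={}'.format(
                event.node,
                [p.name for p in event.new_parameters],
                [p.name for p in event.changed_parameters],
                [p.name for p in event.deleted_parameters]))


def main():
    rclpy.init()
    rclpy.spin(ParameterEventListener())


if __name__ == '__main__':
    main()
```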

@Crola1702 (Author) commented:

Flaky ratio for demo_nodes_cpp.TestExecutablesTutorial.test_processes_output:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-09-08 | 2024-08-25 | 13 | 10 | 76.92 |
| nightly_win_deb | 2024-09-08 | 2024-08-25 | 16 | 10 | 62.5 |
| nightly_win_rel | 2024-09-03 | 2024-09-03 | 16 | 1 | 6.25 |
| nightly_linux_repeated | 2024-09-03 | 2024-09-03 | 9 | 1 | 11.11 |

Flaky ratio of projectroot.test_tutorial_parameter_events_async__rmw_connextdds:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_deb | 2024-09-08 | 2024-08-25 | 16 | 8 | 50.0 |
| nightly_win_rep | 2024-09-06 | 2024-08-31 | 13 | 2 | 15.38 |

Flaky ratio of projectroot.test_tutorial_parameter_events__rmw_connextdds:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
| --- | --- | --- | --- | --- | --- |
| nightly_win_rep | 2024-09-08 | 2024-08-25 | 13 | 9 | 69.23 |
| nightly_win_deb | 2024-09-08 | 2024-08-26 | 16 | 5 | 31.25 |
| nightly_linux_repeated | 2024-09-03 | 2024-09-03 | 9 | 1 | 11.11 |

@clalancette (Contributor) commented:

I poked at this a bit, and while I'm not 100% sure, I think this was caused by ros2/rclcpp#2142. At least, if I check out a workspace from before that change, I can't make it happen anymore.

There is no way to revert that change at this point, so we'll have to do some additional poking at the executors and see what we can find here.

@Crola1702 (Author) commented:

Friendly ping @clalancette
