Cyclone DDS fails nightly CI tests #2043

Closed
ruffsl opened this issue Oct 14, 2020 · 12 comments

Comments

@ruffsl
Member

ruffsl commented Oct 14, 2020

Nightly CI tests are failing for Cyclone DDS: the release_test-rmw_cyclonedds_cpp job has never passed since it was added to the nightly CI workflow. It would be nice to get this fixed so the navigation2 project could firewatch and support more alternative RMW implementations.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Version or commit hash:
  • DDS implementation:
    • rmw_cyclonedds_cpp

Steps to reproduce issue

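# On the host: build the image and start a container with the log directory mounted
git clone https://github.com/ros-planning/navigation2.git
cd navigation2
docker build --pull -t nav2:latest .
mkdir -p /tmp/overlay_ws/log
docker run -it --rm \
    -v /tmp/overlay_ws/log:/opt/overlay_ws/log \
    nav2:latest \
    bash
# Inside the container: run the tests with Cyclone DDS as the RMW implementation
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
colcon test
colcon test-result --verbose
# On the host: inspect the test logs shared through the mounted volume
cd /tmp/overlay_ws/log/latest_test/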

Expected behavior

https://app.circleci.com/pipelines/github/ros-planning/navigation2/4004/workflows/8ebf1205-a43d-4fbe-bdc3-f15c6c605e05/jobs/15392

Actual behavior

https://app.circleci.com/pipelines/github/ros-planning/navigation2/4004/workflows/8ebf1205-a43d-4fbe-bdc3-f15c6c605e05/jobs/15391

Additional information

batman signal to rmw_cyclonedds_cpp maintainers: 📢

CC @eboasson @rotu @hidmic @ivanpauno

@joespeed

@k0ekk0ek @dennis-adlink @rotu help please for @ruffsl

@ivanpauno

ivanpauno commented Oct 14, 2020

It would be great if a more "minimal" example can be provided.

My wild guess is that this could be related to ros2/rmw_cyclonedds#74 and ros2/rmw_cyclonedds#191.

If a service request is "lost" (because of service discovery issues), that could cause the error output seen in CircleCI.
(The lifecycle manager uses services heavily.)

Relevant part of the CircleCI output:

[test_lifecycle_node_gtest-2] [WARN] [1602681393.619307369] [rcl_lifecycle]: No transition matching 4 found for current state unconfigured
[test_lifecycle_node_gtest-2] [ERROR] [1602681393.619319386] []: Unable to start transition 4 from current state unconfigured: Transition is not registered., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:324
[lifecycle_manager-1] [ERROR] [1602681393.619515864] [lifecycle_manager_test]: Failed to change state for node: lifecycle_node_test
[lifecycle_manager-1] [INFO] [1602681393.619530254] [lifecycle_manager_test]: Cleaning up lifecycle_node_test
[test_lifecycle_node_gtest-2] [WARN] [1602681393.619593119] [rcl_lifecycle]: No transition matching 2 found for current state unconfigured
[test_lifecycle_node_gtest-2] [ERROR] [1602681393.619608354] []: Unable to start transition 2 from current state unconfigured: Transition is not registered., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:324

@ruffsl
Member Author

ruffsl commented Oct 14, 2020

It would be great if a more "minimal" example can be provided.

It'd be nice to get all 4 of the failing tests working, but as you've noted, beginning with LifecycleClientTest would be helpful.
You can look through the artifacts tab to find the recorded test results and logs. E.g.:

10: [lifecycle_manager-1] [ERROR] [1602681388.886669965] [lifecycle_manager_test]: CRITICAL FAILURE: SERVER bond_tester IS DOWN after not receiving a heartbeat for 4000 ms. Shutting down related nodes.
...
10: [lifecycle_manager-1] [ERROR] [1602681391.498473821] [lifecycle_manager_test]: Server bond_tester was unable to be reached after 4.00s by bond. This server may be misconfigured.
10: [lifecycle_manager-1] [ERROR] [1602681391.498547490] [lifecycle_manager_test]: Failed to bring up all requested nodes. Aborting bringup.

https://github.com/ros-planning/navigation2/blob/3393c0adb19e66e2bf2d655fd3dba5f698e1193c/nav2_lifecycle_manager/test/test_lifecycle_manager.cpp#L82

@eboasson

So far I'm not having much luck in reproducing locally: my "docker build" step fails because of errors building the gazebo_ros test_plugins. I'll try to get past that, because I really want to know the root cause now that I am back from vacation.

Anyway, while I agree with @ivanpauno's observation, by design there should be only a single way for that particular failure to occur if the application code is well-behaved (i.e., service doesn't get destroyed when it still has a request to service), and that is a time-out on the checking on the service side. If someone happens to know whether the service's attempt at publishing the response reports an error or not, that'd be helpful. Otherwise, I'm sure I'll eventually reproduce it and will be able to see for myself.

I think it is quite likely that the Cyclone trace files contain enough data to determine exactly what happened (regardless of whether the service ran into an error). Once I can reproduce it, I'll be able to get those. In principle, anyone who can reproduce the issue can get them; all it takes is configuring the tracing. But with many tests and many executables it gets messy quickly, and I don't want to ask that of others.

@eboasson

eboasson commented Nov 2, 2020

Does anyone know what version of gtest I am supposed to use?

With an up-to-date Ubuntu 20.04, gtest-dev 1.10.0-2, up-to-date Foxy installed on it and following the reproduction process described here, as well as with last night's https://raw.githubusercontent.com/ros2/ros2/master/ros2.repos and a full source build, I'm running into build failures in the tests. (The gazebo_ros ones I have simply skipped over, but that, it seems to me, won't do for the system tests of nav2 itself):

In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:38: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:38: note: in definition of macro ‘INSTANTIATE_TEST_CASE_P’
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:73: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                                                         ^
/home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:79:1: note: in expansion of macro ‘INSTANTIATE_TEST_CASE_P’
   79 | INSTANTIATE_TEST_CASE_P(
      | ^~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:73: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                                                         ^
/home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:79:1: note: in expansion of macro ‘INSTANTIATE_TEST_CASE_P’
   79 | INSTANTIATE_TEST_CASE_P(
      | ^~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/spin/test_spin_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:38: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:38: note: in definition of macro ‘INSTANTIATE_TEST_CASE_P’
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:73: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                                                         ^
/home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:83:1: note: in expansion of macro ‘INSTANTIATE_TEST_CASE_P’
   83 | INSTANTIATE_TEST_CASE_P(
      | ^~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:67,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest-param-test.h:496:73: error: ‘constexpr bool testing::internal::InstantiateTestCase_P_IsDeprecated()’ is deprecated: INSTANTIATE_TEST_CASE_P is deprecated, please use INSTANTIATE_TEST_SUITE_P [-Werror=deprecated-declarations]
  496 |   static_assert(::testing::internal::InstantiateTestCase_P_IsDeprecated(), \
      |                                                                         ^
/home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:83:1: note: in expansion of macro ‘INSTANTIATE_TEST_CASE_P’
   83 | INSTANTIATE_TEST_CASE_P(
      | ^~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/gtest.h:62,
                 from /home/erik/navigation2_ws/src/navigation2/nav2_system_tests/src/recoveries/wait/test_wait_recovery_node.cpp:16:
/home/erik/ros2_ws/install/gtest_vendor/src/gtest_vendor/include/gtest/internal/gtest-internal.h:1210:16: note: declared here
 1210 | constexpr bool InstantiateTestCase_P_IsDeprecated() { return true; }
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(P.S. Does anyone think it is reasonable for a ROS2 build to take 25GB!? Not everyone has tons of space available in their VMs on their laptops ... and a native macOS build appears to be "non-trivial" as well.)

@ivanpauno

With an up-to-date Ubuntu 20.04, gtest-dev 1.10.0-2

From the logs, it seems that the gtest version being used is the one shipped with ROS 2, but that's 1.10 too (it was recently updated).

We were using 1.8 before, and some things were deprecated between both gtest versions.
I guess nav2 wasn't updated to avoid those warnings yet.

Does anyone think it is reasonable for a ROS2 build to take 25GB!?

Yeah, it sounds like a lot, but I'm not sure if that can be improved.

and a native macOS build appears to be "non-trivial" as well

I'm not a macOS user, but I agree that we require some "non-trivial" steps (like disabling SIP).

@ruffsl
Member Author

ruffsl commented Nov 2, 2020

Does anyone know what version of gtest I am supposed to use?

It looks like there is an upstream build issue related to gtest: ros-simulation/gazebo_ros_pkgs#1183

(P.S. Does anyone think it is reasonable for a ROS2 build to take 25GB!?

ROS has a rather large set of build dependencies if you are building from scratch. It's not so much an issue if you are building downstream packages from released binaries/libs. You could try and build the project using foxy instead of ROS2 nightly:

docker build --pull -t nav2:latest \
    --build-arg=FROM_IMAGE=ros:foxy .

@dpotman

dpotman commented Nov 9, 2020

The test failure in test_lifecycle_manager.cpp:87 seems to be caused by bad timing in the test. An is_active request is sent, and a time-out is expected after 1us. From the Cyclone trace logs I see that the client receives the is_active response (obviously after more than 1us), but because of task scheduling on the test server there is no guarantee that the sleep lasts only 1us. So after the 1us sleep the client may already have received the reply, which causes the test to fail. This also happens for the same test using Fast RTPS, see e.g. https://app.circleci.com/pipelines/github/ros-planning/navigation2/4166/workflows/9fc14d29-df30-4515-ab27-9a541684eed1/jobs/15789

The test failure in costmap_downsample_test is caused by creating two topics with the same name, 'unused_topic', but different types: nav2_msgs::msg::Costmap for the subscriber and nav_msgs::msg::OccupancyGrid for the publishers. This is not allowed by the DCPS specification and causes an error in dds_create_topic in Cyclone. It can easily be fixed by renaming these topics in this test case.

@SteveMacenski
Member

SteveMacenski commented Nov 9, 2020

Keep in mind that our PR builder uses Fast RTPS, where this is a rare, occasionally flaky test. On Cyclone, however, it fails deterministically. So from an external user's perspective, Cyclone has a problem, since the behavior should be consistent across the two and isn't. The other possibility is that Fast RTPS doesn't handle something properly that Cyclone does; if so, some notes on why that occurs and how to resolve it would be appreciated. For the costmap downsampler you laid out pretty clearly how to fix it, but for the lifecycle manager, I don't see a clear solution.

Any PRs to resolve these would also be appreciated, especially since most nav2 users are on Cyclone.

@ivanpauno

but for the lifecycle manager, I don't see a clear solution.

You can swap L86 and L87-89 in the test; that should make it pass reliably.

@SteveMacenski
Member

#2080

@SteveMacenski
Member

https://app.circleci.com/pipelines/github/ros-planning/navigation2/4179/workflows/9d63e472-bf4c-474c-a191-7bd3de64a04f/jobs/15829

First-ever successful Cyclone CI job. Thanks @dennis-adlink @eboasson for looking into it for us.
