
👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs #862

Open
j-rivero opened this issue Sep 3, 2021 · 16 comments
Labels
bug (Something isn't working), tests (Failing or missing tests)

Comments

@j-rivero commented Sep 3, 2021

Description

The following tests have started to fail consistently (three days in a row) in the CI of https://ci.ros2.org/job/nightly_linux_repeated/:

  • rosbag2_transport.test_play_services__rmw_fastrtps_cpp
  • rosbag2_transport.test_play_timing__rmw_fastrtps_cpp.gtest

If I'm not wrong, the build shows that the commit used is 891e081, which corresponds to pull #848.

Expected Behavior

Tests should pass :)

Actual Behavior

Timeout


2_transport/test_results/rosbag2_transport/test_play_timing__rmw_fastrtps_cpp.gtest.xml
30: Running main() from gmock_main.cc
30: [==========] Running 3 tests from 1 test suite.
30: [----------] Global test environment set-up.
30: [----------] 3 tests from PlayerTestFixture
30: [ RUN      ] PlayerTestFixture.playing_respects_relative_timing_of_stored_messages
30: [       OK ] PlayerTestFixture.playing_respects_relative_timing_of_stored_messages (1025 ms)
30: [ RUN      ] PlayerTestFixture.playing_respects_rate
    Test #30: test_play_timing__rmw_fastrtps_cpp .................***Timeout  60.05 sec
test 31
      Start 31: test_play_services__rmw_fastrtps_cpp

31: Test command: /home/jenkins-agent/workspace/nightly_linux_repeated/venv/bin/python3.8 "-u" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml" "--package-name" "rosbag2_transport" "--output-file" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/ament_cmake_gmock/test_play_services__rmw_fastrtps_cpp.txt" "--env" "RMW_IMPLEMENTATION=rmw_fastrtps_cpp" "--command" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_play_services__rmw_fastrtps_cpp" "--gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml"
31: Test timeout computed to be: 60
31: -- run_test.py: extra environment variables:
31:  - RMW_IMPLEMENTATION=rmw_fastrtps_cpp
31: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport':
31:  - /home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_play_services__rmw_fastrtps_cpp --gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml
31: Running main() from gmock_main.cc
31: [==========] Running 5 tests from 1 test suite.
31: [----------] Global test environment set-up.
31: [----------] 5 tests from PlaySrvsTest
31: [ RUN      ] PlaySrvsTest.pause_resume
31: [       OK ] PlaySrvsTest.pause_resume (572 ms)
31: [ RUN      ] PlaySrvsTest.toggle_paused
31: [       OK ] PlaySrvsTest.toggle_paused (487 ms)
31: [ RUN      ] PlaySrvsTest.set_rate_good_values
    Test #31: test_play_services__rmw_fastrtps_cpp ...............***Timeout  60.06 sec
test 32

To Reproduce

Check CI job

System (please complete the following information)

  • OS: Ubuntu Focal
  • ROS 2 Distro: nightly
  • Version: latest ros2.repos file
@j-rivero added the bug (Something isn't working) and tests (Failing or missing tests) labels Sep 3, 2021
@Blast545 (Contributor)

After #863 was merged, this has been failing in 21 out of the latest 35 (60%) debug builds. The failing test is playing_respects_delay. I haven't checked the repeated jobs, but right now it's failing frequently in the nightly debug jobs.

Most recent case:
https://build.ros2.org/view/Rci/job/Rci__nightly-debug_ubuntu_focal_amd64/472/

@Blast545 (Contributor)

Without taking the repeated jobs into account, this one has occurred 13 times in the last 20 days, in nightly_win_rel, nightly_win_deb, nightly_linux_release, nightly_linux_aarch_release and Rci__nightly-release_ubuntu_jammy_amd64.

Here's a recent reference: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/testReport/junit/(root)/rosbag2_transport/test_play_services__rmw_fastrtps_cpp_gtest_missing_result/

It affects all the rmw vendors. It is related to #732, but this one happens more often.

@Blast545 changed the title 🧑‍🌾 Regression in test_play_{timing,services}__rmw_fastrtps on linux_repeated → 🧑‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs Apr 26, 2022
@Blast545 changed the title 🧑‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs → 👨‍🌾 Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs Apr 26, 2022
@MichaelOrlov (Contributor)

@Blast545 I've spent some of my time trying to analyze the failures in these tests:

  • Depth: 1
  • Reliability: Best effort
  • Durability: Volatile

With this QoS there is no guarantee that messages will be delivered at the transport layer. Therefore the tests likely fail because we try to detect the clock rate from the timestamps of two neighboring messages, and some of them get lost at the transport layer.

These tests are flaky by design, and I honestly don't know how to rewrite them to be more deterministic.
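
For illustration, a minimal sketch of a subscription created with that QoS profile (the node and topic names here are assumptions for the example, not the actual rosbag2 test code). With a history depth of 1, best effort reliability and volatile durability, the middleware is allowed to drop samples:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <rosgraph_msgs/msg/clock.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("qos_example");

  // Depth 1, best effort, volatile: no delivery guarantee from the middleware.
  rclcpp::QoS qos{rclcpp::KeepLast(1)};
  qos.best_effort().durability_volatile();

  auto sub = node->create_subscription<rosgraph_msgs::msg::Clock>(
    "/clock", qos,
    [](rosgraph_msgs::msg::Clock::SharedPtr msg) {
      // A timing test would record arrival timestamps here; with this QoS,
      // some samples may simply never arrive.
      (void)msg;
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```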

Stack trace:
C:\ci\ws\src\ros2\rosbag2\rosbag2_transport\test\rosbag2_transport\test_record_services.cpp:93
Value of: pub_manager.wait_for_matched(test_topic_.c_str())
  Actual: false
Expected: true

This is a very strange failure which I wouldn't expect to happen.
Basically it fails to detect a matching subscription within 10 seconds, while the test and the implementation look fine.
What I don't trust 100% is that the test uses a SingleThreadedExecutor for the recorder, i.e. for the subscriptions that the failing check waits on.
Maybe rewriting the test to use std::thread directly, without the SingleThreadedExecutor, would help.
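
For context, a hedged sketch of what a wait_for_matched-style check could look like (this is an assumption about the test helper, not its actual implementation): it polls the publisher's matched-subscription count until a deadline, and the EXPECT above fails when discovery never completes in time.

```cpp
#include <chrono>
#include <thread>

#include <rclcpp/rclcpp.hpp>

// Hypothetical helper: wait until at least one subscription (e.g. the
// recorder's) has matched the given publisher, or give up after `timeout`.
bool wait_for_matched(
  const rclcpp::PublisherBase & publisher,
  std::chrono::seconds timeout = std::chrono::seconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (publisher.get_subscription_count() > 0) {
      return true;  // the subscription was discovered in time
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return false;  // discovery never completed, so the test's EXPECT fails
}
```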

@MichaelOrlov (Contributor)

@Blast545 BTW, I see that rosbag2_transport.test_play_services__rmw_fastrtps_cpp periodically fails on CI.
Here is one of the examples: https://build.ros2.org/job/Hpr__rosbag2__ubuntu_jammy_amd64/1/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/
I am curious whether it fails only with rmw_fastrtps.

I've made a brief analysis of the failure.
I see that it happens in play_next_response = successful_call(cli_play_next_); on iteration 195 out of 200, and the failure repeats twice with the same numbers.
It looks like 195 is some magic number specific to the Fast RTPS DDS, maybe the maximum of some internal resource.

successful_call(cli_play_next_) consistently fails at line 101:
EXPECT_EQ(future.wait_for(service_call_timeout_), std::future_status::ready);

It would actually be better to ask someone from the Fast RTPS support team to look at this failure.

Meanwhile, to mitigate it, I would suggest increasing const std::chrono::seconds service_call_timeout_ {1}; up to 3 seconds and decreasing the number of messages to publish, const size_t num_msgs_to_publish_ = 200;, to 150.
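
For reference, a minimal sketch of one service round-trip with a timeout check along these lines (hypothetical code, not the rosbag2 test itself; the rosbag2_interfaces::srv::PlayNext type and the 3-second timeout are assumptions based on this discussion, and the real test spins the client node on a separate thread and checks future.wait_for() as quoted above):

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <rosbag2_interfaces/srv/play_next.hpp>

// Hypothetical helper: send one PlayNext request and report whether the
// response arrived before the timeout, similar in spirit to successful_call().
bool call_play_next_once(
  const rclcpp::Node::SharedPtr & node,
  const rclcpp::Client<rosbag2_interfaces::srv::PlayNext>::SharedPtr & client,
  std::chrono::seconds timeout = std::chrono::seconds(3))
{
  auto request = std::make_shared<rosbag2_interfaces::srv::PlayNext::Request>();
  auto future = client->async_send_request(request);

  // Spin this node until the response arrives or the timeout elapses.
  // A lost response shows up here as a timeout, not as an error.
  return rclcpp::spin_until_future_complete(node, future, timeout) ==
         rclcpp::FutureReturnCode::SUCCESS;
}
```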

@Blast545 (Contributor)

Thanks for digging into this! @MichaelOrlov

Yeah, rosbag2_transport.test_play_services__rmw_fastrtps_cpp is the remaining flaky test affecting us the most. I have seen it in various rosbag2_transport PRs.

On Linux it happens only with rmw_fastrtps and, in case it rings a bell, only on release builds:
Rci__nightly-release_ubuntu_jammy_amd64#74
nightly_linux_release#2276
nightly_linux-rhel_release#1127

It happens on other distributions with `rmw_cyclonedds`, but that is not a common scenario.

On the nightly_win_rep it only fails with rmw_cyclonedds:
https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/

On nightly_linux_repeated it can fail either only with rmw_fastrtps (more common) or with both, like:
https://ci.ros2.org/view/nightly/job/nightly_linux_repeated/2642

And there's an ultra-rare scenario where it fails for all three rmw vendors on nightly_win_deb and nightly_win_rel:
https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/
https://ci.ros2.org/view/nightly/job/nightly_win_deb/2334/

I will open a PR with your suggestion tomorrow morning, @MichaelOrlov, and get more feedback there as needed.
@clalancette should we ping someone in particular from the fastrtps team to take a look?

@MichaelOrlov (Contributor)

@Blast545 I see how the test fails on rmw_cyclonedds from your link https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/.
It fails in another test, toggle_paused, with just a few service-call iterations, and it seems that the 1 second timeout for the service call is not enough in some cases, probably on heavily loaded hardware or network stacks.

Let's try to increase const std::chrono::seconds service_call_timeout_ {1}; up to 5 seconds and see whether CI then only fails on fastrtps at iteration 195.
If that is the case, it makes sense to ask someone from the fastrtps team to take a look.

@MichaelOrlov (Contributor)

@Blast545 @clalancette I have good news about this annoying failing test_play_services test.

First of all, I was able to reproduce it locally by putting some extra load on my machine with the stress -m 60 command. Please note that I was able to reproduce the same failure with both rmws, FastRTPS and CycloneDDS, and not only for the play_next test; basically the same as what we have seen on CI.

The second piece of good news is that I found the breaking PR and commit:
it is "Update client API to be able to remove pending requests" rclcpp#1734 and the relevant commit ros2/rclcpp@679fb2b.

I reverted commit ros2/rclcpp@679fb2b locally and the failure doesn't reproduce any more.

@ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point, since you were the author of the breaking commit ros2/rclcpp@679fb2b?

@ivanpauno (Member)

> @ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point, since you were the author of the breaking commit ros2/rclcpp@679fb2b?

Could you summarize the analysis you have done up to now?

The problem seems to be a race.
Maybe ros2/rclcpp@679fb2b introduced a race (though that seems unlikely based on the code changes), but it might also be a pre-existing bug that became more likely to happen after that commit.

@MichaelOrlov (Contributor)

Hi @ivanpauno, sorry for my late response.
My preliminary analysis is the following:

  1. We have been observing a subset of tests failing or found them to be flaky. Those tests are sort of "integration tests" in the rosbag2_transport package and involve sending requests and receiving responses via service calls; in particular, the affected tests are in test_play_services.
  2. I've found that all the failing tests have a common design: they expect to receive a response back from the "server" as an answer to the request, for instance the play_next or is_paused requests, which are supposed to return a boolean value.
  3. The failure happens in successful_call(..) when we do not get a value from the future due to the timeout. I've tried to increase that timeout up to 30 seconds, but it doesn't help; I observe the same failure. It's difficult to analyse this failure since the response simply doesn't arrive, and it doesn't happen very often.
  4. I've tried to optimize the unit tests and add some debug info to exclude possible overlap with other use cases and flows. I also thoroughly reviewed the rosbag2 source code responsible for sending the response and found no issues; everything is clear and well written. I've also run valgrind with those failing tests and haven't found anything that could cause memory corruption.
  5. In most cases the failure happens with rmw_fastrtps, but we have seen cases where the same failure happens on rmw_cyclonedds, for example here https://pipelines.actions.githubusercontent.com/serviceHosts/af0a8eef-e408-4986-ae27-20f00fbcb6f9/_apis/pipelines/1/runs/26054/signedlogcontent/2?urlExpires=2022-06-23T00%3A53%3A29.4599815Z&urlSigningMethod=HMACV1&urlSignature=K%2FQgv3z8x4SAtVeznmXFKGdg05Bur8EcI5VCv4ZaJl0%3D (search for test_play_services__rmw_cyclonedds_cpp).
  6. I was able to reproduce those failures locally by loading my machine with other tasks via the stress -m 60 command.
  7. I found that without ros2/rclcpp@679fb2b the tests perform much faster under heavy load on the machine. It seems ros2/rclcpp@679fb2b also introduces a performance regression for services.

Please let me know if you need more information or details about this issue, or if something is unclear.

@fujitatomoya (Contributor)

Could be related to ros2/rmw_fastrtps#616.

@MichaelOrlov (Contributor)

> Could be related to ros2/rmw_fastrtps#616.

@fujitatomoya Unlikely, since in test_play_services we are not sending many requests in a burst.
We send service requests one by one, and each time we verify that we get the corresponding response from the "server" before sending the next request.
At some iteration we just lose the response.

@fujitatomoya (Contributor)

@MichaelOrlov thanks for the comment. I wasn't sure; it just came to mind.

@ivanpauno (Member) commented Jul 15, 2022

I was able to reproduce the issue. I don't fully understand why it happens yet (and it's pretty hard to reproduce).
I will post here if I have further updates.

@MichaelOrlov (Contributor)

Could be related to ros2/rclcpp#2039, eProsima/Fast-DDS#3087 and osrf/docker_images#628.
