test_parameter_server_cpp times out #420

Closed · jacobperron opened this issue Apr 13, 2020 · 8 comments
Labels: bug (Something isn't working)

jacobperron (Member) commented Apr 13, 2020

Bug report

Required Info:

  • Operating System:
    • All platforms
  • Installation type:
    • source
  • Version or commit hash:
    • master
  • DDS implementation:
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Here is an example failure:

https://ci.ros2.org/view/nightly/job/nightly_osx_repeated/1918/testReport/junit/(root)/projectroot/test_parameter_server_cpp__rmw_fastrtps_cpp/

Note, the "failed to set parameter" messages are expected output of the test.

I'm able to reproduce locally on Linux (note the test is flaky, but the failure happens more often than not):

colcon test --packages-select test_rclcpp --ctest-args -R "test_parameter_server_cpp__rmw_fastrtps_cpp" --event-handlers console_direct+

Expected behavior

Tests pass.

Actual behavior

Tests fail (timeout)

Additional information

With #419 checked out, we can see that the different tests hang at different points in the execution.

With additional print statements, I've been able to determine that the timeouts occur because the parameter client waits forever for a service response (GetParameters.srv, SetParameters.srv, or ListParameters.srv). It looks to me like the service request always makes it to the service server (which successfully makes the rcl call to send the response), but the response sometimes never reaches the client.

So far, I've only been focused on Fast-RTPS, which is the RMW we see failing on the nightlies.

Also, I've only seen the *_sync tests fail. I have not witnessed a test with an AsyncParametersClient fail to get a response.
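
For reference, here is a minimal sketch of the two client patterns being compared. This is not the actual test_rclcpp code; the remote node name "parameter_server" and the parameter name "foo" are placeholders. The point it illustrates: the synchronous client blocks inside get_parameters() until the service reply arrives, so a dropped reply shows up as a hang, while the asynchronous client hands the caller a future that can be waited on with a timeout.

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("parameter_client_sketch");

  // Synchronous client: get_parameters() spins internally and does not
  // return until the GetParameters service response arrives. If the
  // response is lost, this call never returns -- the hang seen in the
  // *_sync tests.
  auto sync_client = std::make_shared<rclcpp::SyncParametersClient>(node, "parameter_server");
  auto values = sync_client->get_parameters({"foo"});
  (void)values;

  // Asynchronous client: the caller owns the wait and can bound it, so a
  // lost response surfaces as a timeout instead of an infinite block.
  auto async_client = std::make_shared<rclcpp::AsyncParametersClient>(node, "parameter_server");
  auto future = async_client->get_parameters({"foo"});
  rclcpp::spin_until_future_complete(node, future, std::chrono::seconds(5));

  rclcpp::shutdown();
  return 0;
}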

jacobperron added the bug label Apr 13, 2020
jacobperron (Member, Author) commented Apr 13, 2020

This reminded me of an outstanding issue we have with services: ros2/demos#304

Indeed, I can reproduce what's happening in test_parameter_server_cpp with our demos.

Start a server:

ros2 run demo_nodes_cpp add_two_ints_server

And repeatedly send requests:

ros2 run demo_nodes_cpp add_two_ints_client

The server receives the request every time, but the client only gets the response some of the time.
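
For context, the client side of that demo boils down to the usual async_send_request / spin_until_future_complete pattern; the sketch below is a simplified stand-in, not the actual demo_nodes_cpp source. The server logging the incoming request while this future never completes is exactly the symptom described above.

#include <memory>

#include "example_interfaces/srv/add_two_ints.hpp"
#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("add_two_ints_client_sketch");
  auto client = node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");
  client->wait_for_service();

  auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
  request->a = 2;
  request->b = 3;

  // The server always logs the incoming request; whether this future ever
  // completes is what becomes flaky when the reply is dropped.
  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, future) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(
      node->get_logger(), "sum: %lld",
      static_cast<long long>(future.get()->sum));
  } else {
    RCLCPP_ERROR(node->get_logger(), "no response received");
  }

  rclcpp::shutdown();
  return 0;
}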

I can also reproduce this issue with CycloneDDS, although it happens much less frequently.

ivanpauno (Member) commented

hidmic (Contributor) commented May 1, 2020

@ivanpauno I'd think issues in CycloneDDS are related to ros2/rmw_cyclonedds#74.

ivanpauno (Member) commented

I originally thought the same, but then I discovered ros2/rclcpp#1101.
It's still unrelated to the Fast-RTPS failures, and what is described in ros2/rmw_cyclonedds#74 is still a possible race condition.

hidmic (Contributor) commented May 1, 2020

Alright, it's not a race in SPDP comms, and the service reply change set is being sent and does arrive.

I can't tell what the root cause of the problem is yet, but each and every time the client hangs, the WriterProxy::received_change_set call on the reply change set shows:

seq_num > max_sequence_number > changes_from_writer_low_mark_

It'd almost seem like the max_sequence_number change set never arrives, causing changes_from_writer_low_mark_ to remain the same and WriterProxy::next_cache_change_to_be_notified to return SequenceNumber_t::unknown() from then on. We never get the notification that the reply message arrived.
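
To make that failure mode concrete, here is a toy model of the "low mark" bookkeeping described above. It is illustrative only, not Fast-RTPS source; the struct and member names are invented. It shows why a gap that is never filled stalls notification of every later sample, including the service reply.

#include <cassert>
#include <cstdint>
#include <set>

// Toy model of a reliable reader's per-writer bookkeeping: low_mark is the
// highest sequence number whose predecessors have all been handled; samples
// above it wait in 'pending' until the gap below them is filled.
struct ReaderBookkeepingModel
{
  int64_t low_mark = 0;
  std::set<int64_t> pending;

  void on_received(int64_t seq_num)
  {
    pending.insert(seq_num);
  }

  // Returns the next sample that can be handed to the user (e.g. the
  // waiting service client), or 0 when the sample just above low_mark is
  // still missing -- the analogue of next_cache_change_to_be_notified()
  // returning SequenceNumber_t::unknown().
  int64_t next_to_notify()
  {
    if (!pending.empty() && *pending.begin() == low_mark + 1) {
      low_mark = *pending.begin();
      pending.erase(pending.begin());
      return low_mark;
    }
    return 0;
  }
};

int main()
{
  ReaderBookkeepingModel reader;

  // Samples 1..3 arrive in order and are notified as expected.
  for (int64_t s : {1, 2, 3}) {
    reader.on_received(s);
    int64_t notified = reader.next_to_notify();
    assert(notified == s);
    (void)notified;
  }

  // Sample 4 is lost and never declared gone; the reply arrives as sample 5.
  reader.on_received(5);

  // From now on nothing is ever notified, so the reply at 5 never reaches
  // the waiting client -- the hang described above.
  int64_t next = reader.next_to_notify();
  assert(next == 0);
  (void)next;
  return 0;
}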

One interesting and annoying detail is that hooking up gdb makes this go away, so it's definitely a race. I'll keep digging.

cc @richiware @MiguelCompany

MiguelCompany commented

@hidmic Thanks for the in-depth analysis.

This seems like the same issue we were trying to solve with eProsima/Fast-DDS#1190, which, in particular:

"Fixes late-joining reliable volatile readers never notifying samples on some cases"

hidmic (Contributor) commented May 4, 2020

@MiguelCompany indeed. I can no longer reproduce after cherry-picking those changes into 1.10.x. Any chance to backport it? CC @jacobperron, as we may need to re-release some packages. (Never mind the previous sentence; we updated to 2.0.x last week.)

hidmic (Contributor) commented May 5, 2020

Considering that recent repeated OSX nightlies no longer show this flake (https://ci.ros2.org/view/nightly/job/nightly_osx_repeated/1941 or https://ci.ros2.org/view/nightly/job/nightly_osx_repeated/1943 for instance), I'll close this.
