Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bad rmw_time_t to nanoseconds conversion. #555

Merged
merged 4 commits into from
May 7, 2020

Conversation

hidmic
Copy link
Contributor

@hidmic hidmic commented May 5, 2020

Fixes ros2/ros2cli#503 (i.e. the failure, not the hang). rmw_time_t instances were being forced into an rcutils_time_duration_t value (actually, first into a single uint64_t total nanosecond count, then into an unsigned PY_LONG_LONG, and then into an rcutils_time_duration_t i.e. an int64_t), causing overflows.


This happened in the first place because, RMW graph APIs are not currently bound to return anything reasonable for QoS related durations that do not make sense in a given context. That is the case for data reader (subscribers' backbone) lifespans in rmw_connext_cpp (see here and here), which come with whatever value was in the stack at the moment that rmw_qos_profile_t variable enters scope.

I see there's some discussion in ros2/rmw#210 about introducing some consistency in rmw_time_t values. I'll defer to them on what to do about RMWs in general.

@hidmic
Copy link
Contributor Author

hidmic commented May 5, 2020

FYI @emersonknapp.

@dirk-thomas
Copy link
Member

I don't think I have the necessary context for a meaningful review. It seems wrong to me to just map "invalid" input to some value without any warning / error.

This happened in the first place because, RMW graph APIs are not currently bound to return anything reasonable for QoS related durations that do not make sense in a given context.

Why can't the rmw API not be forced to set reasonable values in these cases?

@hidmic
Copy link
Contributor Author

hidmic commented May 5, 2020

It seems wrong to me to just map "invalid" input to some value without any warning / error.

We can certainly add a warning if overflow would ensue, though that'll add noise for RMWs that can potentially return garbage.

Why can't the rmw API not be forced to set reasonable values in these cases?

I believe there's no consensus as to what would be a reasonable value in some cases. There're some who even think that rmw_time_t doesn't need to exist (see ros2/rmw#215). Zeroing values or defining an INFINITY would make sense to me, but I have not been actively involved in the design to make that call.

In any case, this patch is still necessary.

@jacobperron
Copy link
Member

In any case, this patch is still necessary.

Makes sense to guard against overflow to me, since we're casting from uint64_t to int64_t.

@hidmic hidmic requested review from emersonknapp and removed request for dirk-thomas May 5, 2020 19:29
@dirk-thomas
Copy link
Member

RMWs that can potentially return garbage.

Imo that is simply unacceptable API. Do RMW implementations actually return uninitialized data? If yes, this might open us up to undefined which we should really avoid.

@emersonknapp
Copy link
Contributor

I don't think that they return unitialized values, but we don't currently have a defined constant. fastrtps and connext return one value for "unspecified", and cyclone has some whole other constant that it seems to use for that value. ros2/rmw#210 is an attempt to discuss some ROS2-global constants that we would dictate in the API and expect rmw implementations to use.

Copy link
Member

@dirk-thomas dirk-thomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of working around the problem here the RMW implementations should be updated to return a defined value.

The only logic which should be present here is to check valid boundaries and show an error / raise an exception when that is violated.

@hidmic
Copy link
Contributor Author

hidmic commented May 5, 2020

I don't think that they return unitialized values

That's not exactly true. Graph cache in rmw_connext_cpp gets populated with QoS data here, for which it uses dds_remote_qos_to_rmw_qos which in turn may use an empty dds_qos_lifespan_to_rmw_qos_lifespan implementation. As the first rmw_qos_profile_t instance isn't initialized, nowhere is. The same goes for any QoS getter that doesn't initialize that duration prior to function call.

fastrtps and connext return one value for "unspecified"

Which value? I don't have a strong opinion against zero initialization though.

Instead of working around the problem here the RMW implementations should be updated to return a defined value.

This patch is orthogonal to that change.

The only logic which should be present here is to check valid boundaries and show an error / raise an exception when that is violated.

We can raise an exception if the uint64_t to int64_t conversion is not feasible, sure. But bear in mind that unreasonable values are not invalid values, type wise. We've set ourselves up for a problem by introducing rmw_time_t in the first place IMHO.

@hidmic
Copy link
Contributor Author

hidmic commented May 5, 2020

See ros2/rmw_connext#422 and 0ef805b. It's not a silver bullet for the rmw_time_t <-> rcutils_time_point_t problem, but it's an improvement.

@emersonknapp
Copy link
Contributor

emersonknapp commented May 5, 2020

Which value? I don't have a strong opinion against zero initialization though.

I don't have it up and running in front of me right now, but I was seeing (from ros2/rosbag2#335 (comment))

FastRTPS seems to always return the following when a QoS policy duration was unspecified (e.g. {0, 0} passed in)

sec: 0x7FFFFFFF 
nsec: 0xFFFFFFFF

Cyclone seems to always return the following when a QoS policy duration was unspecified (e.g. {0, 0} passed in) maybe this was an uninitialized value

sec = 9223372036 == 0x225C17D04 
nsec = 854775807 == 0x32F2D7FF

@hidmic
Copy link
Contributor Author

hidmic commented May 6, 2020

CI up to ros2cli, rclpy, and rmw_connext (when possible):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status (unrelated test failures)

@hidmic hidmic requested a review from dirk-thomas May 6, 2020 13:56
Signed-off-by: Michel Hidalgo <[email protected]>
@dirk-thomas
Copy link
Member

How about an alternative: construct the Python Duration with two arguments - seconds and nanoseconds - as provided by rmw_time_t and instead perform the check if the value matches the bounds of the underlying C type in rclpy_create_duration()?

@hidmic
Copy link
Contributor Author

hidmic commented May 7, 2020

@dirk-thomas PTAL

@dirk-thomas
Copy link
Member

instead perform the check if the value matches the bounds of the underlying C type in rclpy_create_duration()?

Does something need to be changed in this function?

@hidmic
Copy link
Contributor Author

hidmic commented May 7, 2020

Does something need to be changed in this function?

@dirk-thomas No need to. rclpy.Duration will perform the summation and then rclpy_create_duration() will attempt to parse that into a C long. If the latter operation would overflow, it'll raise.

@hidmic hidmic merged commit 3504f36 into master May 7, 2020
@delete-merged-branch delete-merged-branch bot deleted the hidmic/fix-bad-rmw_time-to-ns branch May 7, 2020 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ros2cli daemon test hanging with Connext
4 participants