Investigate adding in explicit IPC for Nav2 nodes again #4691
Comments
So during the Jazzy development cycle, we ended up merging in ros2/rclcpp#2303 , which added Still, if transient_local functionality is enough for your use-case, you could consider turning it on. There are 2 different ways to turn it on:
The intra-process manager is fairly sophisticated. If a publisher and subscription on the same topic with the same type are created within the same Context, and intra-process is turned on for both of them, then it will communicate between them using only intra-process. If any one of those conditions is false, it won't use it. However, even in the case where intra-process is used, it still always creates a DDS entity for the outside world to discover. That's so that things like All of that is to say that the intra-process manager will do the correct thing in all situations, though the efficiency goes down somewhat in situations other than a single intra-process publisher and single intra-process subscription.
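As a rough illustration of the rule described above, here is a toy Python model (not rclcpp code). `Endpoint`, `context_id`, `uses_intra_process` and `delivery_paths` are hypothetical names invented for this sketch, and the per-pair decision is an assumption drawn from the description rather than from the rclcpp source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Toy stand-in for a publisher or subscription (not an rclcpp type)."""
    topic: str
    msg_type: str
    context_id: int       # hypothetical identifier for the owning Context
    intra_process: bool   # whether intra-process was enabled on this endpoint


def uses_intra_process(pub: Endpoint, sub: Endpoint) -> bool:
    """Return True only when every condition listed in the comment holds."""
    return (
        pub.topic == sub.topic
        and pub.msg_type == sub.msg_type
        and pub.context_id == sub.context_id
        and pub.intra_process
        and sub.intra_process
    )


def delivery_paths(pub, subs):
    """Label each subscription 'intra' or 'inter', decided pair by pair."""
    return ["intra" if uses_intra_process(pub, s) else "inter" for s in subs]
```

Under this model, a composed subscription with intra-process enabled keeps the intra-process path even when a second subscription (for example rviz in another process) receives the same topic over the inter-process path.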
Are the 'main' 4 (history, durability, depth, reliability) fully covered? We use throughout Nav2 all the options for those four in various places, but not sensitive about the deadline, lifespan, liveliness, or lease duration (yet, at least).
I believe then I was either initially misinformed or something has changed since I last looked -- was it at any point true that if a subscription was outside of IPC (intra-), it changed all subscriptions to be inter-process? Follow-up question: If we have 2 subscriptions, one intra-process and one inter-process, is the performance any worse on the system by enabling (intra-) IPC than if we did not at all? I assume not, but the wording of that response makes me want to clarify if there's any reason not to always enable IPC now. It also used to be true that IPC (intra-) would throw if a sub/pub was using QoS that IPC couldn't handle. Is that still the case, or can we enable IPC (intra-) globally now in the software and it'll use it where it can, but fall back to inter-process (though still with composition benefits) where it cannot? Thanks for the follow up :-) It looks like IPC has made some great gains since I last looked at this in the early Foxy days. We got such good performance boosts from Composition, and the QoS would throw for the non-default profile, so we weren't able to move further. It sounds like we might be able to fully embrace IPC now across the board for another joyous performance boost!
I've started on this today and ran into some problems. I filed a ticket for better logging on IPC-specific errors, since it's really difficult to find where they're coming from in applications with 20+ subscribers and publishers: ros2/rclcpp#2703. I believe the system default QoS doesn't actually set the depth size to > 0, which causes a problem with IPC. I think that should probably be updated to force it to be > 0 so that folks don't run into this issue. It feels like a not-awesome user experience to have to manually adjust this (
which I'm not 100% sure what is going on there, since I thought Jazzy/Rolling supported transient_local QoS now (I'm on Rolling for this work). It appears to be coming from the following, which is interesting since it's not even transient local QoS. Removing this line or changing the QoS to a modified, non-default setting makes the warning disappear. Though, there is some transient_local QoS immediately after this line (see collapsed snippet below). Code Snippet
Error's coming from here: https://github.com/ros2/rclcpp/blob/8c0161a07f1a44db023650d10f1b2e581c64e1f1/rclcpp/src/rclcpp/intra_process_manager.cpp#L48 The buffer at face value looks like it should be valid from https://github.com/ros2/rclcpp/blob/rolling/rclcpp/include/rclcpp/publisher.hpp#L168-L172. I don't see anywhere in Here's my working branch https://github.com/ros-navigation/navigation2/tree/ipc
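To make the zero-depth problem mentioned in this comment concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not the rclcpp check itself: `QosSketch`, `check_intra_process_qos` and `with_min_depth` are hypothetical names, and the default `depth=0` only stands in for a profile that leaves the depth unset.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QosSketch:
    """Toy QoS profile; field names are illustrative, not the rclcpp API."""
    history: str = "keep_last"
    depth: int = 0  # assumption: models a profile that leaves depth unset


def check_intra_process_qos(qos):
    """Reject the one combination this sketch assumes intra-process refuses."""
    if qos.history == "keep_last" and qos.depth == 0:
        raise ValueError("zero history depth is not usable with intra-process")
    return qos


def with_min_depth(qos, min_depth=1):
    """Return a copy whose depth is at least min_depth (the manual workaround)."""
    return qos if qos.depth >= min_depth else replace(qos, depth=min_depth)
```

Calling `check_intra_process_qos(with_min_depth(QosSketch()))` passes, which mirrors the manual adjustment the comment would prefer users not have to make.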
Moving past that since it's not blocking (though I found the root cause), I generally see things working! The only major issue is that the transient local topics that are only published once are not responding to late-joining subscriptions to transfer information. For example, the map is only published once on startup as transient local, but rviz doesn't consistently get it (depending on whether it was up in time to receive it) and Is transient local publication tested to work on late joining subscriptions in Rolling/Jazzy? It looks like a bug to me: ros2/rclcpp#2704 TODO List
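For reference, the behaviour expected from a transient local topic with a late-joining subscription can be sketched as below. This is a toy Python model of the expectation only, not of how rclcpp or DDS implement it; `LatchedTopic` is a hypothetical name.

```python
from collections import deque


class LatchedTopic:
    """Toy topic that keeps the last `depth` messages for late joiners."""

    def __init__(self, depth=1):
        self._history = deque(maxlen=depth)  # oldest entries drop off first
        self._callbacks = []

    def publish(self, msg):
        """Store the message, then deliver it to current subscriptions."""
        self._history.append(msg)
        for callback in self._callbacks:
            callback(msg)

    def subscribe(self, callback):
        """Replay stored messages to the new subscription, then register it."""
        for msg in self._history:
            callback(msg)
        self._callbacks.append(callback)
```

In this model a map published once at startup still reaches a subscriber that joins afterwards, which is the behaviour the comment reports as missing when IPC is enabled.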
From the PRs, I picked up work and continued on the
This only happens when IPC is enabled, and occurs both when composed into the same container and when each node is in separate processes (logging above from separate process tests). Our I think this is something that needs reporting to Bond or rclcpp; it looks to me like something in the shutdown procedure with intra-process comms isn't quite right. I think it's delivering a message after the object is destroyed.
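The suspected failure mode, a callback being delivered after its owner is gone, and one common guard against it can be modelled as follows. This is a Python sketch of the lifetime pattern only: Python cannot reproduce a C++ use-after-free, and `Bus`, `Listener` and `make_safe_callback` are hypothetical names, not bondcpp or rclcpp APIs.

```python
import weakref


class Bus:
    """Toy message bus that calls every registered callback on publish."""

    def __init__(self):
        self._callbacks = []

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def publish(self, msg):
        """Return one bool per callback: True if the message was delivered."""
        return [callback(msg) for callback in self._callbacks]


def make_safe_callback(bound_method):
    """Wrap a bound method so it is skipped once its owner has been destroyed."""
    ref = weakref.WeakMethod(bound_method)  # does not keep the owner alive

    def safe(msg):
        method = ref()
        if method is None:   # owner already destroyed: drop the message
            return False
        method(msg)
        return True

    return safe


class Listener:
    """Toy subscriber that registers a guarded callback on construction."""

    def __init__(self, bus):
        self.received = []
        bus.subscribe(make_safe_callback(self.on_message))

    def on_message(self, msg):
        self.received.append(msg)
```

The design choice is that the bus holds only a weak reference to the listener, so a message arriving during or after shutdown is dropped instead of touching a destroyed object.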
See #4841 for some analysis and potential minimal fix and/or clues for further work.
I keep dabbling in this, can't help myself, like a puzzle. Just reporting in. I have not cracked it yet. I am avoiding the segfault by fixing the use-after-free using the createSafeSubscriptionMemFuncCallback() mechanism in bondcpp. But as you know, the system tests fail to shut down cleanly. I have a build of the latest, or very close to the latest, of the following repositories with patches to bond_core and navigation2
I use tmux so have the following running in separate terminal windows in a Docker with bash environment
Notice that I am using the tb3_loopback_simulation to avoid the gazebo complexity. I am now playing around with reproducing, manually, what the system tests do to bring down the system when the tests are finished.
NOTE: the tb3_loopback_simulation.launch.py is different from the gazebo-based tb3_simulation_launch.py in a potentially interesting way. The bulk of the nav2 nodes are run inside a component_container_isolated. But the loopback simulation has an independent map_server and lifecycle_manager. They both behave in the same way (i.e. stay alive when they should have shut down). IOW this may rule out component_container_isolated being a culprit. After startup, confirm all lifecycle nodes have been transitioned to active.
Perform the request to the lifecycle_manager_map_server to shut down the /map_server
As expected the map_server has transitioned to finalized.
But the processes for map_server and the lifecycle manager are still running.
Attaching a debugger to the lifecycle manager 98335
Attaching a debugger to the map_server
Lifecycle manager is blocked here https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp#L280 IOW time_left_to_wait == -1ns. This is a default setting for get_next_executable() https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/include/rclcpp/executor.hpp#L536 As in it is not set to a value that would allow the loop to continue to function in SingleThreadedExecutor::spin() https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp#L41 Some local variables
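The blocking behaviour described above can be modelled with a small Python sketch. It assumes that a negative timeout means "wait until something wakes the wait set", and it is not the rclcpp executor implementation; `WaitSetSketch`, `trigger` and `wait` are hypothetical names.

```python
import threading


class WaitSetSketch:
    """Toy wait set: a negative timeout blocks until trigger() is called."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ready = False

    def trigger(self):
        """Wake any waiter, e.g. on new work or on shutdown."""
        with self._cond:
            self._ready = True
            self._cond.notify_all()

    def wait(self, timeout_ns=-1):
        """Return True if woken by trigger(), False if the timeout expired."""
        timeout = None if timeout_ns < 0 else timeout_ns / 1e9
        with self._cond:
            woken = self._cond.wait_for(lambda: self._ready, timeout)
            self._ready = False  # consume the wake-up
            return woken
```

With `timeout_ns=-1` the call returns only if something calls `trigger()`. If shutdown never wakes the waiter, the thread stays blocked, which matches the symptom of processes that remain alive after their nodes are finalized.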
I do that myself often 😉
We do that to exercise both the
This kind of thing 😄 I might link that analysis into the rclcpp ticket, that's probably good debugging for the maintainers to point to (another?) issue
Previously, we didn't do this since IPC required the use of only the default QoS profiles and we use several others in Nav2 for various things. We should investigate if this challenge has been overcome now and we can use IPC.
Additionally, Nav2 by default launches with rviz2 and application nodes that may subscribe to topics outside of the container. In this case, these performance improvements are negated, but I suppose it is still worthwhile to include unless there's some reason not to. I believe that only the topics that are communicating outside of the process break IPC, but it's worth checking on that to make sure the entire node isn't broken out of IPC for a debug logging topic or something.
It would be great to get an update from @clalancette on these thoughts -- then execute as it is seen fit.