Investigate adding in explicit IPC for Nav2 nodes again #4691

Open
SteveMacenski opened this issue Sep 25, 2024 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@SteveMacenski
Member

Previously, we didn't do this since IPC required the use of only the default QoS profiles, and we use several other profiles in Nav2 for various things. We should investigate whether this limitation has been overcome so that we can now use IPC.

Additionally, Nav2 by default launches with rviz2 and application nodes that may subscribe to topics outside of the container. In this case, these performance improvements are negated, but I suppose it's worthwhile to include unless there's some reason not to. I believe that only the topics communicating outside of the process break IPC, but it's worth checking to make sure the entire node isn't broken out of IPC for a debug logging topic or something.

It would be great to get an update from @clalancette on these thoughts -- then execute as it is seen fit.

SteveMacenski added the enhancement and good first issue labels on Sep 25, 2024
@clalancette
Contributor

It would be great to get an update from @clalancette on these thoughts -- then execute as it is seen fit.

So during the Jazzy development cycle, we ended up merging in ros2/rclcpp#2303, which added transient_local functionality to the intra-process manager. However, as far as I understand, the intra-process manager still doesn't support the full range of QoS options.

Still, if transient_local functionality is enough for your use case, you could consider turning it on. There are two different ways to turn it on (both sketched below):

  1. You can pass an rclcpp::NodeOptions when constructing the node that will create all publishers and subscriptions (including the built-in ones) as intra-process.
  2. When you create individual publishers and subscriptions, you can turn on intra-process mode for just that entity.
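For reference, a minimal sketch of both approaches (node, topic, and entity names here are illustrative, not Nav2 code):

  #include <memory>
  #include "rclcpp/rclcpp.hpp"
  #include "std_msgs/msg/string.hpp"

  int main(int argc, char ** argv)
  {
    rclcpp::init(argc, argv);

    // (1) Node-wide: every publisher/subscription this node creates
    // (including built-in ones) defaults to intra-process.
    auto node = std::make_shared<rclcpp::Node>(
      "ipc_example", rclcpp::NodeOptions().use_intra_process_comms(true));

    // (2) Per-entity: opt an individual publisher/subscription in (or out),
    // overriding the node default.
    rclcpp::PublisherOptions pub_options;
    pub_options.use_intra_process_comm = rclcpp::IntraProcessSetting::Enable;
    auto pub = node->create_publisher<std_msgs::msg::String>("chatter", 10, pub_options);

    rclcpp::SubscriptionOptions sub_options;
    sub_options.use_intra_process_comm = rclcpp::IntraProcessSetting::Enable;
    auto sub = node->create_subscription<std_msgs::msg::String>(
      "chatter", 10,
      [](std_msgs::msg::String::UniquePtr msg) { (void)msg; },
      sub_options);

    rclcpp::shutdown();
    return 0;
  }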

Additionally, Nav2 by default launches with rviz2 and application nodes that may subscribe to topics outside of the container. In this case, these performance improvements are negated, but I suppose it's worthwhile to include unless there's some reason not to. I believe that only the topics communicating outside of the process break IPC, but it's worth checking to make sure the entire node isn't broken out of IPC for a debug logging topic or something.

The intra-process manager is fairly sophisticated. If a publisher and subscription on the same topic with the same type are created within the same Context, and intra-process is turned on for both of them, then it will communicate between them using only intra-process. If any one of those conditions is false, it won't use it. However, even in the case where intra-process is used, it still always creates a DDS entity for the outside world to discover. That's so that things like ros2 topic list still show the topic. In the case where you have an intra-process publisher, an intra-process subscription, and a non-intra-process subscription, then what happens is that intra-process is still used between the publisher and the intra-process subscription, and a copy is used to send out to the non-intra-process subscription.

All of that is to say that the intra-process manager will do the correct thing in all situations, though the efficiency goes down somewhat in situations other than a single intra-process publisher and a single intra-process subscription.
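As an aside, a minimal way to see the intra-process path in action (a sketch following the standard intra-process demo pattern; names here are illustrative, not Nav2 code) is to publish a std::unique_ptr message and check whether the subscription receives the very same allocation, which is what happens when there is exactly one intra-process subscription:

  #include <cstdio>
  #include <memory>
  #include "rclcpp/rclcpp.hpp"
  #include "std_msgs/msg/int32.hpp"

  int main(int argc, char ** argv)
  {
    rclcpp::init(argc, argv);
    auto options = rclcpp::NodeOptions().use_intra_process_comms(true);
    auto pub_node = std::make_shared<rclcpp::Node>("ipc_pub", options);
    auto sub_node = std::make_shared<rclcpp::Node>("ipc_sub", options);

    auto sub = sub_node->create_subscription<std_msgs::msg::Int32>(
      "ipc_topic", 10,
      [](std_msgs::msg::Int32::UniquePtr msg) {
        // With a single intra-process subscription this should be the same
        // allocation the publisher created -- no serialization involved.
        printf("received %d at %p\n", msg->data, static_cast<void *>(msg.get()));
      });

    auto pub = pub_node->create_publisher<std_msgs::msg::Int32>("ipc_topic", 10);
    auto msg = std::make_unique<std_msgs::msg::Int32>();
    msg->data = 42;
    printf("publishing %d from %p\n", msg->data, static_cast<void *>(msg.get()));
    pub->publish(std::move(msg));

    rclcpp::executors::SingleThreadedExecutor exec;
    exec.add_node(pub_node);
    exec.add_node(sub_node);
    exec.spin_some();
    rclcpp::shutdown();
    return 0;
  }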

@SteveMacenski
Member Author

still doesn't support the full range of QoS options.

Are the 'main' four (history, durability, depth, reliability) fully covered? Throughout Nav2 we use all the options for those four in various places, but we're not sensitive about deadline, lifespan, liveliness, or lease duration (yet, at least).
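For context, a few illustrative combinations of those four settings, of the kind used across Nav2 (hypothetical names, not an exhaustive list):

  #include "rclcpp/rclcpp.hpp"

  // History + depth + reliability + durability combined in different ways:
  auto latched_qos  = rclcpp::QoS(rclcpp::KeepLast(1)).reliable().transient_local();
  auto sensor_qos   = rclcpp::QoS(rclcpp::KeepLast(10)).best_effort().durability_volatile();
  auto keep_all_qos = rclcpp::QoS(rclcpp::KeepAll()).reliable();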

In the case where you have an intra-process publisher, an intra-process subscription, and a non-intra-process subscription, then what happens is that intra-process is still used between the publisher and the intra-process subscription, and a copy is used to send out to the non-intra-process subscription.
All of that is to say that the intra-process manager will do the correct thing in all situations, though the efficiency goes down somewhat in situations other than a single intra-process publisher and a single intra-process subscription.

I believe then that I was either initially misinformed or something has changed since I last looked -- was it at any point true that if a subscription was outside of IPC (intra-), it changed all subscriptions to be inter-process?

Follow-up question: if we have 2 subscriptions, one intra-process and one inter-process, is the performance any worse on the system with (intra-) IPC enabled than if we did not enable it at all? I assume not, but the wording of that response makes me want to clarify whether there's any reason not to always enable IPC now.

It also used to be true that IPC (intra-) would throw if a sub/pub was using QoS that IPC couldn't handle. Is that still the case, or can we now enable IPC (intra-) globally in the software and it'll use it where it can, but fall back to inter-process (though still with composition benefits) where it cannot?

Thanks for the follow-up :-) It looks like IPC has made some great gains since I last looked at this in the early Foxy days. We got such good performance boosts from composition, but since the QoS would throw for non-default profiles, we weren't able to move further. It sounds like we might be able to fully embrace IPC across the board now for another joyous performance boost!

@SteveMacenski
Member Author

SteveMacenski commented Dec 12, 2024

I started on this today and ran into some problems. I filed a ticket for better logging on IPC-specific errors, since it's really difficult to find where they're coming from in applications with 20+ subscribers and publishers: ros2/rclcpp#2703

I believe the system default QoS doesn't actually set the depth to > 0, which causes a problem with IPC. I think that should probably be updated to force it to be > 0 so that folks don't run into this issue. It feels like a not-awesome user experience to have to manually adjust this (rclcpp::SystemDefaultsQoS().keep_last(1)) for what should be the "default" commonly-used QoS profile. I still need to get around that globally, but after fixing those instances for a particular node, I needed to move on. I'm not sure why this one case is such a problem when others using SystemDefaultsQoS work fine, but the next issue is:

transient_local publisher needs to pass a valid publisher buffer ptr when calling add_publisher()

I'm not 100% sure what's going on there, since I thought Jazzy/Rolling supported transient_local QoS now (I'm on Rolling for this work). It appears to be coming from the following, which is interesting since it's not even transient_local QoS. Removing this line, or changing the QoS to a modified non-default-QoS setting, makes the warning disappear. Though, there is some transient_local QoS immediately after this line (see collapsed snippet below).

Code Snippet
  auto custom_qos = rclcpp::QoS(rclcpp::KeepLast(1)).transient_local().reliable();
  costmap_pub_ = node->create_publisher<nav_msgs::msg::OccupancyGrid>(
    topic_name, custom_qos);
  costmap_raw_pub_ = node->create_publisher<nav2_msgs::msg::Costmap>(
    topic_name + "_raw", custom_qos);
  costmap_update_pub_ = node->create_publisher<map_msgs::msg::OccupancyGridUpdate>(
    topic_name + "_updates", custom_qos);
  costmap_raw_update_pub_ = node->create_publisher<nav2_msgs::msg::CostmapUpdate>(
    topic_name + "_raw_updates", custom_qos);
  auto qos = rclcpp::SystemDefaultsQoS().keep_last(1);
  footprint_pub_ = create_publisher<geometry_msgs::msg::PolygonStamped>(
    "published_footprint", qos);

Error's coming from here: https://github.com/ros2/rclcpp/blob/8c0161a07f1a44db023650d10f1b2e581c64e1f1/rclcpp/src/rclcpp/intra_process_manager.cpp#L48

On face value, the buffer looks like it should be valid from https://github.com/ros2/rclcpp/blob/rolling/rclcpp/include/rclcpp/publisher.hpp#L168-L172. I don't see anywhere in create_intra_process_buffer where it would return a nullptr without itself throwing, which would be the log I should see. Even setting aside anything I could be doing wrong, that's a state I don't see how I could logically enter.

Here's my working branch https://github.com/ros-navigation/navigation2/tree/ipc

@SteveMacenski
Member Author

SteveMacenski commented Dec 12, 2024

Moving past that since it's not blocking, though I found the root cause: the SystemDefaultsQoS profile works fine for subscriptions but does not for publishers. Filing ros2/rclcpp#2705

I generally see things working!

The only major issue is that the transient_local topics that are only published once are not responding to late-joining subscriptions to transfer information. For example, the map is only published once on startup as transient_local, but rviz doesn't consistently get it (depending on whether it was up in time to receive it), and ros2 topic echo /map --qos-durability transient_local only returns a value if it is running at the same time as the initial publication -- it never receives the map afterward.

Is transient_local publication tested to work with late-joining subscriptions in Rolling/Jazzy? It looks like a bug to me: ros2/rclcpp#2704
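For reference, a minimal sketch of the late-joining pattern being described (illustrative only; the actual reproduction is in the rclcpp ticket):

  #include <cstdio>
  #include <memory>
  #include "rclcpp/rclcpp.hpp"
  #include "std_msgs/msg/string.hpp"

  int main(int argc, char ** argv)
  {
    rclcpp::init(argc, argv);
    auto options = rclcpp::NodeOptions().use_intra_process_comms(true);
    auto node = std::make_shared<rclcpp::Node>("latched_test", options);

    auto qos = rclcpp::QoS(rclcpp::KeepLast(1)).transient_local().reliable();
    auto pub = node->create_publisher<std_msgs::msg::String>("map_like", qos);

    std_msgs::msg::String msg;
    msg.data = "published once at startup";
    pub->publish(msg);

    // Late-joining subscription, created after the single publication: with
    // transient_local it should still receive the last sample.
    auto sub = node->create_subscription<std_msgs::msg::String>(
      "map_like", qos,
      [](const std_msgs::msg::String & m) {
        printf("late joiner got: %s\n", m.data.c_str());
      });

    rclcpp::spin_some(node);
    rclcpp::shutdown();
    return 0;
  }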


TODO List

@SteveMacenski
Member Author

SteveMacenski commented Jan 7, 2025

From the PRs, I picked up work and continued on the ipc branch. Things generally look good after iterating through many of the algorithms and systems. However, I'm not able to achieve a clean system shutdown, which is causing all nav2_system_tests to fail without results due to server crashes while bringing down the IPC lifecycle nodes (note the docking server isn't special; it's just the first to be deactivated):

    [tester_node-18] [INFO] [1736203371.247419501] [nav2_tester]: *** GOAL REACHED ***
    [tester_node-18] [INFO] [1736203371.248001969] [nav2_tester]: Test PASSED
    [tester_node-18] [INFO] [1736203371.248542509] [nav2_tester]: Shutting down
    [controller_server-8] [INFO] [1736203371.267507440] [controller_server]: Received a goal, begin computing control effort.
    [lifecycle_manager-17] [INFO] [1736203371.276209290] [lifecycle_manager_navigation]: Terminating bond timer...
    [lifecycle_manager-17] [INFO] [1736203371.276404686] [lifecycle_manager_navigation]: Shutting down managed nodes...
    [lifecycle_manager-17] [INFO] [1736203371.276439542] [lifecycle_manager_navigation]: Deactivate, cleanup, and shutdown nodes
    [lifecycle_manager-17] [INFO] [1736203371.276467498] [lifecycle_manager_navigation]: Deactivating docking_server
    [tester_node-18] [INFO] [1736203371.277809765] [nav2_tester]: Shutting down navigation lifecycle manager...
    [opennav_docking-16] [INFO] [1736203371.278556647] [docking_server]: Deactivating docking_server
    [controller_server-8] [INFO] [1736203371.279730050] [controller_server]: Reached the goal!
    [bt_navigator-12] [INFO] [1736203371.297110721] [bt_navigator]: Goal succeeded
    [opennav_docking-16] [INFO] [1736203371.310202015] [docking_server]: Destroying bond (docking_server) to lifecycle manager.
    [ERROR] [opennav_docking-16]: process has died [pid 79857, exit code -11, cmd '/root/jazzy_ws/install/opennav_docking/lib/opennav_docking/opennav_docking --ros-args --log-level info --ros-args -r __node:=docking_server -r __ns:=/ -p use_sim_time:=True --params-file /tmp/launch_params_m5ysw27v -r /tf:=tf -r /tf_static:=tf_static'].
    [lifecycle_manager-17] [ERROR] [1736203373.412346510] [lifecycle_manager_navigation]: Failed to change state for node: docking_server. Exception: docking_server/get_state service client: async_send_request failed.
    [lifecycle_manager-17] [INFO] [1736203373.412487978] [lifecycle_manager_navigation]: Cleaning up docking_server

This only happens when IPC is enabled, and occurs both when composed into the same container and when each node is in a separate process (the logging above is from the separate-process tests). Our nav2_lifecycle_manager test test_bond also fails with the same issue at a unit level (https://app.circleci.com/pipelines/github/ros-navigation/navigation2/13306/workflows/4bcc5daa-4e7a-4048-b1f3-3d418d215043/jobs/39892/tests) that is broadly independent of Nav2, so it's not just a full-system situation or specific to any particular node implementation. I've been able to show that it crashes after exiting on_deactivate but before on_cleanup is called.

I think this is something that needs reporting to Bond or rclcpp; it looks to me like something in the shutdown procedure with intra-process comms isn't quite right. I think it's delivering a message after the object is destroyed.

Screenshot from 2025-01-06 16-13-06
Screenshot from 2025-01-06 16-23-19

@ewak

ewak commented Jan 10, 2025

See #4841 for some analysis, a potential minimal fix, and/or clues for further work.

@ewak

ewak commented Jan 17, 2025

I keep dabbling in this, can't help myself, like a puzzle. Just reporting in. I have not cracked it yet.

I am avoiding the segfault by fixing the use-after-free with the createSafeSubscriptionMemFuncCallback() mechanism in bondcpp. But as you know, the system tests fail to shut down cleanly.

I have a build of the latest, or very close to the latest, of the following repositories, with patches to bond_core and navigation2:

https://github.com/ros/bond_core.git
https://github.com/ros2/common_interfaces.git
https://github.com/ros2/geometry2.git
https://github.com/ros-navigation/nav2_minimal_turtlebot_simulation.git
https://github.com/ros-navigation/navigation2.git
https://github.com/ros2/rclcpp.git

I use tmux, so I have the following running in separate terminal windows inside Docker, with RMW_IMPLEMENTATION=rmw_cyclonedds_cpp set in the bash environment:

tmux split-window ros2 run tf2_ros static_transform_publisher --frame-id map --child-frame-id odom
tmux split-window ros2 run tf2_ros static_transform_publisher --frame-id odom --child-frame-id base_link --x 10
tmux split-window ros2 launch nav2_bringup tb3_loopback_simulation.launch.py use_rviz:=False

Notice that I am using the tb3_loopback_simulation to avoid the gazebo complexity.

I am now playing around with reproducing, manually, what the system tests do to bring down the system when the tests are finished.

ros2 service call /lifecycle_manager_navigation/manage_nodes nav2_msgs/srv/ManageLifecycleNodes command:\ 4
ros2 service call /lifecycle_manager_map_server/manage_nodes nav2_msgs/srv/ManageLifecycleNodes command:\ 4

NOTE: tb3_loopback_simulation.launch.py is different from the Gazebo-based tb3_simulation_launch.py in a potentially interesting way. The bulk of the Nav2 nodes run inside a component_container_isolated, but the loopback simulation has an independent map_server and lifecycle_manager.

They both behave in the same way (i.e. stay alive when they should have shut down). IOW, this may rule out component_container_isolated as the culprit.

After startup, confirm all lifecycle nodes have transitioned to active.

root@dev-lt-wakem:~/nav2_ws# for i in $(ros2 lifecycle nodes); do echo $i; ros2 lifecycle get $i;  done
/behavior_server
active [3]
/bt_navigator
active [3]
/collision_monitor
active [3]
/controller_server
active [3]
/docking_server
active [3]
/global_costmap/global_costmap
active [3]
/local_costmap/local_costmap
active [3]
/map_server
active [3]
/planner_server
active [3]
/smoother_server
active [3]
/velocity_smoother
active [3]
/waypoint_follower
active [3]

Perform the request to the lifecycle_manager_map_server to shut down the /map_server

root@dev-lt-wakem:~/nav2_ws# ros2 service call /lifecycle_manager_map_server/manage_nodes nav2_msgs/srv/ManageLifecycleNodes command:\ 4
requester: making request: nav2_msgs.srv.ManageLifecycleNodes_Request(command=4)

response:
nav2_msgs.srv.ManageLifecycleNodes_Response(success=True)

As expected the map_server has transitioned to finalized.

root@dev-lt-wakem:~/nav2_ws# for i in $(ros2 lifecycle nodes); do echo $i; ros2 lifecycle get $i;  done
/behavior_server
active [3]
/bt_navigator
active [3]
/collision_monitor
active [3]
/controller_server
active [3]
/docking_server
active [3]
/global_costmap/global_costmap
active [3]
/local_costmap/local_costmap
active [3]
/map_server
finalized [4]  <-------------------------
/planner_server
active [3]
/smoother_server
active [3]
/velocity_smoother
active [3]
/waypoint_follower
active [3]

But the processes for map_server and the lifecycle manager are still running.

ps aux | egrep 'map_server|lifecycle_manager_map_server'
/nav2_ws/install/nav2_map_server/lib/nav2_map_server/map_server --ros-args -r __node:=map_server -p use_sim_time:=True --params-file /tmp/launch_params_e0bmy5ue --params-file /tmp/launch_params_dy2lv_x7 -r /tf:=tf -r /tf_static:=tf_static
  98335 pts/4    Sl+    0:07 /root/nav2_ws/install/nav2_lifecycle_manager/lib/nav2_lifecycle_manager/lifecycle_manager --ros-args -r __node:=lifecycle_manager_map_server -p use_sim_time:=True --params-file /tmp/launch_params_e0bmy5ue --params-file /tmp/launch_params_ah0vagob --params-file /tmp/launch_params_1p_k3cy9

Attaching a debugger to the lifecycle manager 98335

root@dev-lt-wakem:~/nav2_ws# gdb /root/nav2_ws/install/nav2_lifecycle_manager/lib/nav2_lifecycle_manager/lifecycle_manager 98335
... blah ...
(gdb) bt
#0  0x000073581e408d61 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000073581e40b7dd in pthread_cond_wait () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000073581c3125dd in ddsrt_cond_wait () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#3  0x000073581c312695 in ddsrt_cond_waituntil () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#4  0x000073581c2f8b39 in ?? () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#5  0x000073581c3ac885 in rmw_wait () from /opt/ros/rolling/lib/librmw_cyclonedds_cpp.so
#6  0x000073581e201a7c in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#7  0x000073581f33ca29 in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7ffdee111608, time_to_wait_ns=std::chrono::duration = { -1ns }, rebuild_rcl_wait_set=..., get_rcl_wait_set=..., create_wait_result=...) at /root/nav2_ws/src/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
#8  0x000073581f334f55 in rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage>::wait<long, std::ratio<1l, 1000000000l> > (this=0x7ffdee111608, time_to_wait=std::chrono::duration = { -1ns }) at /root/nav2_ws/src/rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp:712
#9  0x000073581f31f8e1 in rclcpp::Executor::wait_for_work (this=0x7ffdee1113a0, timeout=std::chrono::duration = { -1ns }) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:755
#10 0x000073581f320a16 in rclcpp::Executor::get_next_executable (this=0x7ffdee1113a0, any_executable=..., timeout=std::chrono::duration = { -1ns }) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:907
#11 0x000073581f3765f3 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7ffdee1113a0) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:41
#12 0x000073581f35dd4d in rclcpp::spin (node_ptr=std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface> (use count 33, weak count 3) = {...}) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executors.cpp:56
#13 0x000073581f35de98 in rclcpp::spin (node_ptr=warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<nav2_lifecycle_manager::LifecycleManager, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<nav2_lifecycle_manager::LifecycleManager, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr<rclcpp::Node> (use count 2, weak count 1) = {...}) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executors.cpp:63
#14 0x000061a59cbbf451 in main ()

Attaching a debugger to the map_server

root@dev-lt-wakem:~/nav2_ws# gdb /root/nav2_ws/install/nav2_map_server/lib/nav2_map_server/map_server 98334
... blah ...
(gdb) bt
#0  0x000073702119fd61 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007370211a27dd in pthread_cond_wait () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000073701c1895dd in ddsrt_cond_wait () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#3  0x000073701c189695 in ddsrt_cond_waituntil () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#4  0x000073701c16fb39 in ?? () from /opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0
#5  0x000073701c223885 in rmw_wait () from /opt/ros/rolling/lib/librmw_cyclonedds_cpp.so
#6  0x0000737020f41a7c in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#7  0x000073702213ca29 in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7ffd462e4be8, time_to_wait_ns=std::chrono::duration = { -1ns }, 
    rebuild_rcl_wait_set=..., get_rcl_wait_set=..., create_wait_result=...) at /root/nav2_ws/src/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
#8  0x0000737022134f55 in rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage>::wait<long, std::ratio<1l, 1000000000l> > (
    this=0x7ffd462e4be8, time_to_wait=std::chrono::duration = { -1ns }) at /root/nav2_ws/src/rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp:712
#9  0x000073702211f8e1 in rclcpp::Executor::wait_for_work (this=0x7ffd462e4980, timeout=std::chrono::duration = { -1ns }) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:755
#10 0x0000737022120a16 in rclcpp::Executor::get_next_executable (this=0x7ffd462e4980, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:907
#11 0x00007370221765f3 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7ffd462e4980) at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:41
#12 0x000073702215dd4d in rclcpp::spin (node_ptr=std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface> (use count 37, weak count 3) = {...})
    at /root/nav2_ws/src/rclcpp/rclcpp/src/rclcpp/executors.cpp:56

The lifecycle manager is blocked here: https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp#L280. IOW, time_left_to_wait_ns == -1ns.

This is a default setting for get_next_executable() https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/include/rclcpp/executor.hpp#L536

That is, it is not set to a value that would allow the loop in SingleThreadedExecutor::spin() to continue to function: https://github.com/ros2/rclcpp/blob/2d1b770e858fdb20e9c25913341fb388100dd19e/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp#L41

Some local variables

(gdb) set print pretty on
(gdb) frame 7
#7  0x000073581f33ca29 in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7ffdee111608, time_to_wait_ns=std::chrono::duration = { -1ns }, rebuild_rcl_wait_set=..., get_rcl_wait_set=..., create_wait_result=...) at /root/nav2_ws/src/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
280           rcl_ret_t ret = rcl_wait(&rcl_wait_set, time_left_to_wait_ns.count());
(gdb) p rcl_wait_set
$10 = (rcl_wait_set_t &) @0x7ffdee111608: {
  subscriptions = 0x61a59dc9eb50,
  size_of_subscriptions = 2,
  guard_conditions = 0x61a59dc8c860,
  size_of_guard_conditions = 10,
  timers = 0x61a59dc8c550,
  size_of_timers = 5,
  clients = 0x0,
  size_of_clients = 0,
  services = 0x61a59dc9ec40,
  size_of_services = 7,
  events = 0x61a59dc94af0,
  size_of_events = 10,
  impl = 0x61a59dc9b700
}
(gdb) p time_left_to_wait_ns 
$11 = std::chrono::duration = { -1ns }

@SteveMacenski
Member Author

I keep dabbling in this, can't help myself, like a puzzle. Just reporting in. I have not cracked it yet.

I do that myself often 😉

NOTE: tb3_loopback_simulation.launch.py is different from the Gazebo-based tb3_simulation_launch.py in a potentially interesting way. The bulk of the Nav2 nodes run inside a component_container_isolated, but the loopback simulation has an independent map_server and lifecycle_manager.

We do that to exercise both the main() nodes and the component nodes. There's no technical reason one is done one way and the other another, other than having full coverage to uncover issues that might be the result of composition or on-network DDS.

IOW, this may rule out component_container_isolated as the culprit.

This kind of thing 😄


I might link that analysis into the rclcpp ticket; that's probably good debugging for the maintainers to point to (another?) issue.
