Hello all,

We are experiencing a stall in our ROS2 Humble nodes in the field after about three weeks of continuous operation.
The stall manifests as CPU usage pegged at 100% with ROS2 callbacks no longer being serviced, which points to a busy loop. I have traced the issue to the reference counter init_status in the ddsrt_init function overflowing to 0 after being continuously incremented past UINT32_MAX.
What is my theory?
When ROS2's spin() is called, execution eventually steps into the Executor._wait_for_ready_callbacks function, which builds a list of subscriptions, timers, waitables, etc. and puts them into a WaitSet object here. The creation path of this object goes through rcl_wait_set_init -> rmw_create_wait_set -> dds_create_waitset -> dds_init -> ddsrt_init and increments the reference counter init_status there. But the inverse does not hold: destroying the object through rcl_wait_set_fini -> rmw_destroy_wait_set never reaches ddsrt_fini, so the reference count never decreases, which contradicts this statement.
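To make the theory concrete, here is a simplified C model of the init_status pattern as I understand it. This is my reconstruction for illustration, not the verbatim CycloneDDS source; the INIT_STATUS_OK "fully initialized" high bit is taken from the ddsrt sources:

#include <stdatomic.h>
#include <stdint.h>

#define INIT_STATUS_OK 0x80000000u

static atomic_uint init_status;   /* starts at 0 */

/* Model of ddsrt_init(): every call bumps the counter. Once the first
   caller finishes one-time initialization and sets the high bit, the
   counter sits at 0x80000001 + (number of extra ddsrt_init calls). */
void ddsrt_init_model (void)
{
  uint32_t v = atomic_fetch_add (&init_status, 1) + 1;
  if (v > INIT_STATUS_OK)
    return;                        /* already initialized: fast path */
  if (v == 1) {
    /* ... one-time initialization ... */
    atomic_fetch_or (&init_status, INIT_STATUS_OK);
  }
}

/* Model of ddsrt_fini(): the matching decrement. If wait-set destruction
   never reaches this, the counter only ever grows. */
void ddsrt_fini_model (void)
{
  atomic_fetch_sub (&init_status, 1);
}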
What does this result in?
After (UINT32_MAX - 0x80000001u) / 1000 Hz (my node's callback frequency) ≈ 24 days, init_status wraps to 0 and we hit the retry loop here.
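Continuing the simplified model from above, here is my reading of why a wrapped counter turns into a busy loop (again a sketch of the logic, not the exact source): after the wrap v is 0, which is neither above INIT_STATUS_OK nor equal to 1, and the inner wait condition v > 1 is false, so the thread bounces on goto retry forever without sleeping or reloading the counter:

/* Back-of-the-envelope for the time to wrap:
   UINT32_MAX - 0x80000001u = 2147483646 increments, at ~1000
   ddsrt_init calls/s -> ~2147484 s, i.e. roughly 24-25 days. */
void ddsrt_init_retry_model (void)
{
  uint32_t v = atomic_fetch_add (&init_status, 1) + 1;
retry:
  if (v > INIT_STATUS_OK)
    return;                        /* normal post-init fast path */
  else if (v == 1) {
    /* ... one-time initialization ... */
    atomic_fetch_or (&init_status, INIT_STATUS_OK);
  }
  else {
    /* meant to wait for a concurrent initializer (the real code sleeps
       between reloads), but with v == 0 the condition is false, v is
       never reloaded, and goto retry spins at 100% CPU */
    while (v > 1 && !(v & INIT_STATUS_OK))
      v = atomic_load (&init_status);
    goto retry;
  }
}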
I would have expected this to happen to other nodes as well, but I think the prerequisite is a single node in a single container. If there were another node, it would have incremented the counter to 1, so the busy-wait state would only have appeared intermittently (not 100% sure about this part).
Reproduction steps?
Here is my docker setup:
FROM ros:humble
SHELL ["/bin/bash", "-c"]
WORKDIR /workspace
ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
RUN apt update && apt install -y python3-pip sqlite3 ros-humble-example-interfaces ssh gdb git ros-humble-rmw-cyclonedds-cpp ros-humble-cyclonedds-dbgsym ros-humble-rclcpp-dbgsym ros-humble-rmw-cyclonedds-cpp-dbgsym ros-$ROS_DISTRO-demo-nodes-py ros-$ROS_DISTRO-demo-nodes-cpp
COPY . .
# Set up SSH keys for GitHub access
RUN mkdir -p ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
# Install Python dependencies from requirements.txt
RUN --mount=type=ssh python3 -m pip install -r requirements.txt
# Build the workspace
RUN source /opt/ros/$ROS_DISTRO/setup.bash && colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug
I create two instances of the container, then run ros2 run demo_nodes_py listener in one and ros2 run demo_nodes_py talker in the other.
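For completeness, roughly how I build and start the two instances (the image and container names are placeholders; host networking is used so DDS discovery works between the containers, and --ssh default is needed for the --mount=type=ssh step in the Dockerfile):

DOCKER_BUILDKIT=1 docker build --ssh default -t ddsrt-overflow-repro .
docker run -d --network host --name talker ddsrt-overflow-repro ros2 run demo_nodes_py talker
docker run -d --network host --name listener ddsrt-overflow-repro ros2 run demo_nodes_py listener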
I inspect init_status using GDB + bpftrace and observe it incrementing without ever decrementing:
sudo bpftrace -e '
// Fires on each ddsrt_init() entry in PID 149514; the address is the
// runtime location of init_status, resolved with GDB (see below).
u:/proc/149514/root/opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0:ddsrt_init
{
    printf("Value of init_status: %u\n", *(uint32 *)0x745f3cfbf2e8);
}'
Value of init_status: 2147485586
Value of init_status: 2147485587
Value of init_status: 2147485588
Value of init_status: 2147485589
Value of init_status: 2147485590
Value of init_status: 2147485591
Value of init_status: 2147485592
Value of init_status: 2147485593
Value of init_status: 2147485594
Value of init_status: 2147485595
Value of init_status: 2147485596
Value of init_status: 2147485597
Value of init_status: 2147485598
Value of init_status: 2147485599
Value of init_status: 2147485600
Value of init_status: 2147485601
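For reference, this is roughly how I resolve the address used in the bpftrace probe and later overwrite the counter. A sketch: it assumes the -dbgsym packages installed above provide symbols for libddsc, and that init_status is ddsrt's struct-wrapped atomic with a .v field:

gdb -p 149514
(gdb) p &init_status
$1 = (ddsrt_atomic_uint32_t *) 0x745f3cfbf2e8
(gdb) set var init_status.v = 4294967290
(gdb) detach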
I use GDB to change init_status in memory to a value close to UINT32_MAX, as in the GDB sketch above. After some time the issue surfaces:
Value of init_status: 4294967294
Value of init_status: 4294967295
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
CPU usage then pins at 100% and the listener stops receiving messages.