Hello all,

We are experiencing a stall in our ROS2 Humble nodes in the field after about three weeks of continuous operation.
The stall manifests as CPU usage pegged at 100% with ROS2 callbacks no longer being serviced, which points to a busy loop. I have traced the issue to the reference counter init_status in the ddsrt_init function overflowing to 0 after being continuously incremented past UINT32_MAX.
What is my theory?
When ROS2's spin() is called, execution eventually steps into the Executor._wait_for_ready_callbacks function, which builds a list of subscriptions, timers, waitables, etc. and puts them into a WaitSet object here. The creation path of this object goes through rcl_wait_set_init -> rmw_create_wait_set -> dds_create_waitset -> dds_init -> ddsrt_init and increments the reference counter init_status there. But the inverse does not hold: destroying the object through rcl_wait_set_fini -> rmw_destroy_wait_set never reaches ddsrt_fini, so the reference count never decreases, which contradicts this statement.
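To make the theory concrete, here is a simplified C model of the init_status pattern as I understand it. This is my reconstruction for illustration, not the verbatim CycloneDDS source; the INIT_STATUS_OK "fully initialized" high bit is taken from the ddsrt sources:

#include <stdatomic.h>
#include <stdint.h>

#define INIT_STATUS_OK 0x80000000u

static atomic_uint init_status;   /* starts at 0 */

/* Model of ddsrt_init(): every call bumps the counter. Once the first
   caller finishes one-time initialization and sets the high bit, the
   counter sits at 0x80000001 + (number of extra ddsrt_init calls). */
void ddsrt_init_model (void)
{
  uint32_t v = atomic_fetch_add (&init_status, 1) + 1;
  if (v > INIT_STATUS_OK)
    return;                        /* already initialized: fast path */
  if (v == 1) {
    /* ... one-time initialization ... */
    atomic_fetch_or (&init_status, INIT_STATUS_OK);
  }
}

/* Model of ddsrt_fini(): the matching decrement. If wait-set destruction
   never reaches this, the counter only ever grows. */
void ddsrt_fini_model (void)
{
  atomic_fetch_sub (&init_status, 1);
}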
What does this result in?
After (UINT32_MAX - 0x80000001u) / 1000 Hz (my node's callback frequency) ≈ 24 days, init_status wraps to 0 and we hit the retry loop here.
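Continuing the simplified model from above, here is my reading of why a wrapped counter turns into a busy loop (again a sketch of the logic, not the exact source): after the wrap v is 0, which is neither above INIT_STATUS_OK nor equal to 1, and the inner wait condition v > 1 is false, so the thread bounces on goto retry forever without sleeping or reloading the counter:

/* Back-of-the-envelope for the time to wrap:
   UINT32_MAX - 0x80000001u = 2147483646 increments, at ~1000
   ddsrt_init calls/s -> ~2147484 s, i.e. roughly 24-25 days. */
void ddsrt_init_retry_model (void)
{
  uint32_t v = atomic_fetch_add (&init_status, 1) + 1;
retry:
  if (v > INIT_STATUS_OK)
    return;                        /* normal post-init fast path */
  else if (v == 1) {
    /* ... one-time initialization ... */
    atomic_fetch_or (&init_status, INIT_STATUS_OK);
  }
  else {
    /* meant to wait for a concurrent initializer (the real code sleeps
       between reloads), but with v == 0 the condition is false, v is
       never reloaded, and goto retry spins at 100% CPU */
    while (v > 1 && !(v & INIT_STATUS_OK))
      v = atomic_load (&init_status);
    goto retry;
  }
}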
I would have expected this to happen to other nodes as well, but I think the prerequisite is a single node in a single container. If there were another node, it would have incremented the counter to 1, so the busy-wait state would only have appeared intermittently (not 100% sure about this part).
Reproduction steps?
Here is my docker setup:
FROM ros:humble
SHELL ["/bin/bash", "-c"]
WORKDIR /workspace
ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
RUN apt update && apt install -y python3-pip sqlite3 ros-humble-example-interfaces ssh gdb git ros-humble-rmw-cyclonedds-cpp ros-humble-cyclonedds-dbgsym ros-humble-rclcpp-dbgsym ros-humble-rmw-cyclonedds-cpp-dbgsym ros-$ROS_DISTRO-demo-nodes-py ros-$ROS_DISTRO-demo-nodes-cpp
COPY . .
# Set up SSH keys for GitHub access
RUN mkdir -p ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
# Install Python dependencies from requirements.txt
RUN --mount=type=ssh python3 -m pip install -r requirements.txt
# Build the workspace
RUN source /opt/ros/$ROS_DISTRO/setup.bash && colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug
I create two instances of the container, then run ros2 run demo_nodes_py listener in one and ros2 run demo_nodes_py talker in the other.
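For completeness, roughly how I build and start the two instances (the image and container names are placeholders; host networking is used so DDS discovery works between the containers, and --ssh default is needed for the --mount=type=ssh step in the Dockerfile):

DOCKER_BUILDKIT=1 docker build --ssh default -t ddsrt-overflow-repro .
docker run -d --network host --name talker ddsrt-overflow-repro ros2 run demo_nodes_py talker
docker run -d --network host --name listener ddsrt-overflow-repro ros2 run demo_nodes_py listener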
I inspect init_status using GDB + bpftrace and observe it incrementing without ever decrementing:
sudo bpftrace -e '
// Fires on each ddsrt_init() entry in PID 149514; the address is the
// runtime location of init_status, resolved with GDB (see below).
u:/proc/149514/root/opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0:ddsrt_init
{
    printf("Value of init_status: %u\n", *(uint32 *)0x745f3cfbf2e8);
}'
Value of init_status: 2147485586
Value of init_status: 2147485587
Value of init_status: 2147485588
Value of init_status: 2147485589
Value of init_status: 2147485590
Value of init_status: 2147485591
Value of init_status: 2147485592
Value of init_status: 2147485593
Value of init_status: 2147485594
Value of init_status: 2147485595
Value of init_status: 2147485596
Value of init_status: 2147485597
Value of init_status: 2147485598
Value of init_status: 2147485599
Value of init_status: 2147485600
Value of init_status: 2147485601
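For reference, this is roughly how I resolve the address used in the bpftrace probe and later overwrite the counter. A sketch: it assumes the -dbgsym packages installed above provide symbols for libddsc, and that init_status is ddsrt's struct-wrapped atomic with a .v field:

gdb -p 149514
(gdb) p &init_status
$1 = (ddsrt_atomic_uint32_t *) 0x745f3cfbf2e8
(gdb) set var init_status.v = 4294967290
(gdb) detach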
I use GDB to change init_status in memory to a value close to UINT32_MAX, as in the GDB sketch above. After some time the issue surfaces:
Value of init_status: 4294967294
Value of init_status: 4294967295
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
CPU usage then pins at 100% and the listener stops receiving messages.