Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference counter overflow in ddsrt_init results in busy loop in ROS2 humble rclpy #2156

Open
ozanm42 opened this issue Jan 13, 2025 · 0 comments

Comments

@ozanm42
Copy link

ozanm42 commented Jan 13, 2025

Hello all,

We are experiencing a stall in of our ROS2 humble nodes in the field after continuous working for 3 weeks.
The stall introduces itself as cpu throttling to 100% and the ROS2 callbacks to be not being serviced which points to a busy loop. I have traced the issue to a reference counter init_status overflowing to 0 after continuous increment over UINT_MAX32 in ddsrt_init function.

What is my theory?
When ROS2 calls for spin() function it is then eventually stepped into Executor._wait_for_ready_callbacks function which creates a list of subscriptions, timers, waitables.. and puts them into a WaitSet object here. This object creation path goes through rcl_wait_set_init -> rmw_create_wait_set -> dds_create_waitset -> dds_init -> ddsrt_init and increments the reference counter init_status there. But the inverse doesn't seem to hold true. Destruction of this object through rcl_wait_set_fini -> rmw_destroy_wait_set doesn't end the path on ddsrt_fini so the reference count never decreases which contradicts this statement.

What does this result in?
After (UINT_MAX32 - 0x80000001u / 1000HZ(my node's callback frequency) -> 24 days) init_status wraps to 0 and we hit retry loop here.
I would have expected this to happen other nodes as well but I think the prerequisite is single node in single container. If there was another node it'd would have incremented the counter to 1 so the busy wait status would have lasted intermittently(not 100% sure about this part).

Reproduction steps?
Here is my docker setup:

FROM ros:humble
SHELL ["/bin/bash", "-c"]
WORKDIR /workspace

ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
RUN apt update && apt install -y python3-pip sqlite3 ros-humble-example-interfaces ssh gdb git ros-humble-rmw-cyclonedds-cpp ros-humble-cyclonedds-dbgsym ros-humble-rclcpp-dbgsym ros-humble-rmw-cyclonedds-cpp-dbgsym ros-$ROS_DISTRO-demo-nodes-py ros-$ROS_DISTRO-demo-nodes-cpp
COPY . .

# Set up SSH keys for GitHub access
RUN mkdir -p ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts

# Install Python dependencies from requirements.txt
RUN --mount=type=ssh python3 -m pip install -r requirements.txt

# Build the workspace
RUN source /opt/ros/$ROS_DISTRO/setup.bash && colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug

I create two instances of the container. I run
ros2 run demo_nodes_py listener
and
ros2 run demo_nodes_py talker
in each.

I inspect init_status using GDB + bfptrace and observe it incrementing without decrementing

sudo bpftrace -e '
u:/proc/149514/root/opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0:ddsrt_init
{
    printf("Value of init_status: %u\n", * 0x745f3cfbf2e8);
}'

Value of init_status: 2147485586
Value of init_status: 2147485587
Value of init_status: 2147485588
Value of init_status: 2147485589
Value of init_status: 2147485590
Value of init_status: 2147485591
Value of init_status: 2147485592
Value of init_status: 2147485593
Value of init_status: 2147485594
Value of init_status: 2147485595
Value of init_status: 2147485596
Value of init_status: 2147485597
Value of init_status: 2147485598
Value of init_status: 2147485599
Value of init_status: 2147485600
Value of init_status: 2147485601

I use GDB to change the memory of init_status to a number close UINT_MAX32. After some time the issue surfaces:

Value of init_status: 4294967294
Value of init_status: 4294967295
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0
Value of init_status: 0

CPU throttles and listening stops:

Screenshot from 2025-01-13 11-18-29

@ozanm42 ozanm42 changed the title Reference counter overflow init in ddsrt_init results in busy loop in ROS2 humble rclpy Reference counter overflow in ddsrt_init results in busy loop in ROS2 humble rclpy Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant