tag 0.3.0 (current rolling binary) not working: Scouting delay elapsed before start conditions are met. #439

Open
berndpfrommer opened this issue Jan 30, 2025 · 5 comments

@berndpfrommer
Contributor

The current apt package on Ubuntu 24.04 (noble) "does not work" (more below).
This is the package:

ros-rolling-rmw-zenoh-cpp/noble 0.3.0-1noble.20250102.230425 amd64

When I run the demo nodes (all on the same host), I get this warning:

ros2 run demo_nodes_cpp talker
2025-01-30T12:57:34.365307Z  WARN ThreadId(06) zenoh::net::runtime::orchestrator: Scouting delay elapsed before start conditions are met.
[INFO] [1738241855.370982479] [talker]: Publishing: 'Hello World: 1'
[INFO] [1738241856.370704297] [talker]: Publishing: 'Hello World: 2'

And neither topic nor node list work (the listener also receives no messages):

ros2 topic list
2025-01-30T12:57:41.659755Z  WARN ThreadId(06) zenoh::net::runtime::orchestrator: Scouting delay elapsed before start conditions are met.

I was able to reproduce this problem by compiling the rmw_zenoh repo from source using tag 0.3.0.
I also noted that with tag 0.3.0 I no longer have to start an rmw_zenohd daemon.

But if I upgrade to the latest commit e638f8c I have to start the ros2 daemon manually again - AND the "Scouting delay elapsed" message goes away, and everything works again.

What happened with 0.3.0? I don't understand why this was ever released and how you did not catch that in testing. Given the generally solid experience I had with rmw_zenoh, I still have nagging doubts that somehow my setup is very unusual or something is wrong on my side.

The other thing is: I don't like the ros2 daemon. It creates hidden state that I find confusing when switching between rmw versions (I went back and forth a couple times for testing). I'd much rather start rmw_zenohd manually; then I know exactly what process is doing the communication and I see its console log. Granted, that's a matter of taste, but it may be worthwhile to poll people on what they prefer.

@clalancette
Collaborator

The other thing is: I don't like the ros2 daemon. It creates hidden state that I find confusing when switching between rmw versions (I went back and forth a couple times for testing).

See #242

@Yadunund
Member

Yadunund commented Jan 30, 2025

Hi @berndpfrommer,

Thanks for the ticket and trying out the binaries we released!

I was not able to reproduce your error on my system. Please see my screen recording below.

rmw_zenoh_439.mp4

Is there any chance you were running rmw_zenoh built from the latest source before you switched to the rmw_zenoh binaries? I'd like to point out that the version of Zenoh in the latest source (Zenoh 1.1.x) is newer than what's shipped with 0.3.0 (Zenoh 1.0.2). The version was bumped in PR #424. I can imagine some interference if the environment you used to run your tests mixed source and binary installs.
We plan to bump the rmw_zenoh binaries next week to include the Zenoh 1.1 migration and other fixes we've adopted since, but the problems you are reporting should not be present in the released binaries.
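
One quick way to look for a mixed source/binary environment is to list which workspace prefixes are active in the current shell. The sketch below is only an illustration and rests on two assumptions: that the binary install lives under /opt/ros (as in /opt/ros/rolling), and that sourced workspaces appear in the AMENT_PREFIX_PATH environment variable. The function name non_binary_prefixes is hypothetical and not part of rmw_zenoh.

```shell
#!/usr/bin/env bash
# Print every entry of a colon-separated prefix list that is NOT under
# /opt/ros. Defaults to $AMENT_PREFIX_PATH when no argument is given.
# Assumption: anything outside /opt/ros is a source-built workspace.
non_binary_prefixes() {
  local path_list="${1:-${AMENT_PREFIX_PATH:-}}"
  local IFS=':'
  local p
  for p in $path_list; do
    case "$p" in
      ""|/opt/ros/*) ;;   # skip empty entries and the binary install
      *) echo "$p" ;;     # report anything else
    esac
  done
}
```

If `non_binary_prefixes` prints anything in the terminal used for testing the apt package, a source workspace is still sourced there and may be shadowing the binaries.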

What happened with 0.3.0? I don't understand why this was ever released and how you did not catch that in testing. Given the generally solid experience I had with rmw_zenoh, I still have nagging doubts that somehow my setup is very unusual or something is wrong on my side.

Before releasing 0.3.0, the team (although very small) ran nearly all tests in ROS 2 which run in our nightly CIs (including system_tests, and all tests from rclcpp, rclcpp_lifecycle, rclcpp_components, rclcpp_action, rcl, rcl_lifecycle, rcl_action, rclpy and rosbag2) multiple times. The only consistent test failure is #326, which we are addressing (there are some other flaky ones).
Before merging any new changes, we also run the same test suite against the changes and report any regressions. Example: #419 (comment)

Could I trouble you to try the binaries again, but in a clean environment? There could be routers running in the background, or perhaps even ros2 daemons using a different version of Zenoh.

  1. Close any open terminals
  2. Run pkill -9 -f ros; pkill -9 -f zenoh in a new terminal
  3. Ensure no other ROS workspaces are sourced in ~/.bashrc or equivalent
  4. source /opt/ros/rolling/setup.bash
  5. Start the router
  6. Repeat step 4 and run talker / other nodes.

Killing any zenoh routers / daemons before switching versions might be the solution here and we'll need to work towards addressing that ticket linked above.
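
The steps above can be sketched as a small helper script. This is a hedged illustration, not something shipped with rmw_zenoh: the function names and the DRY_RUN switch are made up for the example, and by default the commands are only printed so nothing gets killed by accident. The export of RMW_IMPLEMENTATION is an assumption that is not part of the listed steps; it is only needed if rmw_zenoh_cpp is not already the selected RMW in your shell.

```shell
#!/usr/bin/env bash
# With DRY_RUN=1 (the default here) each command is only printed;
# set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 2: stop leftover ROS and Zenoh processes (routers, ros2 daemons).
# The patterns are the broad ones from the thread; pkill returns non-zero
# when nothing matched, which is fine here.
kill_leftovers() {
  run pkill -9 -f ros || true
  run pkill -9 -f zenoh || true
}

# Steps 4-5 (terminal 1): source the binary install and start the router.
start_router() {
  run source /opt/ros/rolling/setup.bash
  run ros2 run rmw_zenoh_cpp rmw_zenohd
}

# Step 6 (terminal 2): source again and run a demo node.
start_talker() {
  run source /opt/ros/rolling/setup.bash
  # Assumption: select rmw_zenoh explicitly in case it is not the default.
  run export RMW_IMPLEMENTATION=rmw_zenoh_cpp
  run ros2 run demo_nodes_cpp talker
}
```

Source the file and call `kill_leftovers`, `start_router` and `start_talker` to preview the commands; rerun with `DRY_RUN=0` once the output looks right.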

@Yadunund
Member

Yadunund commented Jan 30, 2025

I also noted that with tag 0.3.0 I no longer have to start an rmw_zenohd daemon.

Okay I think I realize the problem now....

I believe you ran talker and ros2 topic list without starting the zenohd router? I get the same scouting-related printout when I do that. But when I start the zenohd router after the talker is running, I can run ros2 topic list again and see the /chatter topic. See the screen recording below.

rmw_zenoh_439_without_router.mp4

One bug in the 0.3.0 release is that a warning message is not printed out stating that the node did not connect to a router. This was a regression that was fixed in #427 and is not yet released.

With the latest source, starting talker without the router running will print out this warning.

[WARN] [1738275468.136367766] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?

Some time ago (before 0.3.0 release), we changed the default behavior in rmw_zenoh to not block if the router is not detected #308. But a regression was introduced which was patched in #427.

To summarize

  • By default in rmw_zenoh, peers always need to connect to a router to discover one another. ros2 node list gets graph data from the ros2 daemon, which is the "hidden" peer. In your case, without the router running, the daemon was not discovering the talker.
  • The 0.3.0 binaries have a regression where a warning message is not printed out if the peer does not find a router.
  • If you start the zenohd router afterwards, peers including the ros2 daemon should be able to discover any previously started nodes.
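
Until the warning is back in the binaries, a small pre-flight check can stand in for it. The sketch below is an assumption-laden illustration: it presumes the router shows up as a local process named rmw_zenohd (the executable started by `ros2 run rmw_zenoh_cpp rmw_zenohd`), it only looks at the local host, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if a process with the given exact name is running.
# Assumption: the router process is called "rmw_zenohd".
router_running() {
  pgrep -x "${1:-rmw_zenohd}" >/dev/null 2>&1
}

# Print a reminder instead of silently continuing without a router.
check_router() {
  if router_running "$@"; then
    echo "router found"
  else
    echo "no router found; start one with: ros2 run rmw_zenoh_cpp rmw_zenohd"
    return 1
  fi
}
```

Running `check_router` before starting talker or `ros2 topic list` makes the missing-router case visible. A router reachable on another machine would not be detected by this check.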

@berndpfrommer
Contributor Author

Funny, I was just testing in a clean environment and indeed everything works fine if I start the zenoh router before starting the talker. Since there was no complaint, I was thinking rmw_zenoh_cpp had been changed to start the ros2 daemon. Things like that throw me back to the times before rmw_zenoh, when the simplest stuff just didn't work and there was always a little gotcha that explained everything and made me feel stupid.

Looking forward to the next release where this is fixed. In the meanwhile I'll stick with the latest build from source.

And please leave this issue open until the next binaries are released, such that other idiots like me can search and find out what the problem is.

@Yadunund
Member

Thanks for the feedback.

Things should work too if you start the router after.

Even with the latest source build, I now realize that printing the warning once and continuing with initialization probably gives users a false sense of security that the router was somehow started in the background. I've opened #440 to make it clear that users still need to start the router. I will ensure this change is in the next release.

And please leave this issue open until the next binaries are released, such that other idiots like me can search and find out what the problem is.

Sure thing.
