Not getting service responses reliably when using CycloneDDS #74
Comments
I can still reproduce on
I can confirm the problem. My first guess is a race: the client discovers the server and sends its request, but when the server receives the request and wants to publish the reply, it hasn't discovered the client yet. With a simple hack (adding a sleep of 500 ms in the server callback function) the client receives the response reliably. I guess we might need something like a "wait_for_client", in the same way as we already offer a "wait_for_service".
Cyclone implements what the spec calls "stateful" readers and writers, which means the server's reader for the request topic won't accept any data until it has discovered the client's writer. "wait_for_service" waits until the server's reader for the request topic has been discovered; consequently, the client's writer will hold on to the request message until the service's reader acknowledges it. It is therefore pretty much impossible that the service is unaware of the existence of the client. Reader and writer discovery do run independently in RTPS, so it is technically possible that the server has discovered the client's request writer and handled the request without having discovered the client's response reader yet, but that does seem unlikely. I haven't had a chance to try to reproduce it yet (and this weekend it's not likely to happen either because of all kinds of obligations), but if I were to try, I'd probably enable Cyclone's tracing to get some more insight into what is going on. (Not for the faint of heart, allegedly … `CYCLONEDDS_URI='<Tr><V>finest</><Out>cdds.log.${CYCLONEDDS_PID}'` suffices. You have to squint a bit to figure out how that relates to proper XML 🙂.)
I'm no expert here. Having said that, I think that may be the case @eboasson. The server sets up a response writer and a request reader first:
When the client gets brought up, it discovers the server in full:
And then it sets up a response reader and a request writer:
The server discovers the client's request writer first:
The client then proceeds to send a request:
And the server receives and replies:
But it isn't till shortly after that the server discovers the client's response reader:
I've updated both logs here. Mind checking whether my logic is correct? Maybe I'm missing something in the midst of the ack/heartbeat dance.
@hidmic your analysis is entirely correct. I'm a bit surprised that the dance works out this way, but it is definitely what happens. @dirk-thomas's "wait_for_client" would solve this, but it is a bit of a tricky operation in that it must gracefully deal with the client disappearing while the service is waiting. What I would find more elegant is for the client to first wait for the server to have discovered the response reader, and only then send the request. That way the server never has to wait and is still certain that the response will go out properly. I think that is possible in principle: guarantee that the endpoint discovery data doesn't get acknowledged until after it has been processed completely, and then offer a way of doing "wait_for_acknowledgements" on the endpoint discovery writers. But I don't dare predict when that'll be available ...
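The bounded-wait idea discussed above (block the response until the client's response reader is discovered, but give up if the client disappears or a deadline passes) can be sketched with standard C++ primitives. This is a minimal illustration of the pattern under discussion, not the actual rmw_cyclonedds code; the class and function names are hypothetical.

```cpp
// Hypothetical sketch: a tracker the discovery thread updates, which a
// service can use to wait (with a deadline) for a client's response reader.
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <thread>

class DiscoveryTracker {
public:
  // Called from the discovery thread when a client's response reader shows up.
  void on_reader_discovered(const std::string & client_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    known_readers_.insert(client_id);
    cv_.notify_all();
  }

  // Called before publishing a response: wait until the reader for
  // `client_id` is known, or until `timeout` elapses. Returns true if the
  // reader was discovered in time.
  bool wait_for_response_reader(const std::string & client_id,
                                std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [&] {
      return known_readers_.count(client_id) != 0;
    });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::set<std::string> known_readers_;
};
```

A timeout keeps the service from blocking indefinitely if the client vanishes mid-exchange, which is the tricky case @eboasson mentions.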
IMHO services are core functionality in ROS 2, so I personally think we need a fix for this ASAP.
* **Block rmw_send_response if response reader unknown.** The client checks, using rmw_service_server_is_available, whether the request it sends will be delivered to the service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually it will have been, but there is no guarantee. Ideally DDS would offer an interface for checking this reverse discovery, but that does not yet exist in either the specification or Cyclone. This commit works around that by delaying publication of the response until the number of request writers matches the number of response readers.
* **Change request headers to use rmw_request_id_t on the wire.**
* **Precise check for matched client/service.** Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer, and use it for the request ids. This allows: rmw_service_server_is_available to return true only once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some other service); and rmw_send_response to block until it has discovered the requesting client's response reader, and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA, using the keys "serviceid" and "clientid". This is still but a workaround for having a mechanism in DDS to ensure that the response reader has been discovered by the request writer prior to sending the request.
* **Address review comments.**
* **Backwards compatibility.** Revert commit fb040c5 to retain the old wire representation; embed the publication_handle of the request inside rmw_request_id_t (possible because reverting to the old wire representation frees up enough space) and use it in rmw_send_response to check for the presence of the client's reader; treat clients and services without a client/service id in the reader/writer user data as fully matched at all times.
* **Replace ERROR by FAILURE because of windows.h.**
* **Timeout rmw_send_response after waiting 100 ms for discovery.** Discovery will eventually result in either the client's reader becoming known or its writer no longer being known, so a timeout is not necessary for correctness. However, if the call were ever to block for a long time (which is possible in the face of network failures), returning a timeout to the caller should cause less confusion.
* **Make iterators "const auto &".**
* **Add TODO for eliminating rmw_send_response blocking.**

Signed-off-by: Erik Boasson <[email protected]>
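The commit message above describes encoding a service/client id in the reader/writer USER_DATA using the same convention as the participant USER_DATA, i.e. "key=value;" pairs with keys "serviceid" and "clientid". A minimal sketch of parsing such a string (a hypothetical helper for illustration, not the actual rmw_cyclonedds implementation):

```cpp
// Hypothetical helper: split a USER_DATA string of the form
// "serviceid=0123abcd;clientid=4567ef01;" into a key/value map.
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

std::map<std::string, std::string> parse_user_data(const std::string & data) {
  std::map<std::string, std::string> result;
  std::size_t pos = 0;
  while (pos < data.size()) {
    const std::size_t eq = data.find('=', pos);
    if (eq == std::string::npos) {
      break;  // malformed tail without '=': stop parsing
    }
    const std::size_t semi = data.find(';', eq + 1);
    const std::size_t end = (semi == std::string::npos) ? data.size() : semi;
    result[data.substr(pos, eq - pos)] = data.substr(eq + 1, end - eq - 1);
    pos = (semi == std::string::npos) ? data.size() : semi + 1;
  }
  return result;
}
```

With this scheme, the service can look up "clientid" from a discovered reader's USER_DATA to decide whether it belongs to the client it is trying to answer.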
Co-authored-by: eboasson <[email protected]>
Some re-design discussion took place in ros2/rmw_fastrtps#418.
Hi, System setup:
Tagging @dirk-thomas for some insight on the memory leak. Thanks!
Bug report
Required Info:
Steps to reproduce issue
```
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_server
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_client_async
```
Expected behavior
A single service call takes place and succeeds, i.e. you get logs on each terminal:
[INFO] [add_two_ints_client]: Result of add_two_ints: 5
Actual behavior
The client hangs with no output, though the server does log, so it appears to have received the request.
Additional information
It does not happen on `rclpy`. It also does not happen if bringing up both nodes via `launch`. So it's likely a race.