-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast-DDS service discovery redesign #418
base: rolling
Are you sure you want to change the base?
Conversation
Signed-off-by: Iker Luengo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. Thanks for pushing for this @IkerLuengo !
design/service-discovery.md
Outdated
|
||
The Service Mapping relies on the built-in discovery service provided by DDS. However, this discovery is not robust and discovery race conditions can occur, resulting in service replies being lost. The reason is that the request and reply topics are independent at DDS level, meaning that they are matched independently. Therefore, it is possible that the request topic entities are matched while the response entities are not fully matched yet. In this situation, if the client makes a request, the response will be lost. | ||
|
||
On the client side this is partially solved checking the result of method `rmw_service_server_is_available` before sending any request. However, current implementation only checks that the request publisher and resonse subscribers are matched to any remote endpoint, but not that these remote endpoints correspond to the same servers. That is, the request publisher could be matched to one server ant the response subsciber to another one, so that any request will still be missing its response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo nit: when you say current implementation, consider referencing a concrete, versioned code.
design/service-discovery.md
Outdated
#### Caveats #### | ||
|
||
* Un-matching of endpoints has to deal with the new internal structures to maintain coherence. For example, removing the server guids from the fully-matched list and possibly moving it to the half-matched list, or cleaning the list of pending requests. | ||
* The algorithm has to be able support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo nit:
* The algorithm has to be able support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable. | |
* The algorithm has to be able to support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable. |
design/service-discovery.md
Outdated
|
||
At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form using `operator<<` and `operator>>` to read and write. | ||
|
||
In order be able to add other information on the `USER_DATA` on the future, we must *tag* the information somehow. The proposed format is: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo nit:
In order be able to add other information on the `USER_DATA` on the future, we must *tag* the information somehow. The proposed format is: | |
In order be able to add other information to the `USER_DATA` in the future, we must *tag* the information somehow. The proposed format is: |
design/service-discovery.md
Outdated
|
||
where `responseGUID:` is a string literal and `<GUID>` is the char string form of the GUID of the related endpoint, as formatted by `operator<<`. | ||
|
||
Using `properties` instead of `USER_DATA` would be preferable in this case, but unfortunately `properties` are not available at publisher/subscriber level on Fast DDS at this moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo same as above, it would be nice to be explicit about the time of writing so that this document remains true over time.
design/service-discovery.md
Outdated
### ParticipantListener::onPublisherDiscovery ### | ||
|
||
* When a new publisher is discovered, read the `USER_DATA` to get the corresponding response subcriber's GUID and store the relation on the `remote_response_guid_by_remote_request_guid_` map. | ||
* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo typo:
* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors). | |
* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compliant with these modifications (for backwards compatibility and with other vendors). |
|
||
* When a new publisher is discovered, read the `USER_DATA` to get the corresponding response subcriber's GUID and store the relation on the `remote_response_guid_by_remote_request_guid_` map. | ||
* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors). | ||
* When a publisher is un-discovered, remove the entry from the `remote_response_guid_by_remote_request_guid_` map (if any). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo what do you mean by un-discovered? That it goes away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That somehow the subscriber stops considering it as a matched peer. It can be that the peer disconnects or that it does not assert its liveliness. In any case, the subscriber will consider that the publisher is not reachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider replacing un-discovered
by no longer reachable
?
### CustomServiceInfo ### | ||
|
||
* Add a reference to CustomParticipantInfo, to be able to retrieve the relations between the incoming request and the subscriber that will be receiving the response. | ||
* Add `pending_requests_` to hold the requests that are waiting for their response channels to be ready. It will be an unordered multimap with the client response subscriber's GUID as key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo meta: should there be any cap to the amount of pending requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't consider limiting the size of the list because the queue of requests that are sent to process (ServiceListener::list) is not limited either.
design/service-discovery.md
Outdated
* Add `complete_matches_` to keep track of fully matched servers. It will be an unordered map with the server response publisher's GUID as key: | ||
`std::unordered_map<eprosima::fastrtps::rtps::GUID_t, eprosima::fastrtps::rtps::GUID_t, rmw_fastrtps_shared_cpp::hash_fastrtps_guid>` | ||
|
||
* Add `complete_matches_count_` to hold the size of `complete_matches_`, to be used on `rmw_service_server_is_available`. It will be an atomic variable `std::atomic_size_t`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo this is meant for locks during complete_matches_
updates not to penalize rmw_service_server_is_available()
calls, right? If so, a one sentence explanation would be nice.
design/service-discovery.md
Outdated
* When a response publisher is found, if the remote GUID is found on `pending_matches_`, we are about to complete the discovery for the response topic. Remove the entry from `pending_matches_` and move it to `complete_matches_`. | ||
* In the case of an unmatch, if the remote GUID is found on `complete_matches_`, we are about to have a half-discovery for the response topic. Remove the entry from `complete_matches_` and move it to `pending_matches_`. | ||
* Note that having only the response subscriber matched is not being tracked as pending match, so: | ||
* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request subscriber is still umatched or that the remote server does not implement this solution. In either case there is nothing to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo did you mean
* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request subscriber is still umatched or that the remote server does not implement this solution. In either case there is nothing to do. | |
* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request publisher is still umatched or that the remote server does not implement this solution. In either case there is nothing to do. |
considering we're looking at it from the client POV? Same below.
Signed-off-by: Iker Luengo <[email protected]>
Corrected as per suggestions from @hidmic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal sounds correct, thanks for working on it @IkerLuengo.
I think that much of the logic can be implemented in a vendor DDS agnostic way, by either using templates or using opaqued abstractions.
Having common logic will avoid re-implementing the same for other DDS vendors, but I understand you're probably only focused on getting it working for rmw_fastrtps
.
|
||
#### Note on the GUID format on the USER_DATA #### | ||
|
||
At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form using `operator<<` and `operator>>` to read and write. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does all vendor use the same text format?
It would be great if the used format can be in the future cross-vendor compatible (services aren't cross-vendor compatible now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no standardized text format for the GUI (the specification only describes it as a 16 octet value). If we are looking for future interoperability between vendors, raw octet values can be used. As the size of the GUID is fixed to 16 octets, there should be no problems with the parsing.
In any case, some kind of prefix/delimiter must be used to be able to parse the GUID info from other USER_DATA
that may be added. We used the responseGUID:
string as starting delimiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but let's wait for @ivanpauno, and perhaps @wjwwood or @jacobperron.
if (complete_matches_.find(response_publisher_guid)) then (yes) | ||
: pending_matches_[response_publisher_guid] = complete_matches_[response_publisher_guid]; | ||
: complete_matches_.erase(response_publisher_guid); | ||
: complete_matches_count_.store(complete_matches_.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo I think complete_matches_count_
should be decremented before complete_matches_
is modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Also corrected on ClientPubListener::OnPublicationMatched.
Changed the store
operation with the size of the map with fetch_add(1)
and fetch_sub(1)
.
design/service-discovery.md
Outdated
* Add `complete_matches_` to keep track of fully matched servers. It will be an unordered map with the server response publisher's GUID as key: | ||
`std::unordered_map<eprosima::fastrtps::rtps::GUID_t, eprosima::fastrtps::rtps::GUID_t, rmw_fastrtps_shared_cpp::hash_fastrtps_guid>` | ||
|
||
* Add `complete_matches_count_` to hold the size of `complete_matches_`, to be used on `rmw_service_server_is_available`. It will be an atomic variable `std::atomic_size_t`. This variable will be updated every time an entry is added or removed in `complete_matches_`, and its purpose is to avoid `rmw_service_server_is_available` competing for locks to `complete_matches_`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IkerLuengo I think that std::atomic_size_t
should be configured with std::memory_order_seq_cst
. If so, please note it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted
LGTM with @hidmic comments addressed |
@IkerLuengo friendly ping. |
Hi Michel,
Iker come back tomorrow from vacation.
El lun., 24 ago. 2020 18:26, Michel Hidalgo <[email protected]>
escribió:
… @IkerLuengo <https://github.com/IkerLuengo> friendly ping.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#418 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB3AUVIHT2DL3XU4VIHIPSTSCKIDFANCNFSM4PPZGEKA>
.
|
Signed-off-by: Iker Luengo <[email protected]>
Modified as suggested by @hidmic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think the idea makes a lot of sense. Thanks for pushing on this.
I have a few concerns inline which are mostly about some implementation details, but nothing that would prevent us from going forward with this idea.
Finally, @eboasson I know this particular article is about Fast DDS, but I think a similar idea would apply to Cyclone DDS as well. You've mentioned as much in some of your previous feedback (like ros2/rmw_cyclonedds#187 (comment)). Would you mind taking a look here and leaving your thoughts? If this becomes a generic solution, then I would take the implementation-independent parts out of this document and put them somewhere more generic (maybe https://design.ros2.org). Thanks.
|
||
### General description of the solution ### | ||
|
||
The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endoints belong to the same client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endoints belong to the same client. | |
The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endpoints belong to the same client. |
|
||
## Server side ## | ||
|
||
On the server side the general idea is to hold the incoming requests until the response publisher has matched with the response subscriber **that corresponds to the request publisher** sending the request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens on the server side if this never happens? That is, assume that the client comes up, establishes the request publication/subscription, sends a request and then crashes before the response subscription is fully setup? Will the server side hold on to the request forever?
This method creates the request subscriber first and the response publisher afterwards. | ||
The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit confusing. In particular, all of the other sections here describe what is going to be changed, while this one starts out by describing the current situation. I'll suggest just changing this to:
This method creates the request subscriber first and the response publisher afterwards. | |
The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber. | |
The response publisher must be created before the request subscriber so that we can get the publisher's GUID and store it in the `USER_DATA` of the `subscriberParam` before creating the subscriber. |
In order be able to add other information to the `USER_DATA` in the future, we must *tag* the information somehow. The proposed format is: | ||
|
||
``` | ||
responseGUID:<GUID> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to bikeshed too much, but I'll suggest that this include the word service in it somehow. Maybe:
responseGUID:<GUID> | |
serviceresponseGUID:<GUID> |
#### rmw_service_server_is_available #### | ||
|
||
* If the `complete_matches_` map contains elements, there is at least one server fully matched. | ||
* Else, either there is none fully matched or there are matched server that do not implement this solution. Just in case, we revert to legacy behavior, checking the number of remote and local endpoints, but without ensuring they match with each other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, but doesn't this fallback behavior mean that we can still have the racing problem with a fully-compliant solution? That is, if both the client and the server implement this solution, but the client is half-matched, rmw_service_server_is_available
will still return true (because of the legacy behavior). Or am I missing something?
There are three independent problems here:
Probably the most sensible thing to do handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition). This could be the |
👍 The client/service id (here) solution sounds more elegant.
👍 I would also use the same format.
👍
I'm not sure I understand, would the client have a guard condition to indicate if the server availability changed or something like that?
The discovery issue in the "remote" side sounds like a DDS specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general), all other changes to handle "network connectivity issues" that are not DDS specific I think the best thing to do is to handle them in rcl (which will likely involve extending rcl API). |
I'm not sure I understand this bit. Fundamentally, there is a race that exists here because you have 2 independent topics that implement the service. Both this solution (and the slightly different one in Cyclone) resolve that race by binding the two topics together somehow. How would you resolve this differently in the specification?
Yeah, agreed here. I can definitely imagine other RMWs where the service is a true RPC call (more like DDS-RPC), and so you don't have this particular issue. So I think this part makes sense to implement in |
What I was thinking of is the following: suppose the service defers requests from clients for which it hasn’t yet discovered the response reader, then the service implementation within the RMW layer suddenly has to monitor two sources of requests. The first is that of the requests arriving at the data reader for requests, and the second is that of the deferred requests for which the discovery happens to now have completed. In the implementation, the application typically sits there waiting in the Reworking the request handling to respond to the discovery information exchanged already by Such an implementation would ideally be done in the
I agree that the particular problem is quite DDS-specific at the moment. My view is that DDS ought to be improved to allow waiting until the remote has discovered the local entities (at which point (@clalancette, perhaps this also answers your question?) |
Ah yeah, that's true.
That sounds reasonable to me. Unrelated note: about the impedance mismatch between DDS and ROS 2 waitset, that triggered the discussions in ros2/design#305 and in this discourse post. |
@IkerLuengo friendly ping ! |
This is the design for the solution to #392.
On the server, the requests are held back until the response subscriber is matched. Once the matching occurs, all the pending requests corresponding to that client are sent to processing.
On the client, a list of fully-matched servers is kept, i.e., those for which both the request subscriber and the response publisher are matched. Only if this list is not empty does rmw_service_server_is_available return true.
The correspondence between publisher and subscriber GUIDs on one endpoint (server or client) is shared with the other endpoint through
USER_DATA
. If the remote endpoint does not have this solution implemented (does not share the GUID correspondence), legacy behavior is kept.