Fast-DDS service discovery redesign #418

IkerLuengo · 2020-07-31T08:41:17Z

This is the design for the solution to #392.

On the server, the requests are held back until the response subscriber is matched. Once the matching occurs, all the pending requests corresponding to that client are sent to processing.

On the client, a list of fully-matched servers is kept, i.e., those for which both the request subscriber and the response publisher are matched. Only if this list is not empty does rmw_service_server_is_available return true.

The correspondence between publisher and subscriber GUIDs on one endpoint (server or client) is shared with the other endpoint through USER_DATA. If the remote endpoint does not have this solution implemented (does not share the GUID correspondence), legacy behavior is kept.

Signed-off-by: Iker Luengo <[email protected]>

hidmic

I left some comments. Thanks for pushing for this @IkerLuengo !

hidmic · 2020-07-31T14:05:55Z

design/service-discovery.md

+
+The Service Mapping relies on the built-in discovery service provided by DDS. However, this discovery is not robust and discovery race conditions can occur, resulting in service replies being lost. The reason is that the request and reply topics are independent at DDS level, meaning that they are matched independently. Therefore, it is possible that the request topic entities are matched while the response entities are not fully matched yet. In this situation, if the client makes a request, the response will be lost.
+
+On the client side this is partially solved by checking the result of method `rmw_service_server_is_available` before sending any request. However, current implementation only checks that the request publisher and response subscriber are matched to any remote endpoint, but not that these remote endpoints correspond to the same server. That is, the request publisher could be matched to one server and the response subscriber to another one, so that any request will still be missing its response.


@IkerLuengo nit: when you say current implementation, consider referencing a concrete, versioned code.

hidmic · 2020-07-31T14:12:22Z

design/service-discovery.md

+#### Caveats ####
+
+* Un-matching of endpoints has to deal with the new internal structures to maintain coherence. For example, removing the server guids from the fully-matched list and possibly moving it to the half-matched list, or cleaning the list of pending requests.
+* The algorithm has to be able support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable.


@IkerLuengo nit:

Suggested change

* The algorithm has to be able support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable.

* The algorithm has to be able to support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable.

hidmic · 2020-07-31T14:13:15Z

design/service-discovery.md

+
+At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form using `operator<<` and `operator>>` to read and write.
+
+In order be able to add other information on the `USER_DATA` on the future, we must *tag* the information somehow. The proposed format is:


@IkerLuengo nit:

Suggested change

In order be able to add other information on the `USER_DATA` on the future, we must *tag* the information somehow. The proposed format is:

In order be able to add other information to the `USER_DATA` in the future, we must *tag* the information somehow. The proposed format is:

hidmic · 2020-07-31T14:15:03Z

design/service-discovery.md

+
+where `responseGUID:` is a string literal and `<GUID>` is the char string form of the GUID of the related endpoint, as formatted by `operator<<`.
+
+Using `properties` instead of `USER_DATA` would be preferable in this case, but unfortunately `properties` are not available at publisher/subscriber level on Fast DDS at this moment.


@IkerLuengo same as above, it would be nice to be explicit about the time of writing so that this document remains true over time.

hidmic · 2020-07-31T14:28:48Z

design/service-discovery.md

+### ParticipantListener::onPublisherDiscovery ###
+
+* When a new publisher is discovered, read the `USER_DATA` to get the corresponding response subcriber's GUID and store the relation on the `remote_response_guid_by_remote_request_guid_` map.
+* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors).


@IkerLuengo typo:

Suggested change

* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors).

* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compliant with these modifications (for backwards compatibility and with other vendors).

hidmic · 2020-07-31T14:29:25Z

design/service-discovery.md

+
+* When a new publisher is discovered, read the `USER_DATA` to get the corresponding response subcriber's GUID and store the relation on the `remote_response_guid_by_remote_request_guid_` map.
+* If the `USER_DATA` dos not contain a GUID, **do not add an entry to the map**. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors).
+* When a publisher is un-discovered, remove the entry from the `remote_response_guid_by_remote_request_guid_` map (if any).


@IkerLuengo what do you mean by un-discovered? That it goes away?

That somehow the subscriber stops considering it as a matched peer. It can be that the peer disconnects or that it does not assert its liveliness. In any case, the subscriber will consider that the publisher is not reachable.

Would you consider replacing un-discovered by no longer reachable?

hidmic · 2020-07-31T14:32:59Z

design/service-discovery.md

+### CustomServiceInfo ###
+
+* Add a reference to CustomParticipantInfo, to be able to retrieve the relations between the incoming request and the subscriber that will be receiving the response.
+* Add `pending_requests_` to hold the requests that are waiting for their response channels to be ready. It will be an unordered multimap with the client response subscriber's GUID as key:


@IkerLuengo meta: should there be any cap to the amount of pending requests?

We didn't consider limiting the size of the list because the queue of requests that are sent to process (ServiceListener::list) is not limited either.
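For illustration, the hold-back behaviour of `pending_requests_` could be sketched as follows. This is a minimal sketch and not the rmw_fastrtps code: GUIDs and requests are plain strings, and the `PendingRequests` class and its method names are hypothetical. As discussed above, the container is left uncapped.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: GUIDs and requests are plain strings in this sketch.
using Guid = std::string;
using Request = std::string;

class PendingRequests
{
public:
  // Called when a request arrives. Returns true if it can be processed now,
  // false if it was held back because the response subscriber is not matched.
  bool on_request(const Guid & response_subscriber, const Request & request)
  {
    if (matched_.count(response_subscriber) != 0) {return true;}
    pending_.emplace(response_subscriber, request);
    return false;
  }

  // Called when the response publisher matches a response subscriber.
  // Returns every request that was waiting for that subscriber.
  std::vector<Request> on_matched(const Guid & response_subscriber)
  {
    matched_.insert(response_subscriber);
    std::vector<Request> ready;
    auto range = pending_.equal_range(response_subscriber);
    for (auto it = range.first; it != range.second; ++it) {
      ready.push_back(it->second);
    }
    pending_.erase(range.first, range.second);
    return ready;
  }

  // Called on un-matching: forget the subscriber and drop its held requests.
  void on_unmatched(const Guid & response_subscriber)
  {
    matched_.erase(response_subscriber);
    pending_.erase(response_subscriber);
  }

  std::size_t pending_count() const {return pending_.size();}

private:
  std::unordered_set<Guid> matched_;
  std::unordered_multimap<Guid, Request> pending_;
};
```

Note that `std::unordered_multimap` does not promise that `equal_range` yields the requests in arrival order, so an implementation that needs FIFO processing per client would have to track the order separately.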

hidmic · 2020-07-31T15:43:28Z

design/service-discovery.md

+* Add `complete_matches_` to keep track of fully matched servers. It will be an unordered map with the server response publisher's GUID as key:
+  `std::unordered_map<eprosima::fastrtps::rtps::GUID_t, eprosima::fastrtps::rtps::GUID_t, rmw_fastrtps_shared_cpp::hash_fastrtps_guid>`
+
+* Add `complete_matches_count_` to hold the size of `complete_matches_`, to be used on `rmw_service_server_is_available`. It will be an atomic variable `std::atomic_size_t`.


@IkerLuengo this is meant for locks during complete_matches_ updates not to penalize rmw_service_server_is_available() calls, right? If so, a one sentence explanation would be nice.

hidmic · 2020-07-31T15:46:25Z

design/service-discovery.md

+* When a response publisher is found, if the remote GUID is found on `pending_matches_`, we are about to complete the discovery for the response topic. Remove the entry from `pending_matches_` and move it to `complete_matches_`.
+* In the case of an unmatch, if the remote GUID is found on `complete_matches_`, we are about to have a half-discovery for the response topic. Remove the entry from `complete_matches_` and move it to `pending_matches_`.
+* Note that having only the response subscriber matched is not being tracked as pending match, so:
+  * In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request subscriber is still umatched or that the remote server does not implement this solution. In either case there is nothing to do.


@IkerLuengo did you mean

Suggested change

* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request subscriber is still umatched or that the remote server does not implement this solution. In either case there is nothing to do.

* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request publisher is still umatched or that the remote server does not implement this solution. In either case there is nothing to do.

considering we're looking at it from the client POV? Same below.
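To make the client-side transitions in this hunk concrete, here is a rough sketch of the response-publisher match and unmatch handling. It assumes both maps are keyed by the server response publisher's GUID and hold the request subscriber's GUID as value; `Guid` is a plain string and the `ClientMatches` type is hypothetical, not the actual listener code.

```cpp
#include <string>
#include <unordered_map>

using Guid = std::string;  // hypothetical stand-in for a DDS GUID

struct ClientMatches
{
  // Key: server response publisher GUID. Value: server request subscriber GUID.
  std::unordered_map<Guid, Guid> pending_matches_;
  std::unordered_map<Guid, Guid> complete_matches_;

  // Response publisher matched: promote the server from pending to complete.
  // Returns false when the GUID is not pending (nothing to do).
  bool on_response_publisher_matched(const Guid & response_publisher)
  {
    auto it = pending_matches_.find(response_publisher);
    if (it == pending_matches_.end()) {return false;}
    complete_matches_[response_publisher] = it->second;
    pending_matches_.erase(it);
    return true;
  }

  // Response publisher unmatched: demote the server back to pending.
  // Returns false when the GUID was not fully matched (nothing to do).
  bool on_response_publisher_unmatched(const Guid & response_publisher)
  {
    auto it = complete_matches_.find(response_publisher);
    if (it == complete_matches_.end()) {return false;}
    pending_matches_[response_publisher] = it->second;
    complete_matches_.erase(it);
    return true;
  }
};
```

The "nothing to do" branches cover both cases named in the bullet: the request side is not matched yet, or the remote server does not publish the GUID correspondence.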

Signed-off-by: Iker Luengo <[email protected]>

IkerLuengo · 2020-08-03T10:09:43Z

Corrected as per suggestions from @hidmic

ivanpauno

The proposal sounds correct, thanks for working on it @IkerLuengo.

I think that much of the logic can be implemented in a DDS-vendor-agnostic way, by using either templates or opaque abstractions.
Having common logic will avoid re-implementing the same for other DDS vendors, but I understand you're probably only focused on getting it working for rmw_fastrtps.

ivanpauno · 2020-08-03T13:25:01Z

design/service-discovery.md

+
+#### Note on the GUID format on the USER_DATA ####
+
+At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form using `operator<<` and `operator>>` to read and write.


Do all vendors use the same text format?
It would be great if the format used could be cross-vendor compatible in the future (services aren't cross-vendor compatible now).

There is no standardized text format for the GUID (the specification only describes it as a 16-octet value). If we are looking for future interoperability between vendors, raw octet values can be used. As the size of the GUID is fixed to 16 octets, there should be no problems with the parsing.

In any case, some kind of prefix/delimiter must be used to be able to parse the GUID info from other USER_DATA that may be added. We used the responseGUID: string as starting delimiter.
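As a rough illustration of the tagging idea, the sketch below writes and reads a `responseGUID:` entry that may sit among other `USER_DATA` content. The fixed-width hex payload is an assumption made for this example only; it is neither the `operator<<` text form nor the raw-octet form discussed above. `Guid`, `encode_response_guid` and `parse_response_guid` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a 16-octet DDS GUID (not the Fast DDS GUID_t type).
using Guid = std::array<std::uint8_t, 16>;

// Tag taken from the design; the hex payload is an assumption of this sketch.
const std::string kTag = "responseGUID:";

// Encode the GUID as the tag followed by 32 lowercase hex digits.
std::string encode_response_guid(const Guid & guid)
{
  static const char digits[] = "0123456789abcdef";
  std::string out = kTag;
  for (std::uint8_t octet : guid) {
    out.push_back(digits[octet >> 4]);
    out.push_back(digits[octet & 0x0F]);
  }
  return out;
}

// Convert one hex digit to its value, or -1 if it is not a hex digit.
int hex_value(char c)
{
  if (c >= '0' && c <= '9') {return c - '0';}
  if (c >= 'a' && c <= 'f') {return c - 'a' + 10;}
  if (c >= 'A' && c <= 'F') {return c - 'A' + 10;}
  return -1;
}

// Find the tag anywhere in USER_DATA and decode the GUID that follows it.
// Returns false when the tag is absent or the payload is malformed, which
// the caller would treat as "remote endpoint uses legacy behavior".
bool parse_response_guid(const std::string & user_data, Guid & guid_out)
{
  const std::size_t pos = user_data.find(kTag);
  if (pos == std::string::npos) {return false;}
  const std::size_t start = pos + kTag.size();
  if (user_data.size() < start + 2 * guid_out.size()) {return false;}
  Guid guid{};
  for (std::size_t i = 0; i < guid.size(); ++i) {
    const int hi = hex_value(user_data[start + 2 * i]);
    const int lo = hex_value(user_data[start + 2 * i + 1]);
    if (hi < 0 || lo < 0) {return false;}
    guid[i] = static_cast<std::uint8_t>(hi * 16 + lo);
  }
  guid_out = guid;
  return true;
}
```

Because the payload has a fixed length, no closing delimiter is needed, which is the same property that makes the raw 16-octet form easy to parse.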

hidmic

LGTM, but let's wait for @ivanpauno, and perhaps @wjwwood or @jacobperron.

hidmic · 2020-08-07T20:03:25Z

design/service-discovery-images/ClientListener-OnSubscriptionMatched.svg

+  if (complete_matches_.find(response_publisher_guid)) then (yes)
+    : pending_matches_[response_publisher_guid] = complete_matches_[response_publisher_guid];
+    : complete_matches_.erase(response_publisher_guid);
+    : complete_matches_count_.store(complete_matches_.size());


@IkerLuengo I think complete_matches_count_ should be decremented before complete_matches_ is modified.

Good catch. Also corrected on ClientPubListener::OnPublicationMatched.
Replaced the store of the map size with fetch_add(1) and fetch_sub(1).

hidmic · 2020-08-07T20:04:56Z

design/service-discovery.md

+* Add `complete_matches_` to keep track of fully matched servers. It will be an unordered map with the server response publisher's GUID as key:
+  `std::unordered_map<eprosima::fastrtps::rtps::GUID_t, eprosima::fastrtps::rtps::GUID_t, rmw_fastrtps_shared_cpp::hash_fastrtps_guid>`
+
+* Add `complete_matches_count_` to hold the size of `complete_matches_`, to be used on `rmw_service_server_is_available`. It will be an atomic variable `std::atomic_size_t`. This variable will be updated every time an entry is added or removed in `complete_matches_`, and its purpose is to avoid `rmw_service_server_is_available` competing for locks to `complete_matches_`.


@IkerLuengo I think that std::atomic_size_t should be configured with std::memory_order_seq_cst. If so, please note it.
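A small sketch of how the counter and the map could be kept together, assuming a mutex protects the map and readers only touch the atomic. `CompleteMatches` and its methods are hypothetical names. The memory order is spelled out for clarity, although `std::memory_order_seq_cst` is already the default for `std::atomic` operations.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

using Guid = std::string;  // hypothetical stand-in for a DDS GUID

class CompleteMatches
{
public:
  void add(const Guid & response_publisher, const Guid & request_subscriber)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (map_.emplace(response_publisher, request_subscriber).second) {
      // Increment only after the entry is really in the map.
      count_.fetch_add(1, std::memory_order_seq_cst);
    }
  }

  void remove(const Guid & response_publisher)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(response_publisher);
    if (it == map_.end()) {return;}
    // Decrement first so readers never observe a count above the map size.
    count_.fetch_sub(1, std::memory_order_seq_cst);
    map_.erase(it);
  }

  // Lock-free read, so availability checks do not compete for the mutex.
  bool any() const {return count_.load(std::memory_order_seq_cst) > 0;}

private:
  std::mutex mutex_;
  std::unordered_map<Guid, Guid> map_;
  std::atomic<std::size_t> count_{0};
};
```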

ivanpauno · 2020-08-07T20:20:19Z

LGTM with @hidmic comments addressed

hidmic · 2020-08-24T16:26:10Z

@IkerLuengo friendly ping.

JaimeMartin · 2020-08-24T17:00:10Z

Hi Michel, Iker comes back tomorrow from vacation.

Signed-off-by: Iker Luengo <[email protected]>

IkerLuengo · 2020-08-25T09:14:35Z

Modified as suggested by @hidmic.

clalancette

Overall, I think the idea makes a lot of sense. Thanks for pushing on this.

I have a few concerns inline which are mostly about some implementation details, but nothing that would prevent us from going forward with this idea.

Finally, @eboasson I know this particular article is about Fast DDS, but I think a similar idea would apply to Cyclone DDS as well. You've mentioned as much in some of your previous feedback (like ros2/rmw_cyclonedds#187 (comment)). Would you mind taking a look here and leaving your thoughts? If this becomes a generic solution, then I would take the implementation-independent parts out of this document and put them somewhere more generic (maybe https://design.ros2.org). Thanks.

clalancette · 2020-11-04T15:15:21Z

design/service-discovery.md

+
+### General description of the solution ###
+
+The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endoints belong to the same client.


Suggested change

The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endoints belong to the same client.

The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID on the `USER_DATA` of the publisher, so that the server can know both remote endpoints belong to the same client.

clalancette · 2020-11-04T15:17:18Z

design/service-discovery.md

+
+## Server side ##
+
+On the server side the general idea is to hold the incoming requests until the response publisher has matched with the response subscriber **that corresponds to the request publisher** sending the request.


What happens on the server side if this never happens? That is, assume that the client comes up, establishes the request publication/subscription, sends a request and then crashes before the response subscription is fully setup? Will the server side hold on to the request forever?

clalancette · 2020-11-04T15:26:19Z

design/service-discovery.md

+This method creates the request subscriber first and the response publisher afterwards.
+The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber.


I think this is a bit confusing. In particular, all of the other sections here describe what is going to be changed, while this one starts out by describing the current situation. I'll suggest just changing this to:

Suggested change

This method creates the request subscriber first and the response publisher afterwards.

The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber.

The response publisher must be created before the request subscriber so that we can get the publisher's GUID and store it in the `USER_DATA` of the `subscriberParam` before creating the subscriber.

clalancette · 2020-11-04T15:29:04Z

design/service-discovery.md

+In order be able to add other information to the `USER_DATA` in the future, we must *tag* the information somehow. The proposed format is:
+
+```
+  responseGUID:<GUID>


I don't want to bikeshed too much, but I'll suggest that this include the word service in it somehow. Maybe:

Suggested change

responseGUID:<GUID>

serviceresponseGUID:<GUID>

clalancette · 2020-11-04T15:39:02Z

design/service-discovery.md

+#### rmw_service_server_is_available ####
+
+* If the `complete_matches_` map contains elements, there is at least one server fully matched.
+* Else, either there is none fully matched or there are matched servers that do not implement this solution. Just in case, we revert to legacy behavior, checking the number of remote and local endpoints, but without ensuring they match with each other.


Hm, but doesn't this fallback behavior mean that we can still have the racing problem with a fully-compliant solution? That is, if both the client and the server implement this solution, but the client is half-matched, rmw_service_server_is_available will still return true (because of the legacy behavior). Or am I missing something?
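One possible reading of this concern, as a minimal sketch. The legacy check is simplified here to "at least one remote endpoint matched on each topic", which is an assumption of this example and not the exact rmw_fastrtps logic; all names are hypothetical.

```cpp
#include <cstddef>

// Simplified view of what the client knows; all names are hypothetical.
struct ClientDiscoveryState
{
  std::size_t complete_matches;       // servers with both topics matched
  std::size_t matched_request_subs;   // remote request subscribers matched
  std::size_t matched_response_pubs;  // remote response publishers matched
};

// Legacy check (simplified): both topics have some match, but nothing
// ensures that the two matches belong to the same server.
bool legacy_is_available(const ClientDiscoveryState & s)
{
  return s.matched_request_subs > 0 && s.matched_response_pubs > 0;
}

// Decision as described in the bullets: fully matched servers win,
// otherwise fall back to the legacy check.
bool service_server_is_available(const ClientDiscoveryState & s)
{
  if (s.complete_matches > 0) {return true;}
  return legacy_is_available(s);
}
```

With `{0, 1, 1}` the function returns true even though no server is fully matched, for example when the request side is matched to one compliant server and the response side to another.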

eboasson · 2020-11-09T09:56:27Z

Overall, I think the idea makes a lot of sense. Thanks for pushing on this.

I have a few concerns inline which are mostly about some implementation details, but nothing that would prevent us from going forward with this idea.

Finally, @eboasson I know this particular article is about Fast DDS, but I think a similar idea would apply to Cyclone DDS as well. You've mentioned as much in some of your previous feedback (like ros2/rmw_cyclonedds#187 (comment)). Would you mind taking a look here and leaving your thoughts? If this becomes a generic solution, then I would take the implementation-independent parts out of this document and put them somewhere more generic (maybe https://design.ros2.org). Thanks.

There are three independent problems here:

1. Associating the two endpoints with the client/service. What is proposed here is not so different from what Cyclone's RMW currently implements, but:
   1. C. adds a clientid/serviceid "property" to the USER_DATA QoS for the client/service's request and response endpoints, which I think is much more elegant because any reader/writer can immediately be associated with a client/service.
   2. C. follows the formatting of the USER_DATA of the participant. (Originally used for node name/type, today only used for the enclave property.) The content is the participant's GUID prefix followed by a unique number, formatted as a series of 2-digit hex numbers separated by dots, but any unique string will do.
2. Discovery of the service by the client being independent of discovery of the client by the service. Receipt of a request implies most of the client has been discovered, just not that the client's response reader has been discovered yet. Here, C. chooses to delay the sending of the response rather than the handling of the request, to avoid having to store and track deferred requests. In practice the wait generally ends up being a very short delay, and that only the first time a new client issues a request immediately after being created.

   The fundamental problem here is the (current) inability to detect that the remote side has completed discovery. The effort spent on designing workarounds (which this discovery redesign is, too) would arguably be better spent on fixing DDS for guaranteeing this. It is something that can be dealt with within the DDS implementations, the only thing requiring updates to the specifications is providing such a guarantee across implementations.
3. The ROS2 service model doesn't really allow for network connectivity issues: once "service_is_available" returns true, the assumption is that a service is and will remain available, and that sending a request will result in receiving a response. There are failure cases where the client will not be able to detect with certainty that a response will not be received (it is a bit of an edge case, but it can happen if the service loses sight of the client temporarily, especially in combination with multiple outstanding requests). Most cases can be detected, but at a pretty significant increase in complexity in the RMW implementation.
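To make the id format from the first point concrete, here is a small sketch of the "2-digit hex numbers separated by dots" formatting. The 12-octet GUID prefix is standard RTPS; the width and byte order of the unique number (4 octets, big-endian) and the name `format_endpoint_id` are assumptions of this example and may differ from what Cyclone's RMW does.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// RTPS GUID prefix: the 12 octets shared by all entities of a participant.
using GuidPrefix = std::array<std::uint8_t, 12>;

// Format prefix + unique number as dot-separated 2-digit hex octets.
// Assumption: the number is written as 4 octets, most significant first.
std::string format_endpoint_id(const GuidPrefix & prefix, std::uint32_t unique_number)
{
  std::string out;
  char buf[3];
  auto append = [&](std::uint8_t octet) {
      if (!out.empty()) {out.push_back('.');}
      std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(octet));
      out += buf;
    };
  for (std::uint8_t octet : prefix) {append(octet);}
  for (int shift = 24; shift >= 0; shift -= 8) {
    append(static_cast<std::uint8_t>((unique_number >> shift) & 0xFF));
  }
  return out;
}
```

Since the same id would be attached to both the request and the response endpoint, a remote side can pair them by string comparison alone.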

Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition). This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl (or some other, new library).

ivanpauno · 2020-11-09T14:42:12Z

C. adds a clientid/serviceid "property" to the USER_DATA QoS for the client/service's request and response endpoints, which I think is much more elegant because any reader/writer can immediately be associated with a client/service.

👍 The client/service id (here) solution sounds more elegant.

C. follows the formatting of the USER_DATA of the participant.

👍 I would also use the same format.

The ROS2 service model doesn't really allow for network connectivity issues: once "service_is_available" returns true, the assumption is that a service is and will remain available, and that sending a request will result in receiving a response

👍

Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition)

I'm not sure I understand, would the client have a guard condition to indicate if the server availability changed or something like that?

This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl

The discovery issue in the "remote" side sounds like a DDS specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general), all other changes to handle "network connectivity issues" that are not DDS specific I think the best thing to do is to handle them in rcl (which will likely involve extending rcl API).

clalancette · 2020-11-09T16:35:25Z

2\. The fundamental problem here is the (current) inability to detect that the remote side has completed discovery. The effort spent on designing workarounds (which this discovery redesign is, too) would arguably be better spent on fixing DDS for guaranteeing this. It is something that can be dealt with within the DDS implementations, the only thing requiring updates to the specifications is providing such a guarantee across implementations.

I'm not sure I understand this bit. Fundamentally, there is a race that exists here because you have 2 independent topics that implement the service. Both this solution (and the slightly different one in Cyclone) resolve that race by binding the two topics together somehow. How would you resolve this differently in the specification?

The discovery issue in the "remote" side sounds like a DDS specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general), all other changes to handle "network connectivity issues" that are not DDS specific I think the best thing to do is to handle them in rcl (which will likely involve extending rcl API).

Yeah, agreed here. I can definitely imagine other RMWs where the service is a true RPC call (more like DDS-RPC), and so you don't have this particular issue. So I think this part makes sense to implement in rmw_dds_common.

eboasson · 2020-11-09T19:12:04Z

Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition)

I'm not sure I understand, would the client have a guard condition to indicate if the server availability changed or something like that?

What I was thinking of is the following: suppose the service defers requests from clients for which it hasn’t yet discovered the response reader, then the service implementation within the RMW layer suddenly has to monitor two sources of requests. The first is that of the requests arriving at the data reader for requests, and the second is that of the deferred requests for which the discovery happens to now have completed.

In the implementation, the application typically sits there waiting in the rmw_wait operation, but that one (ideally) maps to a DDS waitset (the impedance mismatch between the DDS waitset and the RMW waitset will hopefully get resolved at some point). The set of deferred requests lives outside DDS, and so you’d need some mechanism to trigger the waitset when a deferred request becomes ready. There are some complications on that front, none of them insurmountable, but they do add a fair amount of complexity.

Reworking the request handling to respond to the discovery information exchanged already by rmw_dds_common and the receipt of requests, but driven by a separate thread inside the RMW implementation (much like the discovery thread), would make life quite straightforward again. Then, any time a request becomes ready, that common implementation could simply trigger a guard condition, and at that point, as far as the waitset is concerned, a service is simply associated with a guard condition that gets triggered whenever there is a request waiting for the service.

Such an implementation would ideally be done in rmw_dds_common, I’d say. Unless:

This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl

The discovery issue in the "remote" side sounds like a DDS specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general), all other changes to handle "network connectivity issues" that are not DDS specific I think the best thing to do is to handle them in rcl (which will likely involve extending rcl API).

I agree that the particular problem is quite DDS-specific at the moment. My view is that DDS ought to be improved to allow waiting until the remote has discovered the local entities (at which point service_is_available could just build on that, without a need to defer requests or delay sending the response). My thinking was that if DDS got this wrong for 15 years and counting, perhaps there are other middlewares that didn’t get this quite right either, and that it could be useful to have a service mechanism that takes care of this problem more generally. If so moving it into rcl or a new library could be useful. But it could equally well be done in rmw_dds_common first, it can always be generalized later.

(@clalancette, perhaps this also answers your question?)

ivanpauno · 2020-11-11T15:24:26Z

In the implementation, the application typically sits there waiting in the rmw_wait operation, but that one (ideally) maps to a DDS waitset (the impedance mismatch between the DDS waitset and the RMW waitset will hopefully get resolved at some point). The set of deferred requests lives outside DDS, and so you’d need some mechanism to trigger the waitset when a deferred request becomes ready. There are some complications on that front, none of them insurmountable, but they do add a fair amount of complexity.

Ah yeah, that's true.
There's a "manually implemented" wait set using a mutex/cond_var pair and listeners in rmw_fastrtps, so in the case here it's easier to "trigger" the wait set.
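For readers unfamiliar with the pattern, a guard condition built on a mutex/cond_var pair can be sketched roughly as below. This is a generic illustration with a hypothetical `GuardCondition` class, not the rmw_fastrtps wait set: a thread that finds a deferred request ready would call `trigger()`, and the waiting side would wake up in `wait_for()`.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class GuardCondition
{
public:
  // Mark the condition as triggered and wake up any waiter.
  void trigger()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      triggered_ = true;
    }
    cv_.notify_all();
  }

  // Block until triggered or until the timeout expires.
  // Returns true if triggered; the flag is reset for the next wait.
  bool wait_for(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool ok = cv_.wait_for(lock, timeout, [this] {return triggered_;});
    triggered_ = false;
    return ok;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_ = false;
};
```

Keeping the flag under the mutex means a trigger that happens before the wait starts is not lost, which matters when a deferred request becomes ready just before the application re-enters the wait.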

Reworking the request handling to respond to the discovery information exchanged already by rmw_dds_common and the receipt of requests, but driven by a separate thread inside the RMW implementation (much like the discovery thread), would make life quite straightforward again. Then, any time a request becomes ready, that common implementation could simply trigger a guard condition, and at that point, as far as the waitset is concerned, a service is simply associated with a guard condition that gets triggered whenever there is a request waiting for the service.

That sounds reasonable to me.

Unrelated note: about the impedance mismatch between DDS and ROS 2 waitset, that triggered the discussions in ros2/design#305 and in this discourse post.

hidmic · 2021-05-31T14:20:28Z

@IkerLuengo friendly ping !

Fast-DDS service discovery redesign

a18257e

Signed-off-by: Iker Luengo <[email protected]>

IkerLuengo mentioned this pull request Jul 31, 2020

Improve service discovery #392

Open

hidmic reviewed Jul 31, 2020

corrections as suggested on review

1029eb3

Signed-off-by: Iker Luengo <[email protected]>

ivanpauno reviewed Aug 3, 2020

hidmic approved these changes Aug 7, 2020

ivanpauno approved these changes Aug 7, 2020

complete_matches_count_ tuning

7af84b2

Signed-off-by: Iker Luengo <[email protected]>

hidmic approved these changes Aug 25, 2020

clalancette assigned wjwwood and dirk-thomas Sep 10, 2020

JLBuenoLopez mentioned this pull request Oct 21, 2020

Discriminate when the Client has gone from when the Client has not completely matched #467

Merged

dirk-thomas removed their assignment Oct 29, 2020

clalancette reviewed Nov 4, 2020

hidmic mentioned this pull request Nov 20, 2020

Not getting service responses reliably when using CycloneDDS ros2/rmw_cyclonedds#74

Open

EduPonz mentioned this pull request Apr 18, 2022

Unhandled Terminate w/ Services Over Wifi Network ros2/ros2#1253

Closed

audrow changed the base branch from master to rolling June 28, 2022 14:22


		The Service Mapping relies on the built-in discovery service provided by DDS. However, this discovery is not robust and discovery race conditions can occur, resulting in service replies being lost. The reason is that the request and reply topics are independent at DDS level, meaning that they are matched independently. Therefore, it is possible that the request topic entities are matched while the response entities are not fully matched yet. In this situation, if the client makes a request, the response will be lost.

		On the client side this is partially solved checking the result of method `rmw_service_server_is_available` before sending any request. However, current implementation only checks that the request publisher and resonse subscribers are matched to any remote endpoint, but not that these remote endpoints correspond to the same servers. That is, the request publisher could be matched to one server ant the response subsciber to another one, so that any request will still be missing its response.

	* The algorithm has to be able support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable.
	* The algorithm has to be able to support remote endpoints that do not add the response GUID on the `USER_DATA`. This is to keep compatibility with older versions and other vendors. In these cases, legacy behavior is acceptable.


		At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form using `operator<<` and `operator>>` to read and write.

		In order to be able to add other information to the `USER_DATA` in the future, we must tag the information somehow. The proposed format is `responseGUID:<GUID>`, where `responseGUID:` is a string literal and `<GUID>` is the character string form of the GUID of the related endpoint, as formatted by `operator<<`.

		Using `properties` instead of `USER_DATA` would be preferable in this case, but unfortunately `properties` are not available at publisher/subscriber level on Fast DDS at this moment.
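
As a minimal sketch of the tagging scheme described above, the following C++ fragment builds and parses a tagged `USER_DATA` payload. The helper names `encode_response_guid` and `decode_response_guid` are hypothetical, the GUID is treated as an already-formatted opaque string (its real text form is whatever `operator<<` produces), and no Fast DDS types are used.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Tag literal named in the design text. Everything else here is illustrative.
static const std::string kResponseGuidTag = "responseGUID:";

// Hypothetical helper: builds the USER_DATA bytes from an already-formatted GUID string.
std::vector<uint8_t> encode_response_guid(const std::string & guid_text)
{
  const std::string payload = kResponseGuidTag + guid_text;
  return std::vector<uint8_t>(payload.begin(), payload.end());
}

// Hypothetical helper: returns true and fills guid_text if the payload starts
// with the tag. Returns false otherwise, which the caller can treat as a
// remote endpoint that does not share the GUID (legacy behavior).
bool decode_response_guid(const std::vector<uint8_t> & user_data, std::string & guid_text)
{
  const std::string payload(user_data.begin(), user_data.end());
  if (payload.compare(0, kResponseGuidTag.size(), kResponseGuidTag) != 0) {
    return false;
  }
  guid_text = payload.substr(kResponseGuidTag.size());
  return true;
}
```

A payload without the tag (including an empty `USER_DATA`) simply fails to decode, which lines up with the backwards-compatibility requirement: no entry is added to the map for that endpoint.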

-	* If the `USER_DATA` dos not contain a GUID, do not add an entry to the map. This signals that the remote endpoint is not compiant with these modifications (for backwards compatibility and with other vendors).
+	* If the `USER_DATA` does not contain a GUID, do not add an entry to the map. This signals that the remote endpoint is not compliant with these modifications (for backwards compatibility and with other vendors).

-	* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request subscriber is still umatched or that the remote server does not implement this solution. In either case there is nothing to do.
+	* In the case of a match, if the remote GUID is not on `pending_matches_`, it means that either the request publisher is still unmatched or that the remote server does not implement this solution. In either case there is nothing to do.


		#### Note on the GUID format on the USER_DATA ####

		At the moment, the `USER_DATA` is not being used on the RMW endpoints. We can simply add the GUID in text form, using `operator<<` to write it and `operator>>` to read it.


		### General description of the solution ###

		The client will create the response subscriber before the request publisher. When creating the request publisher, it will insert the response subscriber's GUID in the `USER_DATA` of the publisher, so that the server can know that both remote endpoints belong to the same client.


		## Server side ##

		On the server side the general idea is to hold the incoming requests until the response publisher has matched with the response subscriber that corresponds to the request publisher sending the request.
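
The hold-back idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PendingRequests` and its method names are hypothetical, GUIDs and requests are plain strings, and the three callbacks stand in for whatever discovery and data listeners the real implementation uses. A client that did not share its response GUID is processed immediately, as in the legacy behavior.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: holds requests until the client's response subscriber is matched.
class PendingRequests
{
public:
  using Handler = std::function<void (const std::string &)>;

  explicit PendingRequests(Handler handler)
  : handler_(std::move(handler)) {}

  // A remote request publisher was discovered. response_sub_guid is the GUID
  // read from its USER_DATA, or empty if the remote did not share one.
  void on_request_publisher_matched(
    const std::string & request_pub_guid, const std::string & response_sub_guid)
  {
    if (!response_sub_guid.empty()) {
      response_of_[request_pub_guid] = response_sub_guid;
    }
  }

  // The response publisher matched a remote response subscriber:
  // release every request that was held for that client.
  void on_response_subscriber_matched(const std::string & response_sub_guid)
  {
    matched_.insert(response_sub_guid);
    auto it = held_.find(response_sub_guid);
    if (it == held_.end()) {
      return;
    }
    std::vector<std::string> ready = std::move(it->second);
    held_.erase(it);
    for (const auto & request : ready) {
      handler_(request);
    }
  }

  // A request arrived from the given request publisher.
  void on_request(const std::string & request_pub_guid, const std::string & request)
  {
    auto it = response_of_.find(request_pub_guid);
    if (it == response_of_.end() || matched_.count(it->second) > 0) {
      handler_(request);  // legacy client, or client already fully matched
      return;
    }
    held_[it->second].push_back(request);  // hold until the response side matches
  }

private:
  Handler handler_;
  std::map<std::string, std::string> response_of_;         // request pub GUID -> response sub GUID
  std::set<std::string> matched_;                          // matched response subscriber GUIDs
  std::map<std::string, std::vector<std::string>> held_;   // requests held per response sub GUID
};
```

Keying the held requests by the response subscriber GUID means a single match event can flush everything pending for that client in arrival order. Unmatching and entity removal are deliberately left out of this sketch.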

		This method creates the request subscriber first and the response publisher afterwards.
		The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber.

-	This method creates the request subscriber first and the response publisher afterwards.
-	The order of the creations must be inverted, so that we can get the publisher's GUID and store it on the `USER_DATA` on the `subscriberParam` before creating the subscriber.
+	The response publisher must be created before the request subscriber so that we can get the publisher's GUID and store it in the `USER_DATA` of the `subscriberParam` before creating the subscriber.

IkerLuengo commented Jul 31, 2020

hidmic left a comment

IkerLuengo commented Aug 3, 2020

ivanpauno left a comment

hidmic left a comment

IkerLuengo Aug 25, 2020 (edited)

ivanpauno commented Aug 7, 2020

hidmic commented Aug 24, 2020

JaimeMartin commented Aug 24, 2020 via email

IkerLuengo commented Aug 25, 2020

clalancette left a comment

eboasson commented Nov 9, 2020

ivanpauno commented Nov 9, 2020

clalancette commented Nov 9, 2020

eboasson commented Nov 9, 2020

ivanpauno commented Nov 11, 2020

hidmic commented May 31, 2021
