Lost service responses (#183, #74) #187
Conversation
Thanks for fixing this @eboasson! At first glance, we should be able to backport this for Eloquent and Dashing. Do you agree? Or has anything fundamentally changed within rmw_cyclonedds_cpp since?
@jacobperron I think it'd be great if we can get this in for Foxy.
rmw_cyclonedds_cpp/src/rmw_node.cpp
//
// it is pretty horrid ... but blocking the service is the only option if
// the client is unable to determine that it has been fully discovered by
// the service.
it is pretty horrid ...
I'm fine with this as long as we agree that this isn't a solution but a workaround for distros up and including Foxy.
Wait why is this needed? Shouldn't the requester wait for a bidirectional match before sending a request? If, in the meantime, the response channel goes away I think this should just fail, not block... Blocking without an external option for a timeout is not very nice in my opinion. The user may not even know it is blocking or for how long.
The requester is definitely the one that should be doing the waiting. Unfortunately, DDS doesn’t offer a mechanism for waiting for a bidirectional match. Cyclone does guarantee that the writer having matched the reader is sufficient for that reader to receive the data (modulo disconnections and things like that), but that gives you one-way only.
Secondly, the simple counting that is done here (and in other RMW layers) only works if there is but a single server. You really ought to wait until you have a matched pair, but that would additionally require DDS to be aware of these pairs, and it isn't. We could use, e.g., the USER_DATA QoS to make the mapping clear at the RMW level, but if service interoperability is desirable, that needs to be agreed on by a wider audience.
What has been on my to-do list for Cyclone for some time is to make it possible to wait until all local entities have been discovered by all remote entities, much like “wait for acknowledgements” allows you to do this for a regular writer and its readers. Even then, there are little details, for example, one would have to guarantee that effects of a discovery message node P sent to node Q have occurred prior to P receiving Q’s acknowledgment. That’s not guaranteed anywhere in the spec and also not today in Cyclone because it processes all discovery data asynchronously, but it is an important primitive.
As to this change: I considered a variant (check in rmw_take_request) but the same quality of implementation comes out way more complicated there. A good one would put the request aside and retry/prune based on “publication matched” events. It's feasible (I should still have a proof-of-concept implementation lying around for something similar as far as the waitset goes) but it is significantly more complex. That'd probably be time better spent on fixing the root cause.
Another mapping would use transient local requests and responses, using the request id as key. It is easy enough in Cyclone, but that’d also be quite a big departure from what has been done until now (by any RMW as far as I know).
The only indefinite blocking this does is when there are clients arriving and leaving all the time. If the current client simply disappears, both counters will end up being the same once the disappearance is detected. But, yeah ...
Actually, on continued consideration, I could probably make it work without this hack and without adding a "wait for discovery to complete" operation, but I do need to think through the consequences of the changes it would require in Cyclone. It also wouldn't solve the problem if there are multiple matching services.
So it is not yet a given that it'll happen, and it certainly won't be available immediately.
I think we can all agree that this isn't the correct fix, but a workaround. And @eboasson has been open about it from the get-go. As I see it, the sole purpose of this patch is to reduce the likelihood of a discovery race impairing service performance.
Considering the current behavior is objectively worse (once we've guaranteed an indefinite wait cannot happen here) and that we're about to release an LTS distro, I'd be very much inclined to land this and document accordingly. I'm sure @jacobperron and @dirk-thomas have their own opinions as well.
IIUC, it seems like a general design flaw with trying to build ROS services on top of DDS, which plagues all of our RMWs.
A workaround sounds okay for now if it's making the situation better; in this particular case, I would want certainty that the logic is not going to cause an indefinite wait. Maybe that means introducing a timeout.
rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -3135,6 +3147,16 @@ extern "C" rmw_ret_t rmw_send_response(
  cdds_request_header_t header;
  memcpy(&header.guid, request_header->writer_guid, sizeof(header.guid));
  header.seq = request_header->sequence_number;
  // if the number of writers matching our request reader equals the number
  // of readers matching our response writer, we have a pretty decent claim
  // to have discovered this client's response reader.
@eboasson nit: consider moving this comment block closer to the check_for_response_reader() function.
we have a pretty decent claim
👀
Any chance we can tell if remote writer and reader belong to the same server instance? Or at least to the same participant?
I actually started working out how to do that, and it is definitely possible to determine that they are from the same participant. But now that all nodes in a process (well, context) share the same participant, that seems insufficiently selective. I couldn't find a way to tie them to the node (which I think is selective enough) short of using USER_DATA. That seemed like it might be unwise. I suppose the graph cache interface could be extended to answer this question.
What we do know here is that we have discovered the client’s writer (else we could never have received its request), and that therefore the client’s writer is accounted for in the “subscription matched” status. Thus, it is pretty likely that the service writer’s match count won’t equal the service’s reader match count unless that client’s reader has been discovered. But there is no guarantee: if another client shows up and its reader happens to be discovered before its writer, you will draw the wrong conclusion.
In the absence of packet loss, that is highly unlikely: the client creates the writer first, and under that assumption the discovery of the writer will precede the discovery of the reader. Indeed, I wouldn't be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.
Indeed, I wouldn’t be surprised if simply swapping the creation of the reader and the writer in the client would make the problem disappear in the test setup.
Interesting. Have you tried it? Whatever we can do to reduce the likelihood of an indefinite wait is worth exploring IMO.
// the service.
while (!check_for_response_reader(info->service.sub->enth, info->service.pub->enth)) {
  dds_sleepfor(DDS_MSECS(10));
}
@eboasson what if the client goes away before sending the response? I'm fine with the busy wait, but it should timeout at some point. Unless we can detect that the request writer went away.
The DDS discovery will discover the disappearance of the client, remove all matches with it and decrement the current_count. So I believe that case is covered. (But as I remarked above, there is a problem if you create/delete clients all the time.)
I see. I'm still a bit wary about indefinite waits. Even if it rarely occurs, to timeout and fail on replying would give calling code a chance to do something about it.
@eboasson any chance we can leave the loop after some period of time? I'd expect it to be innocuous, unless there's such traffic loss that it's unable to succeed in, say, 100 ms or more. In which case having the service server throw would be better than having either service server or service client hang silently.
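A bounded version of that wait can be sketched with the standard library alone. Here, wait_for_match and its return codes are hypothetical stand-ins for the loop in rmw_send_response, not the actual PR code; the 10 ms poll interval simply mirrors the dds_sleepfor(DDS_MSECS(10)) quoted above:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the rmw return codes.
enum class wait_result { ok, timeout };

// Poll a discovery predicate every poll_ms milliseconds, giving up after
// total_ms so the caller gets a chance to react instead of blocking forever.
wait_result wait_for_match(
  const std::function<bool()> & matched, int poll_ms, int total_ms)
{
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + std::chrono::milliseconds(total_ms);
  while (!matched()) {
    if (clock::now() >= deadline) {
      return wait_result::timeout;  // caller could surface RMW_RET_TIMEOUT
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(poll_ms));
  }
  return wait_result::ok;
}
```

The shape matters more than the numbers: any budget (100 ms or otherwise) is arbitrary, which is exactly the objection raised later in the thread.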
rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -3628,7 +3650,6 @@ extern "C" rmw_ret_t rmw_service_server_is_available(
  ret =
    common_context->graph_cache.get_writer_count(sub_topic_name, &number_of_response_publishers);
  if (ret != RMW_RET_OK || 0 == number_of_response_publishers) {
-   // error
@eboasson why this change?
The error case was self-evident and the comment was incorrect for the common case (no response publishers discovered yet). But, yeah, perhaps that ought to be a separate commit.
rmw_cyclonedds_cpp/src/rmw_node.cpp
if (dds_get_subscription_matched_status(request_reader, &sm) < 0 ||
  dds_get_publication_matched_status(response_writer, &pm) < 0)
{
  return RMW_RET_ERROR;
Oops! Thanks! It’ll probably never get there (the function can’t fail if the reader/writer exist) but that’s no excuse for messing up the type so badly ...
I have to think about the fix; it might be that returning true is better than returning false, because if this fails once, it'll probably fail the next time, too. Returning true would definitely require a comment, and it'd probably make more sense to return an rmw_ret_t.
+1 to discriminating error conditions from entity absence.
Force-pushed from 74e71e7 to 4c3b8fa
Apologies for the force push ... the original two commits are still there, but I noticed that the subject line mentioned the wrong function and I didn't want to run the risk of that mistake getting merged. What is new is:
The mechanism employed is generating unique client/service identifiers (based on the participant GUID) and storing these in the reader/writer USER_DATA. It adds overhead and complexity to the previous proposal but as far as I can tell, it doesn't suffer from ever having to wait indefinitely (it is by design immune to adding/removing other clients, and it still handles the case where the client has disappeared). But clearly this is still a workaround.
Thanks for putting in the time @eboasson! I wonder though, is this change backwards compatible (i.e. can it work with a peer participant running without this patch, with the known potential races)? Also, won't this change compromise cross-vendor communication?
rmw_cyclonedds_cpp/src/rmw_node.cpp
void * ud;
size_t udsz;
if (!dds_qget_userdata(qos, &ud, &udsz)) {
  std::map<std::string, std::vector<uint8_t>> emptymap;
@eboasson nit: I'd think any compiler would be smart enough to copy-elide the return value, but
std::map<std::string, std::vector<uint8_t>> map;
void * ud;
size_t udsz;
if (dds_qget_userdata(qos, &ud, &udsz)) {
std::vector<uint8_t> udvec(static_cast<uint8_t *>(ud), static_cast<uint8_t *>(ud) + udsz);
dds_free(ud);
map = rmw::impl::cpp::parse_key_value(udvec);
}
return map;
would make it simpler and slightly clearer.
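For readers unfamiliar with it, rmw::impl::cpp::parse_key_value turns the "key=value;key=value;" USER_DATA bytes into a map. The following self-contained sketch illustrates that decoding; parse_key_value_sketch is an illustrative reimplementation under that assumed encoding, not the rmw function itself:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for rmw::impl::cpp::parse_key_value, assuming the
// "key=value;key=value;" encoding used for participant USER_DATA.
std::map<std::string, std::vector<uint8_t>> parse_key_value_sketch(
  const std::vector<uint8_t> & data)
{
  std::map<std::string, std::vector<uint8_t>> out;
  const std::string s(data.begin(), data.end());
  size_t pos = 0;
  while (pos < s.size()) {
    const size_t semi = s.find(';', pos);
    const std::string pair =
      s.substr(pos, semi == std::string::npos ? std::string::npos : semi - pos);
    const size_t eq = pair.find('=');
    if (eq != std::string::npos) {
      const std::string value = pair.substr(eq + 1);
      out[pair.substr(0, eq)] = std::vector<uint8_t>(value.begin(), value.end());
    }
    if (semi == std::string::npos) {break;}
    pos = semi + 1;
  }
  return out;
}
```

With the "serviceid" and "clientid" keys this PR introduces, such a map is what the matching logic would look up.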
rmw_cyclonedds_cpp/src/rmw_node.cpp
dds_entity_t writer, dds_instance_handle_t readerih)
{
  std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
  ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);
@eboasson nit:
- ep(dds_get_matched_subscription_data(writer, readerih), &free_builtintopic_endpoint);
+ ep(dds_get_matched_subscription_data(writer, readerih), free_builtintopic_endpoint);
rmw_cyclonedds_cpp/src/rmw_node.cpp
dds_entity_t reader, dds_instance_handle_t writerih)
{
  std::unique_ptr<dds_builtintopic_endpoint_t, std::function<void(dds_builtintopic_endpoint_t *)>>
  ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);
@eboasson nit:
- ep(dds_get_matched_publication_data(reader, writerih), &free_builtintopic_endpoint);
+ ep(dds_get_matched_publication_data(reader, writerih), free_builtintopic_endpoint);
rmw_cyclonedds_cpp/src/rmw_node.cpp
{
  std::ostringstream os;
  os << std::hex;
  os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
@eboasson why static_cast<uint8_t>(...)? Are you expecting to truncate it?
That cast is there to avoid sign extension (id.writer_guid is an array of int8_t ...)
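The effect is easy to reproduce with just the standard library. Both helpers below are illustrative, not code from this PR; they show why the inner static_cast<uint8_t> is needed when streaming a signed byte in hex:

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Widening a negative int8_t straight to int sign-extends, so the hex
// output gains a run of leading "f" digits.
std::string hex_byte_wrong(int8_t b)
{
  std::ostringstream os;
  os << std::hex << static_cast<int>(b);  // sign-extends for negative values
  return os.str();
}

// Going through uint8_t first keeps the value in 0..255.
std::string hex_byte_right(int8_t b)
{
  std::ostringstream os;
  os << std::hex << static_cast<int>(static_cast<uint8_t>(b));
  return os.str();
}
```

For the byte 0xAB (i.e. int8_t -85, assuming 32-bit int), the first helper prints "ffffffab" while the second prints "ab".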
rmw_cyclonedds_cpp/src/rmw_node.cpp
for (auto rdih : rds) {
  auto rd = get_matched_subscription_data(client.pub->enth, rdih);
  std::string serviceid;
  if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {
@eboasson nit:
- if (rd.get() && get_user_data_key(rd->qos, "serviceid", serviceid)) {
+ if (rd && get_user_data_key(rd->qos, "serviceid", serviceid)) {
rmw_cyclonedds_cpp/src/rmw_node.cpp
for (auto wrih : wrs) {
  auto wr = get_matched_publication_data(client.sub->enth, wrih);
  std::string serviceid;
  if (wr.get() &&
@eboasson nit:
- if (wr.get() &&
+ if (wr &&
rmw_cyclonedds_cpp/src/rmw_node.cpp
    wr->qos, "serviceid",
    serviceid) && needles.find(serviceid) != needles.end())
{
  *is_available = true;
@eboasson consider asserting that is_available != nullptr && !*is_available, which seems to be the implicit assumption in this function.
rmw_cyclonedds_cpp/src/rmw_node.cpp
}
const std::string needle = request_id_writer_guid_to_string(reqid);
// if we have matched this client's reader, all is well
for (auto rdih : rds) {
@eboasson nit:
- for (auto rdih : rds) {
+ for (auto & rdih : rds) {
given that dds_instance_handle_t is just a 64-bit integer, I don't understand why using a reference would be an improvement
rmw_cyclonedds_cpp/src/rmw_node.cpp
}
// if not, we should stop waiting if the writer is no longer there,
// as that implies the client no longer exists
for (auto wrih : wrs) {
@eboasson nit:
- for (auto wrih : wrs) {
+ for (auto & wrih : wrs) {
  // break instead of returns makes gcc happy
  break;
case client_present_t::MAYBE:
  return RMW_RET_TIMEOUT;
@eboasson how could st ever be client_present_t::MAYBE at this point if that wouldn't allow it to leave the loop in the first place?
True, and it would be better to combine it with the ERROR case. I usually go out of my way to treat enums as enumerated types and cover all the cases in switches over them. But C & C++ semantics don't guarantee that one can assign only defined values to an object of an enumerated type, and what then happens is compiler dependent.
Clang doesn't warn if you cover all cases of an enum, presumably on the assumption that you treat the enum as an enum (or perhaps it proves that no other values ever get assigned to it). Gcc warns, presumably because technically other values are legal too and, again presumably, it doesn't do the analysis to prove that no other values ever get assigned to it. As the comment suggests, there used to be a return in line 3294. With that, the MAYBE case makes a bit more sense. Now, it just looks silly.
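A minimal sketch of the gcc-friendly shape being discussed. The enumerator names merely mirror the client_present_t from this thread, and the mapping to return codes is arbitrary, chosen only for illustration:

```cpp
// Hypothetical enum and return codes mirroring the discussion above.
enum class client_present_t { FAILURE, MAYBE, YES, GONE };
enum ret_t { RET_OK, RET_TIMEOUT, RET_ERROR };

// Even when the switch covers every enumerator and each case returns, gcc
// may still warn "control reaches end of non-void function": the language
// allows a client_present_t object to hold values outside the enumerator
// list. Breaking out of the switch and returning once at the end avoids
// the warning without a default case that would mask missing enumerators.
ret_t to_ret(client_present_t st)
{
  ret_t ret = RET_ERROR;
  switch (st) {
    case client_present_t::YES:
      ret = RET_OK;
      break;  // break instead of return keeps gcc happy
    case client_present_t::MAYBE:
      ret = RET_TIMEOUT;
      break;
    case client_present_t::FAILURE:
    case client_present_t::GONE:
      ret = RET_ERROR;
      break;
  }
  return ret;
}
```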
I've left some comments, but the new approach looks pretty good to me
@@ -490,6 +508,33 @@ static void get_entity_gid(dds_entity_t h, rmw_gid_t & gid)
  convert_guid_to_gid(guid, gid);
}

static std::map<std::string, std::vector<uint8_t>> parse_user_data(const dds_qos_t * qos)
nit: prefer std::unordered_map
parse_key_value from rmw uses std::map so I think it is better to leave this as-is
rmw_cyclonedds_cpp/src/rmw_node.cpp
// sizeof((reinterpret_cast<rmw_request_id_t *>(0))->writer_guid)
//
// is not a constant, and the 16 is a hard-coded magic number in
// rmw_request_id_t ...
I don't fully follow the comment, but you can also suppress a cpplint warning by using problematic line of code; // NOLINT.
There are actually two comments with bad formatting: one that I put in originally: "strangely, the writer_guid in a request id is smaller than the rmw_gid_t". I find it weird that the writer_guid in a rmw_request_id_t is 16 bytes whereas a rmw_gid_t has 24 bytes.
The second bit is about lint. I didn't know about // NOLINT, but that's a better solution than a hard-coded constant. Thanks!
rmw_cyclonedds_cpp/src/rmw_node.cpp
static rmw_ret_t get_matched_endpoints(
  dds_entity_t h, dds_return_t (* fn)(
    dds_entity_t h,
    dds_instance_handle_t * xs, size_t nxs), std::vector<dds_instance_handle_t> & res)
nit: defining the function pointer type beforehand greatly improves readability:
using get_matched_endpoints_fn_t = dds_return_t (*)(
  dds_entity_t h,
  dds_instance_handle_t * xs, size_t nxs);

static rmw_ret_t get_matched_endpoints(
  dds_entity_t h, get_matched_endpoints_fn_t fn, std::vector<dds_instance_handle_t> & res)
I must have seen too many C function pointer types ... :) but yes, that would be wise
rmw_cyclonedds_cpp/src/rmw_node.cpp
  dds_free(e);
}

static std::unique_ptr<dds_builtintopic_endpoint_t,
similarly here:
using BuiltinTopicEndpoint = std::unique_ptr<dds_builtintopic_endpoint_t,
std::function<void(dds_builtintopic_endpoint_t *)>>;
rmw_cyclonedds_cpp/src/rmw_node.cpp
std::ostringstream os;
os << std::hex;
os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(id.writer_guid[0]));
for (size_t i = 1; i < sizeof(id.writer_guid) - 1; i++) {
should be < sizeof(id.writer_guid)? it seems that the last byte won't be copied if not.
Yes, absolutely. (I changed it from 0 .. n-1 to 1..n, except I didn't change the upper bound.) There are not enough services/clients in tests to catch this one. This would have been a source of nasty bugs ...
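A corrected formatter can be sketched as follows. guid_to_hex is illustrative, not the exact code in the PR: the loop runs over all n bytes (the off-by-one dropped the last one), the uint8_t cast avoids sign extension, and setfill('0') makes setw(2) pad with zeros rather than spaces:

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Format a GUID-like byte array as lowercase hex, two digits per byte.
std::string guid_to_hex(const int8_t * guid, size_t n)
{
  std::ostringstream os;
  os << std::hex << std::setfill('0');
  for (size_t i = 0; i < n; i++) {
    // cast via uint8_t so negative bytes don't sign-extend
    os << std::setw(2) << static_cast<int>(static_cast<uint8_t>(guid[i]));
  }
  return os.str();
}
```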
rmw_cyclonedds_cpp/src/rmw_node.cpp
std::ostringstream os;
os << std::hex;
os << std::setw(2) << static_cast<int>(id.data[0]);
for (size_t i = 1; i < sizeof(id.data) - 1; i++) {
same about the - 1, it doesn't sound correct ...
rmw_cyclonedds_cpp/src/rmw_node.cpp
}
// first extract all service ids from matched readers
std::set<std::string> needles;
for (auto rdih : rds) {
nit: const auto &
This still applies.
rmw_cyclonedds_cpp/src/rmw_node.cpp
  return RMW_RET_OK;
}
// then scan the writers to see if there is at least one with a service id in the set
for (auto wrih : wrs) {
nit: const auto &
@hidmic and @ivanpauno thanks for reviewing (and catching a few bugs!). I'm fine with all the nitpicks. I'll sort them out tomorrow, but I thought it might make sense to quickly respond to the comments given the time zone differences and the urgency. @hidmic, regarding your two overall questions:
It isn't: firstly, the wire representation of the request id changed from the quick hack of yore to the kind-of sensible Secondly, it now only matches services/clients that have these identifiers. Making it backwards compatible requires treating an "unidentified" reader and writer pair as sufficient in
That doesn't work anyway ... different wire representations, use of vendor-specific tricks ...
I believe the first of the two commits addresses all the comments regarding small details (unless noted otherwise — if I am mistaken, I'll be happy to change them after all, as I don't know the C++ idiom very well). The second addresses backwards compatibility. It also happens to make the code a bit simpler, too. The changes in that commit outside There is still the matter of the desirability of blocking in
Note that I don't think adding a timeout makes sense: the duration would be completely arbitrary and the service implementations would then have to be modified to deal with it.
LGTM
@wjwwood @ivanpauno @jacobperron I'd like your approval as well before merging anything. Test failures seem unrelated.
rmw_cyclonedds_cpp/src/rmw_node.cpp
}
// first extract all service ids from matched readers
std::set<std::string> needles;
for (auto rdih : rds) {
This still applies.
rmw_cyclonedds_cpp/src/rmw_node.cpp
} else {
  // scan the writers to see if there is at least one response writer
  // matching a discovered request reader
  for (auto wrih : wrs) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eboasson nit: const auto & wrih
header.seq = request_header->sequence_number;
return rmw_send_response_request(&info->service, header, ros_response);
// Block until the response reader has been matched by the response writer (this is a
Based on the discussions we've had about this and the corresponding Fast-RTPS pull request, I'm good with it. I'll have another look, but the code changes also look good.
@eboasson I think Windows CI compilation issues stem from
Given that we're scheduled to release this week, I'd be more comfortable holding this PR for the first patch release to give us more time for testing. This means holding until the end of this week.
I think this covers all the remarks. It has a conflict in
@eboasson +1, feel free to rebase.
The client checks using rmw_service_server_is_available whether the request it sends will be delivered to service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers. Signed-off-by: Erik Boasson <[email protected]>
Signed-off-by: Erik Boasson <[email protected]>
Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer and use it for the request ids. This allows: * rmw_service_server_is_available to only return true once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service); and * rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA; this uses the keys "serviceid" and "clientid". This is still but a workaround for the lack of a mechanism in DDS to ensure that the response reader has been discovered by the request writer prior to sending the request. Signed-off-by: Erik Boasson <[email protected]>
Signed-off-by: Erik Boasson <[email protected]>
* Revert commit fb040c5 to retain the old wire representation; * Embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader; * Clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times.
Signed-off-by: Erik Boasson <[email protected]>
The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion. Signed-off-by: Erik Boasson <[email protected]>
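A rough model of that bounded wait, with illustrative names — the real implementation polls DDS discovery state inside rmw_send_response rather than evaluating a callback:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Possible outcomes of checking the client's endpoints, as seen by the service.
enum class match_state { matched, client_gone, unknown };

// Sketch of the bounded wait: keep checking until the client's response reader
// shows up, its request writer disappears, or roughly 100 ms elapse.
static bool wait_for_response_reader(
  const std::function<match_state()> & check,
  std::chrono::milliseconds timeout = std::chrono::milliseconds(100))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    switch (check()) {
      case match_state::matched: return true;       // safe to publish the response
      case match_state::client_gone: return false;  // abandon: requester vanished
      case match_state::unknown: break;             // discovery not complete yet
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // report a timeout rather than blocking indefinitely
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```

As the commit message notes, the timeout is not needed for correctness — discovery eventually resolves one way or the other — but it bounds the confusion when the network misbehaves.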
Once Windows CI is back, I think this is OK to go. |
Alright, going in. |
* Block rmw_send_response if response reader unknown

  The client checks using rmw_service_server_is_available whether the request it sends will be delivered to the service, but that does not imply that the (independent, as far as DDS is concerned) response reader of the client has been discovered by the service. Usually that will be the case, but there is no guarantee. Ideally DDS would offer an interface that allows checking the reverse discovery, but that does not yet exist in either the specification or in Cyclone. This commit works around that by delaying publishing the response until the number of request writers matches the number of response readers.

* Change request headers to use rmw_request_id_t on the wire

* Precise check for matched client/service

  Assign a unique identifier to each client/service on creation, add it to the USER_DATA QoS of the reader and writer, and use it for the request ids. This allows rmw_service_server_is_available to return true only once it has discovered a reader/writer pair of a single service (rather than a reader from some service and a writer from some service), and rmw_send_response to block until it has discovered the requesting client's response reader and to abandon the operation when the client has disappeared. The USER_DATA is formatted in the same manner as the participant USER_DATA, using the keys "serviceid" and "clientid". This is still just a workaround for the lack of a mechanism in DDS to ensure that the response reader has been discovered by the request writer prior to sending the request.

* Address review comments

* Backwards compatibility

  Revert commit fb040c5 to retain the old wire representation; embed the publication_handle of the request inside rmw_request_id_t, possible because reverting to the old wire representation frees up enough space, and use this in rmw_send_response to check for the presence of the client's reader; clients and services without a client/service id in the reader/writer user data are treated as fully matched at all times.

* Replace ERROR by FAILURE because of windows.h

* Timeout rmw_send_response after waiting 100ms for discovery

  The discovery will eventually result in the client's reader being known or its writer no longer being known, so a timeout is not necessary for correctness. However, if it ever were to block for a longish time (which is possible in the face of network failures), returning a timeout to the caller is expected to result in less confusion.

* Make iterators "const auto &"

* Add TODO for eliminating rmw_send_response blocking

Signed-off-by: Erik Boasson <[email protected]>
Co-authored-by: eboasson <[email protected]>
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/new-packages-for-foxy-fitzroy-2020-07-23/15570/2 |
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/new-packages-and-patch-release-for-ros-2-foxy-fitzroy-2020-08-07/15818/1 |
This PR addresses the service invocation problems by fixing one silly bug in the `rmw_service_server_is_available` code (not actually checking the number of matched endpoints), and by blocking in `rmw_send_response` until there is reasonable evidence that the response reader has been discovered.

The proper solution (as discussed in #74) makes `rmw_service_server_is_available` return false until this point has been reached, but as of today, neither the DDS specification nor Cyclone DDS provides the means to do that without the application exchanging information on what has been discovered. It can be done easily enough, but it is a rather significant burden for what is ultimately a rare problem.

Without the workaround (but with the bug fix) it rarely fails. With the workaround added, I have not been able to reproduce it anymore; I've only seen multiple waits in sequence by introducing significant packet loss.
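The bug fix mentioned above — requiring a complete endpoint pair from one service rather than any reader plus any writer — can be sketched as follows. The identifiers and data structures are illustrative, not the rmw_cyclonedds_cpp internals:

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch of the corrected rmw_service_server_is_available logic: instead of
// "some request reader exists and some response writer exists", require that a
// request reader and a response writer carrying the *same* serviceid (from the
// endpoints' USER_DATA) have both been discovered.
static bool service_available(
  const std::set<std::string> & request_reader_service_ids,
  const std::set<std::string> & response_writer_service_ids)
{
  for (const auto & id : request_reader_service_ids) {
    if (response_writer_service_ids.count(id) != 0) {
      return true;  // a complete reader/writer pair from a single service
    }
  }
  return false;
}
```

With the buggy version, a request reader from service A and a response writer from service B were enough to report availability, even though neither service was fully usable.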
I do expect this workaround to be somewhat controversial ...