Fix timed-event lockup on pre-emptive ACKNACKs #1887
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a case in which the ACKNACK generation event generating
pre-emptive ACKNACKs could continually get rescheduled for the same
time, causing the timed-event mechanism to spin generating those
pre-emptive ACKNACKS but without otherwise making progress. When certain
conditions change externally, it continues normal operation:
reception of a heartbeat from the writer to which it is sending
pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs);
deletion of that proxy writer corresponding to that writer;
deletion of the reader;
The issue is that discovery matches the two and creates a struct
ddsi_rd_pwr_match, it sets the time stamp of creation. This time may
lay in the future of the time at which the ACKNACK event is initially
scheduled, because the creation time is taken from the clock directly,
whereas the initial schedule time is taken from the approximate time
stamp passed into ddsi_proxy_writer_add_connection.
So on executing the ACKNACK event handler, the current time passed into
the handler may lay before the creation time. The "time of last
ack" (t_last_ack) gets set to this time and the event gets rescheduled
for t_last_ack + new_intv, where new_intv = 0 if it precedes the
creation time.
The timed event handler caches the current time, updating it only when
it is known more than a trivial amount of time may have passed. Rightly
or wrongly, executing a timed-event doesn't cause the clock to be read
again, and so the first event in the queue is the event it just
executed, and it immediately executes it again, in exactly the same
state.
Among the consequences are the obvious one of not processing any events
scheduled for a later time, but also not sending out any retransmit
queued by the receive path as well as discovery messages queued for
transmission.
There are multiple contributing factors here:
It is daft to set the creation time to a later time than the initial
schedule.
It is, arguably, unwise to cache the time so aggressively. This
originates in the cost of reading the clock in older systems. This is
presumably still the case in various embedded systems, so at least
there is a decent argument for it.
But the root cause in my view is the fact that it can reschedule the
event for the same time.
All timestamps are in [0, INT64_MAX ns), so deleting the special case of a
current time that precedes the creation time does not introduce signed
integer overflow.
For times in [0,1s) it can potentially delay the transmission of the
initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it
will ever need to send one in that interval, and even more unlikely that
it affects anything, because the writer will be sending heartbeats as
well.