Fix timed-event lockup on pre-emptive ACKNACKs #1887

eboasson · 2023-11-24T11:42:12Z

There was a case in which the ACKNACK generation event generating
pre-emptive ACKNACKs could continually get rescheduled for the same
time, causing the timed-event mechanism to spin generating those
pre-emptive ACKNACKS but without otherwise making progress. When certain
conditions change externally, it continues normal operation:

reception of a heartbeat from the writer to which it is sending
pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs);
deletion of that proxy writer corresponding to that writer;
deletion of the reader;

The issue is that discovery matches the two and creates a struct
ddsi_rd_pwr_match, it sets the time stamp of creation. This time may
lay in the future of the time at which the ACKNACK event is initially
scheduled, because the creation time is taken from the clock directly,
whereas the initial schedule time is taken from the approximate time
stamp passed into ddsi_proxy_writer_add_connection.

So on executing the ACKNACK event handler, the current time passed into
the handler may lay before the creation time. The "time of last
ack" (t_last_ack) gets set to this time and the event gets rescheduled
for t_last_ack + new_intv, where new_intv = 0 if it precedes the
creation time.

The timed event handler caches the current time, updating it only when
it is known more than a trivial amount of time may have passed. Rightly
or wrongly, executing a timed-event doesn't cause the clock to be read
again, and so the first event in the queue is the event it just
executed, and it immediately executes it again, in exactly the same
state.

Among the consequences are the obvious one of not processing any events
scheduled for a later time, but also not sending out any retransmit
queued by the receive path as well as discovery messages queued for
transmission.

There are multiple contributing factors here:

It is daft to set the creation time to a later time than the initial
schedule.
It is, arguably, unwise to cache the time so aggressively. This
originates in the cost of reading the clock in older systems. This is
presumably still the case in various embedded systems, so at least
there is a decent argument for it.

But the root cause in my view is the fact that it can reschedule the
event for the same time.

All timestamps are in [0, INT64_MAX ns), so deleting the special case of a
current time that precedes the creation time does not introduce signed
integer overflow.

For times in [0,1s) it can potentially delay the transmission of the
initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it
will ever need to send one in that interval, and even more unlikely that
it affects anything, because the writer will be sending heartbeats as
well.

There was a case in which the ACKNACK generation event generating pre-emptive ACKNACKs could continually get rescheduled for the same time, causing the timed-event mechanism to spin generating those pre-emptive ACKNACKS but without otherwise making progress. When certain conditions change externally, it continues normal operation: * reception of a heartbeat from the writer to which it is sending pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs); * deletion of that proxy writer corresponding to that writer; * deletion of the reader; The issue is that discovery matches the two and creates a struct ddsi_rd_pwr_match, it sets the time stamp of creation. This time may lay in the future of the time at which the ACKNACK event is initially scheduled, because the creation time is taken from the clock directly, whereas the initial schedule time is taken from the approximate time stamp passed into ddsi_proxy_writer_add_connection. So on executing the ACKNACK event handler, the current time passed into the handler may lay before the creation time. The "time of last ack" (t_last_ack) gets set to this time and the event gets rescheduled for t_last_ack + new_intv, where new_intv = 0 if it precedes the creation time. The timed event handler caches the current time, updating it only when it is known more than a trivial amount of time may have passed. Rightly or wrongly, executing a timed-event doesn't cause the clock to be read again, and so the first event in the queue is the event it just executed, and it immediately executes it again, in exactly the same state. Among the consequences are the obvious one of not processing any events scheduled for a later time, but also not sending out any retransmit queued by the receive path as well as discovery messages queued for transmission. There are multiple contributing factors here: * It is daft to set the creation time to a later time than the initial schedule. * It is, arguably, unwise to cache the time so aggressively. This originates in the cost of reading the clock in older systems. This is presumably still the case in various embedded systems, so at least there is a decent argument for it. But the root cause in my view is the fact that it can reschedule the event for the same time. All timestamps are in [0, INT64_MAX ns), so deleting the special case of a current time that precedes the creation time does not introduce signed integer overflow. For times in [0,1s) it can potentially delay the transmission of the initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it will ever need to send one in that interval, and even more unlikely that it affects anything, because the writer will be sending heartbeats as well. Signed-off-by: Erik Boasson <[email protected]>

This avoids the strange situation where ACKNACK events can be scheduled at times earlier than the purported creation time of the match object. That strange situation is harmless if progress is guaranteed in the scheduling of events, which is addressed in a separate commit. Signed-off-by: Erik Boasson <[email protected]>

dpotman

LGTM

eboasson added 2 commits November 24, 2023 10:29

dpotman approved these changes Nov 27, 2023

View reviewed changes

eboasson merged commit 0ed1837 into eclipse-cyclonedds:master Nov 27, 2023
20 of 22 checks passed

eboasson mentioned this pull request Dec 8, 2023

Fix global buffer overflow in IDLC #1900

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix timed-event lockup on pre-emptive ACKNACKs #1887

Fix timed-event lockup on pre-emptive ACKNACKs #1887

eboasson commented Nov 24, 2023

dpotman left a comment

Fix timed-event lockup on pre-emptive ACKNACKs #1887

Fix timed-event lockup on pre-emptive ACKNACKs #1887

Conversation

eboasson commented Nov 24, 2023

dpotman left a comment

Choose a reason for hiding this comment