Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timed-event lockup on pre-emptive ACKNACKs #1887

Merged

Conversation

eboasson
Copy link
Contributor

There was a case in which the ACKNACK generation event generating
pre-emptive ACKNACKs could continually get rescheduled for the same
time, causing the timed-event mechanism to spin generating those
pre-emptive ACKNACKS but without otherwise making progress. When certain
conditions change externally, it continues normal operation:

  • reception of a heartbeat from the writer to which it is sending
    pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs);

  • deletion of that proxy writer corresponding to that writer;

  • deletion of the reader;

The issue is that discovery matches the two and creates a struct
ddsi_rd_pwr_match, it sets the time stamp of creation. This time may
lay in the future of the time at which the ACKNACK event is initially
scheduled, because the creation time is taken from the clock directly,
whereas the initial schedule time is taken from the approximate time
stamp passed into ddsi_proxy_writer_add_connection.

So on executing the ACKNACK event handler, the current time passed into
the handler may lay before the creation time. The "time of last
ack" (t_last_ack) gets set to this time and the event gets rescheduled
for t_last_ack + new_intv, where new_intv = 0 if it precedes the
creation time.

The timed event handler caches the current time, updating it only when
it is known more than a trivial amount of time may have passed. Rightly
or wrongly, executing a timed-event doesn't cause the clock to be read
again, and so the first event in the queue is the event it just
executed, and it immediately executes it again, in exactly the same
state.

Among the consequences are the obvious one of not processing any events
scheduled for a later time, but also not sending out any retransmit
queued by the receive path as well as discovery messages queued for
transmission.

There are multiple contributing factors here:

  • It is daft to set the creation time to a later time than the initial
    schedule.

  • It is, arguably, unwise to cache the time so aggressively. This
    originates in the cost of reading the clock in older systems. This is
    presumably still the case in various embedded systems, so at least
    there is a decent argument for it.

But the root cause in my view is the fact that it can reschedule the
event for the same time.

All timestamps are in [0, INT64_MAX ns), so deleting the special case of a
current time that precedes the creation time does not introduce signed
integer overflow.

For times in [0,1s) it can potentially delay the transmission of the
initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it
will ever need to send one in that interval, and even more unlikely that
it affects anything, because the writer will be sending heartbeats as
well.

There was a case in which the ACKNACK generation event generating
pre-emptive ACKNACKs could continually get rescheduled for the same
time, causing the timed-event mechanism to spin generating those
pre-emptive ACKNACKS but without otherwise making progress. When certain
conditions change externally, it continues normal operation:

* reception of a heartbeat from the writer to which it is sending
  pre-emptive ACKNACKs (by virtue of switching to regular ACKNACKs);

* deletion of that proxy writer corresponding to that writer;

* deletion of the reader;

The issue is that discovery matches the two and creates a struct
ddsi_rd_pwr_match, it sets the time stamp of creation.  This time may
lay in the future of the time at which the ACKNACK event is initially
scheduled, because the creation time is taken from the clock directly,
whereas the initial schedule time is taken from the approximate time
stamp passed into ddsi_proxy_writer_add_connection.

So on executing the ACKNACK event handler, the current time passed into
the handler may lay before the creation time.  The "time of last
ack" (t_last_ack) gets set to this time and the event gets rescheduled
for t_last_ack + new_intv, where new_intv = 0 if it precedes the
creation time.

The timed event handler caches the current time, updating it only when
it is known more than a trivial amount of time may have passed.  Rightly
or wrongly, executing a timed-event doesn't cause the clock to be read
again, and so the first event in the queue is the event it just
executed, and it immediately executes it again, in exactly the same
state.

Among the consequences are the obvious one of not processing any events
scheduled for a later time, but also not sending out any retransmit
queued by the receive path as well as discovery messages queued for
transmission.

There are multiple contributing factors here:

* It is daft to set the creation time to a later time than the initial
  schedule.

* It is, arguably, unwise to cache the time so aggressively. This
  originates in the cost of reading the clock in older systems. This is
  presumably still the case in various embedded systems, so at least
  there is a decent argument for it.

But the root cause in my view is the fact that it can reschedule the
event for the same time.

All timestamps are in [0, INT64_MAX ns), so deleting the special case of a
current time that precedes the creation time does not introduce signed
integer overflow.

For times in [0,1s) it can potentially delay the transmission of the
initial pre-emptive ACKNACK by 1s. It is exceedingly unlikely that it
will ever need to send one in that interval, and even more unlikely that
it affects anything, because the writer will be sending heartbeats as
well.

Signed-off-by: Erik Boasson <[email protected]>
This avoids the strange situation where ACKNACK events can be scheduled
at times earlier than the purported creation time of the match object.

That strange situation is harmless if progress is guaranteed in the
scheduling of events, which is addressed in a separate commit.

Signed-off-by: Erik Boasson <[email protected]>
Copy link
Contributor

@dpotman dpotman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eboasson eboasson merged commit 0ed1837 into eclipse-cyclonedds:master Nov 27, 2023
20 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants