SecondaryTcpServer hangs when shutting down #1803

WeekiatAngMotional · 2022-02-03T07:52:39Z

Hi,

We are running aktualizr as part of a systemd service, and when a secondary ECU is tasked to shut down, the network.service service is first brought down, followed by the aktualizr service. When the aktualizr process receives the terminate signal, we call SecondaryTcpServer's stop() which starts a local connection to unblock the accept call in the run() function:

void SecondaryTcpServer::stop() {
  LOG_DEBUG << "Stopping Secondary TCP server...";
  keep_running_.store(false);
  // unblock accept
  ConnectionSocket("localhost", listen_socket_.port()).connect();
}

Due to the network service being down, the accept() call in the SecondaryTcpServer::run() function never gets unblocked, which causes systemd to forcefully terminate the process after a long wait time:

systemd[1]: Stopped Network Name Resolution.
systemd[1]: network.service: Deactivated successfully.
systemd[1]: Stopped Network Connectivity.
systemd[1]: aktualizr.service: State 'final-sigterm' timed out. Killing.
systemd[1]: aktualizr.service: Killing process 713 (aktualizr-agen) wit.
systemd[1]: aktualizr.service: Failed with result 'timeout'.
systemd[1]: aktualizr.service: Unit process 713 (aktualizr-agen) remain.
systemd[1]: Stopped aktualizr pipeline.
systemd[1]: Stopped target Basic System.
systemd[1]: Stopped target Path Units.

A fix was done on our end via changing the server socket to be non-blocking such that we poll accept() calls instead, but we would like to mainly know if this is an issue from our ~~issue~~ usage of aktualizr, if we are shutting down services in a different order than aktualizr's guidelines, or if there are working tests run on machines where the network is brought down first.

Here's a successful shut down for reference:

systemd[1]: Stopped Network Name Resolution.
systemd[1]: network.service: Deactivated successfully.
systemd[1]: Stopped Network Connectivity.
systemd[1]: aktualizr.service: Deactivated successfully.
systemd[1]: Stopped aktualizr pipeline.
systemd[1]: Stopped target Basic System.
systemd[1]: Stopped target Path Units.

Thank you!

The text was updated successfully, but these errors were encountered:

pattivacek · 2022-02-04T14:02:05Z

Due to the network service being down, the accept() call in the SecondaryTcpServer::run() function never gets unblocked, which causes systemd to forcefully terminate the process after a long wait time:

That sounds annoying indeed.

A fix was done on our end via changing the server socket to be non-blocking such that we poll accept() calls instead

That sounds great! Do you have a retry loop or some other mechanism to keep checking for a connection?

we would like to mainly know if this is an issue from our issue of aktualizr

No, I don't think you're doing anything particularly strange there. I think you're just doing something that hadn't been thought of before.

if we are shutting down services in a different order than aktualizr's guidelines

There are no guidelines, for better or worse!

if there are working tests run on machines where the network is brought down first.

Not as far as I know, but I could be surprised.

How would you feel about making a PR with your changes over at https://github.com/uptane/aktualizr?

I think @cajun-rat and @mike-sul wrote most of the original implementation here, so either of them might have some thoughts, too.

cajun-rat · 2022-02-06T12:37:51Z

I agree. This appears to be a bug in the implementation of SecondaryTcpServer::stop(), which attempts to cancel the accept with:

  // unblock accept
  ConnectionSocket("localhost", listen_socket_.port()).connect();

Stackoverflow suggests that the solution is to call shutdown on the socket instead, but I haven't looked into it in detail.

As @pattivacek said, a PR to fix it on https://github.com/uptane/aktualizr would be welcome!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SecondaryTcpServer hangs when shutting down #1803

SecondaryTcpServer hangs when shutting down #1803

WeekiatAngMotional commented Feb 3, 2022 •

edited

Loading

pattivacek commented Feb 4, 2022

cajun-rat commented Feb 6, 2022

SecondaryTcpServer hangs when shutting down #1803

SecondaryTcpServer hangs when shutting down #1803

Comments

WeekiatAngMotional commented Feb 3, 2022 • edited Loading

pattivacek commented Feb 4, 2022

cajun-rat commented Feb 6, 2022

WeekiatAngMotional commented Feb 3, 2022 •

edited

Loading