Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failures of tcp UT #2709

Closed
wants to merge 10 commits into from
Closed

Conversation

JohanBertrand
Copy link
Contributor

@JohanBertrand JohanBertrand commented Apr 29, 2024

Related Issue(s) #2706
Has Unit Tests (y/n) Y
Documentation Included (y/n) Y

Change Description

Fix issue where the TCP client/server would fail to connect to a socket, generating an error and making the UT to fail.

Partially solves #2706, as this does not solve the warnings/silent failures

Rationale

The TCP unit tests should pass in a reliable manner.
This also can affect the flight code if the "bind to 0" strategy is used to select a port.

Drv/Ip/TcpClientSocket.cpp Fixed Show fixed Hide fixed
Drv/Ip/TcpServerSocket.cpp Fixed Show fixed Hide fixed
Copy link
Collaborator

@LeStarch LeStarch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to see this as a defaulted argument to the openProtocol call because this code affects UTs (where turning this option on is needed) and flight code where this option might be a vulnerability depending on the remote side of these TCP connections.

openProtocol(NATIVE_INT_TYPE& fd, bool reuse_address=false)

Since there is a potential for some plumbing through components, and changes to the UTs, please let me know if you'd like some assistance performing this work.

Thanks for you contributions!

@@ -57,12 +57,22 @@ SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd) {
#if defined TGT_OS_TYPE_VXWORKS || TGT_OS_TYPE_DARWIN
address.sin_len = static_cast<U8>(sizeof(struct sockaddr_in));
#endif
if (reuse_address)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comments, I didn't expect this code in the client....only in the server. If that is true, then perhaps it is only needed in the startup function and not the open functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried to run the unit tests with the 4 different configurations, and it seems like both sockets are needing the REUSEADDR option to avoid errors. You can double check from your side using the procedure mentioned in the issue #2706 on the TcpClient UT.

To be honest, I am not an expert in TCP sockets, and I can not really explain the real reason behind this behavior, sadly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can attempt to explain why these errors occur. TCP has two endpoints (a sever and a client). When the connection is terminated there are two termination possibilities:

  1. Uncoordinated termination (typically caused by calling close before calling shutdown)
  2. Coordinated termination (caused by calling shutdown and then close)

In the uncoordinated case, a port will end-up in TIME_WAIT* state, where it waits for a specified timeout (typically 60s) to let any remote that was not properly informed to shutdown notice the connection is down before allowing the port to be reused. Otherwise, the client might send data to a reused port accidentally leaking data.

In the coordinated case, a message is sent across TCP informing the remote to shutdown and an acknowledgement is sent back. Then the port can be closed, and immediately reused.

These errors occur when the port cannot be reused immediately due to an uncoordinated shutdown, or an unacknowledged coordinated shutdown. Thus the port is unable to be reused, but the UTs expect it can.

Given the random nature of these errors, it is hard to reproduce with enough insight to determine which of these cases occur and exactly why. In an ideal world, the UTs would conduct a proper shutdown always, and thus not need the ability to "reuse" waiting ports. However, your proposed solution is deeply practical as it quickly resolves the issue without trying to reproduce a random bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LeStarch Thanks for the explanations.

What would be the path forward for this MR?
Do you want me to investigate more on the proper shutdown of the sockets for the UT instead of having the reuse option?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the reuse option as a fallback. Especially given the time required to figure out the cause of the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want any other changes in the current MR then?

I will anyways investigate on the proper shutdown of the UT in the future, and come back to you if I have any findings.

this->m_lock.lock();
this->m_started = true;
this->m_lock.unLock();
return SOCK_SUCCESS;
}

SocketIpStatus IpSocket::open() {
SocketIpStatus IpSocket::open(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Long function without assertion Note

All functions of more than 10 lines should have at least one assertion.
@@ -41,7 +41,7 @@

TcpClientSocket::TcpClientSocket() : IpSocket() {}

SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {

Check notice

Code scanning / CodeQL

Long function without assertion Note

All functions of more than 10 lines should have at least one assertion.
@@ -39,7 +40,7 @@

TcpServerSocket::TcpServerSocket() : IpSocket(), m_base_fd(-1) {}

SocketIpStatus TcpServerSocket::startup() {
SocketIpStatus TcpServerSocket::startup(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Long function without assertion Note

All functions of more than 10 lines should have at least one assertion.
@@ -90,7 +102,10 @@
this->IpSocket::shutdown();
}

SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {

Check notice

Code scanning / CodeQL

Long function without assertion Note

All functions of more than 10 lines should have at least one assertion.
@@ -97,7 +97,10 @@
return SOCK_SUCCESS;
}

SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {

Check notice

Code scanning / CodeQL

Long function without assertion Note

All functions of more than 10 lines should have at least one assertion.
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0.
const int enable = 1;
if (setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) {
::close(socketFd);

Check warning

Code scanning / CodeQL

Unchecked return value Warning

The return value of non-void function
close
is not checked.
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0.
const int enable = 1;
if (setsockopt(serverFd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) {
::close(serverFd);

Check warning

Code scanning / CodeQL

Unchecked return value Warning

The return value of non-void function
close
is not checked.
SocketIpStatus SocketReadTask::open() {
SocketIpStatus status = this->getSocketHandler().open();
SocketIpStatus SocketReadTask::open(const bool reuse_address) {
SocketIpStatus status = this->getSocketHandler().open(reuse_address);

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
SocketIpStatus SocketReadTask::startup() {
return this->getSocketHandler().startup();
SocketIpStatus SocketReadTask::startup(const bool reuse_address) {
return this->getSocketHandler().startup(reuse_address);

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
const Os::Task::ParamType priority,
const Os::Task::ParamType stack,
const Os::Task::ParamType cpuAffinity) {
FW_ASSERT(not m_task.isStarted()); // It is a coding error to start this task multiple times
FW_ASSERT(not this->m_stop); // It is a coding error to stop the thread before it is started
m_reconnect = reconnect;
m_reuse_address = reuse_address;

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {
// reuse_address is not applicable for the UDP socket
(void)(reuse_address);

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {
// reuse_address is not applicable for the TCP server socket
(void)(reuse_address);

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
NATIVE_INT_TYPE fd = -1;
SocketIpStatus status = SOCK_SUCCESS;
FW_ASSERT(m_fd == -1 and not m_open); // Ensure we are not opening an opened socket
// Open a TCP socket for incoming commands, and outgoing data if not using UDP
status = this->openProtocol(fd);
status = this->openProtocol(fd, reuse_address);

Check warning

Code scanning / CodeQL

Unchecked function argument Warning

This use of parameter reuse_address has not been checked.
this->m_lock.lock();
this->m_started = true;
this->m_lock.unLock();
return SOCK_SUCCESS;
}

SocketIpStatus IpSocket::open() {
SocketIpStatus IpSocket::open(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
@@ -132,19 +132,19 @@
this->m_lock.unLock();
}

SocketIpStatus IpSocket::startup() {
SocketIpStatus IpSocket::startup(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
@@ -170,6 +178,7 @@

Os::Task m_task;
bool m_reconnect; //!< Force reconnection
bool m_reuse_address; //!< Set REUSEADDR option, only for startSocketTask and readTask

Check notice

Code scanning / CodeQL

Use of basic integral type Note

m_reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
}

SocketIpStatus SocketReadTask::open() {
SocketIpStatus status = this->getSocketHandler().open();
SocketIpStatus SocketReadTask::open(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
// Note: the first step is for the IP socket to open the port
Os::Task::TaskStatus stat = m_task.start(name, SocketReadTask::readTask, this, priority, stack, cpuAffinity);
FW_ASSERT(Os::Task::TASK_OK == stat, static_cast<NATIVE_INT_TYPE>(stat));
}

SocketIpStatus SocketReadTask::startup() {
return this->getSocketHandler().startup();
SocketIpStatus SocketReadTask::startup(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
@@ -39,7 +40,7 @@

TcpServerSocket::TcpServerSocket() : IpSocket(), m_base_fd(-1) {}

SocketIpStatus TcpServerSocket::startup() {
SocketIpStatus TcpServerSocket::startup(const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
@@ -97,7 +97,10 @@
return SOCK_SUCCESS;
}

SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

fd uses the basic integral type int rather than a typedef with size and signedness.
@@ -97,7 +97,10 @@
return SOCK_SUCCESS;
}

SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) {
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) {

Check notice

Code scanning / CodeQL

Use of basic integral type Note

reuse_address uses the basic integral type bool rather than a typedef with size and signedness.
{
// Enable Address reuse to avoid binding to the socket that still might be in TIME_WAIT state.
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0.
const int enable = 1;

Check notice

Code scanning / CodeQL

Use of basic integral type Note

enable uses the basic integral type int rather than a typedef with size and signedness.
{
// Enable Address reuse to avoid binding to the socket that still might be in TIME_WAIT state.
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0.
const int enable = 1;

Check notice

Code scanning / CodeQL

Use of basic integral type Note

enable uses the basic integral type int rather than a typedef with size and signedness.
Drv/Ip/UdpSocket.hpp Outdated Show resolved Hide resolved
@LeStarch
Copy link
Collaborator

I studied up on TCP and this is what I was able to deduce:

  1. TIME_WAITs happen on the initiator of the close/shutdown call. Thus clients should always initiate the shutdown
  2. close and/or shutdown is asynchronous unless the SO_LINGER option is set.

I think that means the following should be done to the UTs:

  1. Ensure the client always closes first and
  2. Make sure to use SO_LINGER on UT sockets. This will ensure that things are closed fully before reuse.

Since the error originates from the find port helper, it might be sufficient to just apply number 2 to the port bound there.

This work won't hurt, but may be masking the underlying issue. Thoughts?

Thanks for your patience while I study up!

@JohanBertrand
Copy link
Contributor Author

Thanks for the details. I just gave it a try. I set the reuse_address parameters to false (their default values), and set SO_LINGER on the port helper function, but I still get the same errors:

/home/fprime/Drv/TcpClient/test/ut/TcpClientTester.cpp:40: Failure
Expected equality of these values:
  serverStat
    Which is: -9
  SOCK_SUCCESS
    Which is: 0
TCP server startup error: Address already in use

I pushed the changes so that you can give it a try on your side. I agree with your analysis, but I don't understand why we're having such issues even with SO_LINGER activated and set to 0.

I think that we should go ahead with the reuse_address option, either as a function parameter, or as something fixed in the code.
I'm not sure if there is a real drawback of having it activated by default.

@LeStarch
Copy link
Collaborator

The issue I see is that setting SO_LINGER to 0 has almost no effect. This is because that value is the timeout, and a timeout of zero will exit immediately yielding the same problem. I have pushed a fix that sets it to 30.

@JohanBertrand
Copy link
Contributor Author

I have exactly the same error messages on my side with the timeout of 30

@LeStarch
Copy link
Collaborator

@JohanBertrand can you remind me again what your testing environment is? Is it a docker container? Any special networking? Host machine type?

I have yet to be able to reproduce the issue to have the failures appear reliably....but I want to get something to reproduce it.

@LeStarch
Copy link
Collaborator

@JohanBertrand As a reminder, this is an open forum. So keep the information succinct.

  1. native, VM, docker
  2. Host OS (mac/darwin/win), Guest OS (linux/other)
  3. Any special networking config

@JohanBertrand
Copy link
Contributor Author

I am currently running an ubuntu:20.04 docker instance on WSL2, with the script mentioned in #2706

I don't have any special network settings on the docker

Usually, if you run the test about 20 times, you should get at least one test failing

@LeStarch
Copy link
Collaborator

First of all, thank you. That was key-information and I can now reliably reproduce the problem. I had a thought on how to fix the issue altogether.

In short:

  1. Allow TcpServer to accept port "0" (choose a port)
  2. Allow TcpServer to return the port hosted at
  3. Update the test to skip port-selector altogether, and just choose any port and reuse that in the clients.

I'll take a crack at this tomorrow. In the meantime, we should determine if we want to keep reuse around (assuming this pans out).

@LeStarch
Copy link
Collaborator

@JohanBertrand I think I finally figured out some answers! Why doesn't the original code work? Why do the things that "should" fix the problem not fix the problem? Why does logic seem to fail when discussing theses issues? Why did port-selector need to recurse to avoid "root only" ports in a non-root program?

Here is the answer: notice the asymmetry

address.sin_port = htons(0);

when compared against:

U16 port = address.sin_port;

There is no network byte swap when reading back the port. This means that we might select a "good" port, and then byte-swap it into a bad port, root port, reserved port, etc.

  1. Why doesn't the original code work? The detected good port is corrupted before use, loosing its guarantee of availability.
  2. Why do things that should fix the issue fail? We were using SO_LINGER and other mechanisms on a port that was the pre-corrruption.
  3. Why does logic fail? Because logic was based on a bad understanding of which port was really used.
  4. Why could we get root only ports? Because post-swap could enter reserved space

I have addressed this, and added the capability for TcpServer and UdpRecv to use "0" (Os chooses a port) in this PR: #2739.

Would you mind reviewing?

@JohanBertrand
Copy link
Contributor Author

Awesome! Thanks you for looking into that and for providing those details.

I'm going to look into the fix in the PR.

FYI, we might still need to have a recursive call to the get_free_port function to be able to find two ports which are different for the UDP test (See the changes in this PR)

@LeStarch
Copy link
Collaborator

@JohanBertrand since we found the root cause, I am going to close this PR. If you feel there is still need for reconnect behavior, we can track that as a dedicated issue.

@LeStarch LeStarch closed this May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants