Fix failures of tcp UT #2709
We'd like to see this as a defaulted argument to the openProtocol call, because this code affects both UTs (where turning this option on is needed) and flight code (where this option might be a vulnerability, depending on the remote side of these TCP connections).
openProtocol(NATIVE_INT_TYPE& fd, bool reuse_address=false)
Since there is a potential for some plumbing through components, and changes to the UTs, please let me know if you'd like some assistance performing this work.
Thanks for your contributions!
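A minimal sketch of the defaulted-argument idea follows. It is illustrative only: `ExampleSocket`, `reuseRequested`, the placeholder descriptor, and the stand-in `SocketIpStatus` and `NATIVE_INT_TYPE` definitions are assumptions made so the snippet compiles on its own, not the actual F Prime classes.

```cpp
// Hypothetical stand-ins for the F Prime types named in the review.
enum SocketIpStatus { SOCK_SUCCESS };
using NATIVE_INT_TYPE = int;

class ExampleSocket {
  public:
    // reuse_address defaults to false, so existing (flight) callers keep the
    // stricter behavior and only UTs need to opt in explicitly.
    SocketIpStatus openProtocol(NATIVE_INT_TYPE& fd, bool reuse_address = false) {
        fd = 3;  // placeholder descriptor; a real implementation would call socket()
        m_reuse_requested = reuse_address;
        return SOCK_SUCCESS;
    }

    // Reports whether the last openProtocol call asked for address reuse.
    bool reuseRequested() const { return m_reuse_requested; }

  private:
    bool m_reuse_requested = false;
};
```

With this shape, a call such as `socket.openProtocol(fd)` compiles unchanged, while a UT would pass `true` as the second argument.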
@@ -57,12 +57,22 @@ SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd) { | |||
#if defined TGT_OS_TYPE_VXWORKS || TGT_OS_TYPE_DARWIN | |||
address.sin_len = static_cast<U8>(sizeof(struct sockaddr_in)); | |||
#endif | |||
if (reuse_address) |
Based on the comments, I didn't expect this code in the client, only in the server. If that is true, then perhaps it is only needed in the startup function and not in the open functions?
I just tried to run the unit tests with the 4 different configurations, and it seems that both sockets need the REUSEADDR option to avoid errors. You can double-check on your side using the procedure mentioned in issue #2706 on the TcpClient UT.
To be honest, I am not an expert in TCP sockets, and sadly I cannot explain the real reason behind this behavior.
I can attempt to explain why these errors occur. TCP has two endpoints (a server and a client). When the connection is terminated there are two termination possibilities:
- Uncoordinated termination (typically caused by calling close before calling shutdown)
- Coordinated termination (caused by calling shutdown and then close)
In the uncoordinated case, a port will end up in the TIME_WAIT state, where it waits for a specified timeout (typically 60s) so that any remote that was not properly informed of the shutdown can notice the connection is down before the port is allowed to be reused. Otherwise, the client might send data to a reused port, accidentally leaking data.
In the coordinated case, a message is sent across TCP informing the remote to shutdown and an acknowledgement is sent back. Then the port can be closed, and immediately reused.
These errors occur when the port cannot be reused immediately due to an uncoordinated shutdown, or an unacknowledged coordinated shutdown. Thus the port is unable to be reused, but the UTs expect it can.
Given the random nature of these errors, it is hard to reproduce with enough insight to determine which of these cases occur and exactly why. In an ideal world, the UTs would always conduct a proper shutdown, and thus not need the ability to "reuse" waiting ports. However, your proposed solution is deeply practical as it quickly resolves the issue without trying to reproduce a random bug.
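The two ideas discussed above (setting SO_REUSEADDR before binding, and calling shutdown before close) can be sketched with plain POSIX sockets. This is an illustration under stated assumptions, not the F Prime implementation: `openReusableListener` and `closeCoordinated` are hypothetical helper names, and the sketch binds to the loopback address only.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: open a listening TCP socket with SO_REUSEADDR set.
// Returns the descriptor, or -1 on any failure.
int openReusableListener(unsigned short port) {
    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    // The option must be set before bind() to have any effect on the bind.
    const int enable = 1;
    if (::setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) {
        (void)::close(fd);
        return -1;
    }
    sockaddr_in address;
    std::memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(port);  // port 0 lets the OS pick a free port
    if (::bind(fd, reinterpret_cast<sockaddr*>(&address), sizeof(address)) < 0 ||
        ::listen(fd, 1) < 0) {
        (void)::close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: shutdown first, then close.
// shutdown() can fail (for example ENOTCONN on an unconnected socket),
// so its result is deliberately ignored here; close() is still checked.
bool closeCoordinated(int fd) {
    (void)::shutdown(fd, SHUT_RDWR);
    return ::close(fd) == 0;
}
```

A UT could open the listener with port 0, run its checks, and then call `closeCoordinated` during teardown; whether that alone avoids the intermittent failures is exactly what the thread leaves open.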
@LeStarch Thanks for the explanations.
What would be the path forward for this MR?
Do you want me to investigate more on the proper shutdown of the sockets for the UT instead of having the reuse option?
I like the reuse option as a fallback. Especially given the time required to figure out the cause of the other.
Do you want any other changes in the current MR then?
In any case, I will investigate the proper shutdown of the UTs in the future and come back to you if I have any findings.
this->m_lock.lock(); | ||
this->m_started = true; | ||
this->m_lock.unLock(); | ||
return SOCK_SUCCESS; | ||
} | ||
|
||
SocketIpStatus IpSocket::open() { | ||
SocketIpStatus IpSocket::open(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Long function without assertion Note
@@ -41,7 +41,7 @@ | |||
|
|||
TcpClientSocket::TcpClientSocket() : IpSocket() {} | |||
|
|||
SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd) { | |||
SocketIpStatus TcpClientSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Long function without assertion Note
@@ -39,7 +40,7 @@ | |||
|
|||
TcpServerSocket::TcpServerSocket() : IpSocket(), m_base_fd(-1) {} | |||
|
|||
SocketIpStatus TcpServerSocket::startup() { | |||
SocketIpStatus TcpServerSocket::startup(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Long function without assertion Note
@@ -90,7 +102,10 @@ | |||
this->IpSocket::shutdown(); | |||
} | |||
|
|||
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd) { | |||
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Long function without assertion Note
@@ -97,7 +97,10 @@ | |||
return SOCK_SUCCESS; | |||
} | |||
|
|||
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) { | |||
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Long function without assertion Note
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0. | ||
const int enable = 1; | ||
if (setsockopt(socketFd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) { | ||
::close(socketFd); |
Check warning
Code scanning / CodeQL
Unchecked return value Warning
close
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0. | ||
const int enable = 1; | ||
if (setsockopt(serverFd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) { | ||
::close(serverFd); |
Check warning
Code scanning / CodeQL
Unchecked return value Warning
close
SocketIpStatus SocketReadTask::open() { | ||
SocketIpStatus status = this->getSocketHandler().open(); | ||
SocketIpStatus SocketReadTask::open(const bool reuse_address) { | ||
SocketIpStatus status = this->getSocketHandler().open(reuse_address); |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
SocketIpStatus SocketReadTask::startup() { | ||
return this->getSocketHandler().startup(); | ||
SocketIpStatus SocketReadTask::startup(const bool reuse_address) { | ||
return this->getSocketHandler().startup(reuse_address); |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
const Os::Task::ParamType priority, | ||
const Os::Task::ParamType stack, | ||
const Os::Task::ParamType cpuAffinity) { | ||
FW_ASSERT(not m_task.isStarted()); // It is a coding error to start this task multiple times | ||
FW_ASSERT(not this->m_stop); // It is a coding error to stop the thread before it is started | ||
m_reconnect = reconnect; | ||
m_reuse_address = reuse_address; |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) { | ||
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { | ||
// reuse_address is not applicable for the UDP socket | ||
(void)(reuse_address); |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd) { | ||
SocketIpStatus TcpServerSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { | ||
// reuse_address is not applicable for the TCP server socket | ||
(void)(reuse_address); |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
NATIVE_INT_TYPE fd = -1; | ||
SocketIpStatus status = SOCK_SUCCESS; | ||
FW_ASSERT(m_fd == -1 and not m_open); // Ensure we are not opening an opened socket | ||
// Open a TCP socket for incoming commands, and outgoing data if not using UDP | ||
status = this->openProtocol(fd); | ||
status = this->openProtocol(fd, reuse_address); |
Check warning
Code scanning / CodeQL
Unchecked function argument Warning
this->m_lock.lock(); | ||
this->m_started = true; | ||
this->m_lock.unLock(); | ||
return SOCK_SUCCESS; | ||
} | ||
|
||
SocketIpStatus IpSocket::open() { | ||
SocketIpStatus IpSocket::open(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
@@ -132,19 +132,19 @@ | |||
this->m_lock.unLock(); | |||
} | |||
|
|||
SocketIpStatus IpSocket::startup() { | |||
SocketIpStatus IpSocket::startup(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
@@ -170,6 +178,7 @@ | |||
|
|||
Os::Task m_task; | |||
bool m_reconnect; //!< Force reconnection | |||
bool m_reuse_address; //!< Set REUSEADDR option, only for startSocketTask and readTask |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
} | ||
|
||
SocketIpStatus SocketReadTask::open() { | ||
SocketIpStatus status = this->getSocketHandler().open(); | ||
SocketIpStatus SocketReadTask::open(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
// Note: the first step is for the IP socket to open the port | ||
Os::Task::TaskStatus stat = m_task.start(name, SocketReadTask::readTask, this, priority, stack, cpuAffinity); | ||
FW_ASSERT(Os::Task::TASK_OK == stat, static_cast<NATIVE_INT_TYPE>(stat)); | ||
} | ||
|
||
SocketIpStatus SocketReadTask::startup() { | ||
return this->getSocketHandler().startup(); | ||
SocketIpStatus SocketReadTask::startup(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
@@ -39,7 +40,7 @@ | |||
|
|||
TcpServerSocket::TcpServerSocket() : IpSocket(), m_base_fd(-1) {} | |||
|
|||
SocketIpStatus TcpServerSocket::startup() { | |||
SocketIpStatus TcpServerSocket::startup(const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
@@ -97,7 +97,10 @@ | |||
return SOCK_SUCCESS; | |||
} | |||
|
|||
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd) { | |||
SocketIpStatus UdpSocket::openProtocol(NATIVE_INT_TYPE& fd, const bool reuse_address) { |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
{ | ||
// Enable Address reuse to avoid binding to the socket that still might be in TIME_WAIT state. | ||
// Can happen if a function like get_free_port() is called to determine an available port by binding to port 0. | ||
const int enable = 1; |
Check notice
Code scanning / CodeQL
Use of basic integral type Note
I studied up on TCP and this is what I was able to deduce:
I think that means the following should be done to the UTs:
Since the error originates from the find port helper, it might be sufficient to just apply number 2 to the port bound there. This work won't hurt, but may be masking the underlying issue. Thoughts? Thanks for your patience while I study up! |
Thanks for the details. I just gave it a try. I set the
I pushed the changes so that you can give it a try on your side. I agree with your analysis, but I don't understand why we're having such issues even with I think that we should go ahead with the |
The issue I see is that setting |
I have exactly the same error messages on my side with the timeout of 30 |
@JohanBertrand can you remind me again what your testing environment is? Is it a docker container? Any special networking? Host machine type? I have yet to be able to reproduce the issue to have the failures appear reliably....but I want to get something to reproduce it. |
@JohanBertrand As a reminder, this is an open forum. So keep the information succinct.
|
I am currently running an I don't have any special network settings on the docker Usually, if you run the test about 20 times, you should get at least one test failing |
First of all, thank you. That was key-information and I can now reliably reproduce the problem. I had a thought on how to fix the issue altogether. In short:
I'll take a crack at this tomorrow. In the meantime, we should determine if we want to keep reuse around (assuming this pans out). |
@JohanBertrand I think I finally figured out some answers!
Why doesn't the original code work? Why do the things that "should" fix the problem not fix the problem? Why does logic seem to fail when discussing these issues? Why did port-selector need to recurse to avoid "root only" ports in a non-root program?
Here is the answer: notice the asymmetry in fprime/Drv/Ip/test/ut/PortSelector.cpp, Line 23 in 0fe467c, when compared against fprime/Drv/Ip/test/ut/PortSelector.cpp, Line 41 in 0fe467c.
There is no network byte swap when reading back the port. This means that we might select a "good" port, and then byte-swap it into a bad port, root port, reserved port, etc.
I have addressed this, and added the capability for TcpServer and UdpRecv to use "0" (OS chooses a port) in this PR: #2739. Would you mind reviewing? |
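The byte-order point above can be illustrated with a small sketch of a "bind to port 0" helper that converts the port back to host order with ntohs when reading it. This is not the code from PortSelector.cpp or from #2739; `getFreePort` is a hypothetical name, and the sketch assumes a POSIX system and the loopback interface.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: ask the OS for a free TCP port by binding to port 0.
// Returns the port in host byte order, or 0 on failure.
std::uint16_t getFreePort() {
    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return 0;
    }
    sockaddr_in address;
    std::memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(0);  // host-to-network when writing the port

    std::uint16_t port = 0;
    socklen_t size = sizeof(address);
    if (::bind(fd, reinterpret_cast<sockaddr*>(&address), sizeof(address)) == 0 &&
        ::getsockname(fd, reinterpret_cast<sockaddr*>(&address), &size) == 0) {
        // network-to-host when reading the port back; skipping this swap on a
        // little-endian host yields a different port than the one the OS chose
        port = ntohs(address.sin_port);
    }
    (void)::close(fd);
    return port;
}
```

Note that the port is released when the helper closes its socket, so another process could claim it before the test binds to it; letting the server itself bind to port 0, as described for #2739, avoids that window.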
Awesome! Thank you for looking into that and for providing those details. I'm going to look into the fix in the PR. FYI, we might still need to have a recursive call to the |
@JohanBertrand since we found the root cause, I am going to close this PR. If you feel there is still need for reconnect behavior, we can track that as a dedicated issue. |
Change Description
Fix an issue where the TCP client/server would fail to connect to a socket, generating an error and causing the UT to fail.
Partially solves #2706, as this does not solve the warnings/silent failures.
Rationale
The TCP unit tests should pass in a reliable manner.
This also can affect the flight code if the "bind to 0" strategy is used to select a port.