Random outbound connection timeouts based on server load #587

Open
mavci opened this issue Jun 23, 2022 · 16 comments

@mavci

mavci commented Jun 23, 2022

Hello,

We've been experiencing random outbound connection timeouts based on server load for a very long time. After restarting the server, the problems go away, but after a while the timeouts start again. After some research I found these issues related to this topic:

docker/for-win#8861
docker/for-mac#3448
docker/for-mac#6086
docker/for-win#12671
docker/for-win#12761

I didn't include the technical logs here because those issues already contain the relevant logs and results that I have.
Is anyone else having similar issues? And how can we fix it? Any suggestions?

Thank you.

@rossinineto

I reported this in https://github.com/docker/for-win/issues/8861

@Spenhouet

Spenhouet commented Jun 28, 2022

@djs55 Contrary to the title of this issue, this is not actually random at all; it is very reproducible and has nothing to do with server load. Because of this, our application no longer works with any version above Docker Desktop 4.5.0 on Mac and 4.5.1 on Windows. We now have to force all customers to downgrade. This is rather serious for us. We will share a reproduction soon.

EDIT: We might open a new issue, since this one is so unspecific and strays from the actual problem.

@jdeitrick80

jdeitrick80 commented Sep 5, 2022

I have also seen this issue in versions >4.5.1, including the latest version, but I have found that it can also be triggered with low amounts of traffic. The following is how I have been able to reproduce the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@17c8b33e70e6:/# pip install --quiet requests
root@17c8b33e70e6:/# cd test/
root@17c8b33e70e6:/test# python sessions.py
10:52:18: request 1
Request complete, sleep 30
10:52:50: request 2
Request complete, sleep 120
10:54:51: request 3
Request complete, sleep 420
11:01:51: request 4
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

sessions.py

import requests
from datetime import datetime
import time

# Reuse a single Session so the underlying TCP connection is kept alive
# and reused across requests.
s = requests.Session()

# Sleep durations (seconds) between requests; the timeout shows up on the
# request made after the long 420s idle period.
steps = [30, 120, 420, 420]
step = 1
for i in steps:
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    r3 = s.get('https://wttr.in')
    print("Request complete, sleep " + str(i))
    step += 1
    time.sleep(i)

As has been mentioned before, if I look at a trace from the container's point of view, I only see TCP SYNs being sent out during the 4th attempt, after waiting 420s since the last request. Also, if I kill vpnkit while it is still trying the 4th attempt, then when vpnkit starts back up the 4th request is able to complete successfully.
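
One way to reproduce that container-side capture (an assumption on tooling, not necessarily how the trace above was taken) is to attach a throwaway container that bundles tcpdump, e.g. the nicolaka/netshoot image, to the test container's network namespace:

docker run --rm --net container:session-test nicolaka/netshoot tcpdump -ni eth0 host wttr.in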

Some things I have noticed that I do not think were previously mentioned: if I look at a trace from the host, I see the TCP SYNs going out and TCP SYN-ACKs coming back from the server, but these are not passed on to the container. If I start up another container while the first is still unsuccessfully trying the 4th attempt, it is also unable to reach the same destination, but it is able to reach other destinations.

docker run -it python:buster bash
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out
root@14437db6e250:/# curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out

The cause of the issue seems to have something to do with using sessions and having a client-side keep-alive interval of >= 60s. If I change to a 30s client keep-alive interval, I do not run into the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@425db0e6590a:/# pip install --quiet requests
root@425db0e6590a:/# cd test/
root@425db0e6590a:/test# python sessions-ka30.py
11:41:35: request 1
Request complete, sleep 30
11:42:06: request 2
Request complete, sleep 120
11:44:06: request 3
Request complete, sleep 420
11:51:06: request 4
Request complete, sleep 420
root@425db0e6590a:/test#

sessions-ka30.py

import requests
from datetime import datetime
import time

import socket
from requests.adapters import HTTPAdapter

# HTTPAdapter subclass that passes socket options (here, TCP keepalive
# settings) down to the connection pool.
class HTTPAdapterWithSocketOptions(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.socket_options = kwargs.pop("socket_options", None)
        super(HTTPAdapterWithSocketOptions, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        if self.socket_options is not None:
            kwargs["socket_options"] = self.socket_options
        super(HTTPAdapterWithSocketOptions, self).init_poolmanager(*args, **kwargs)

# Enable TCP keepalive with a 30s idle time and 30s probe interval; with a
# keepalive interval below 60s the timeout does not occur.
KEEPALIVE_INTERVAL = 30
adapter = HTTPAdapterWithSocketOptions(socket_options=[
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_INTERVAL),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPALIVE_INTERVAL),
])
s = requests.Session()
s.mount("http://", adapter)
s.mount("https://", adapter)

steps = [30, 120, 420, 420]
step = 1
for i in steps:
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    r3 = s.get('https://wttr.in')
    print("Request complete, sleep " + str(i))
    step += 1
    time.sleep(i)

I hope this information helps in resolving the issue or provides a workaround for others experiencing it.

I have also added this information to https://github.com/docker/for-win/issues/8861

@rossinineto

This issue was opened on 23 June and remains unsolved.
The related issues linked above date back more than a year.

@nk9

nk9 commented Nov 8, 2022

I'm also running into this on Mac. The problem gets progressively more frequent until the Docker Desktop process (and thus vpnkit) is restarted. I've gone back to Docker Desktop 4.5.0 for the time being. Would really like to see this resolved so we can begin upgrading Docker again.

@robertnisipeanu

This issue was fixed for me on macOS after editing ~/Library/Group\ Containers/group.com.docker/settings.json and setting vpnKitMaxPortIdleTime from 300 to 0 (a Docker Desktop restart is required afterwards). I changed this over a week ago and have not encountered the issue again since.
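
For anyone who wants to script that change, here is a minimal sketch (assuming the settings.json path above; back up the file first):

import json
from pathlib import Path

# Docker Desktop settings file on macOS, as mentioned above.
settings_path = Path.home() / "Library/Group Containers/group.com.docker/settings.json"

settings = json.loads(settings_path.read_text())
settings["vpnKitMaxPortIdleTime"] = 0  # reported above as changed from 300 to 0
settings_path.write_text(json.dumps(settings, indent=2))

# Restart Docker Desktop afterwards for the change to take effect.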

@mavci
Author

mavci commented Nov 8, 2022

I was having this issue on a Linux server (not Docker Desktop), and it was fixed by changing the network mode to host. But I see this only as a workaround, because we expect it to work normally with vpnkit and bridge network mode as well. I am still waiting for an update and a fix for this issue.
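
For reference, host networking can be requested per container with Docker's --network flag; a minimal example reusing the image from the repro above (the container name is just a placeholder):

docker run --network host --name session-test-host -it python:buster bash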

@DirkvanWijk

I can also confirm that I have been having the same issues since Docker 4.6.

@nk9

nk9 commented Dec 13, 2022

Sorry for the unsolicited tag, @djs55 and @avsm, but I was hoping to get some visibility on this. It's affecting lots of Docker Desktop users, who are sticking with 4.5 (February 2022) for now as a workaround. There are repro steps here and in the linked issue (docker/for-win#8861), but I haven't seen any acknowledgement that the VPNKit team is aware of the issue. Apologies if I missed it!

@tristanbrown

This is still an issue. Timeouts in Docker containers worsen after they have been running for a few days. This bug is affecting countless developers across all fields and should be prioritized.

@tutcugil

We have the same issue; is there any update on this? It is a very critical bug and still isn't resolved.

@djs55
Collaborator

djs55 commented Apr 18, 2023

I've got an experimental developer build which might help. If you'd like to try it, it's here:

@tutcugil

I've got an experimental developer build which might help. If you'd like to try it, it's here:

Hello @djs55, thank you, I will try it and let you know.

Best Regards

@tutcugil

@djs55, our networking problem seems to be resolved so far; we are still observing.
Will this experimental build be released in Docker soon?

@oleh-hudyma

@djs55 when can we expect a release version of this build?

I see that the latest version, 4.19, does not have this fix.

@MrCSharp22

MrCSharp22 commented Jun 18, 2023

This remains an issue on my end, running Docker in WSL2 or old-school with Hyper-V on Windows 11. Version 4.19 didn't resolve the problem. I am currently on version 4.20.1 (Build 110738), and the frequency of outbound connection timeouts (including DNS queries to various DNS servers) seems to have increased.

Any updates on this would be much appreciated. I am having to restart the Docker instance every couple of hours, which is very inconvenient.

Edit:

Adding some more info on fixes I tried so far:

  • Setting vpnKitMaxPortIdleTime from 300 to 0 didn't fix it.
  • Disabling vpnKit through settings.json didn't help.
  • Increasing the number of ephemeral TCP/UDP ports that Windows can open for outbound connections to ~32k didn't help (by default, Windows allows ~16k).
  • Running Docker on the latest WSL2 or Hyper-V shows the same issue.
