Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: mux simulator cannot handle new requests due to the listen backlog full #16495

Closed
lolyu opened this issue Jan 13, 2025 · 0 comments · Fixed by #16651
Closed

[Bug]: mux simulator cannot handle new requests due to the listen backlog full #16495

lolyu opened this issue Jan 13, 2025 · 0 comments · Fixed by #16651

Comments

@lolyu
Copy link
Contributor

lolyu commented Jan 13, 2025

Issue Description

dualtor nightly suffers from http request timeout:

2025 Jan 13 11:40:09.678044 str2-7050cx3-acs-10 WARNING pmon#CCmisApi: y_cable_port 5: attempt=3, GET http://10.64.246.133:8080/mux/vms21-3/8 for physical_port 5 failed with TimeoutError('timed out')
2025 Jan 13 11:40:09.678044 str2-7050cx3-acs-10 NOTICE pmon#CCmisApi: y_cable_port 5: Sleep 1 seconds to retry GET http://10.64.246.133:8080/mux/vms21-3/8 for physical port 5
2025 Jan 13 11:40:13.073700 str2-7050cx3-acs-10 WARNING pmon#CCmisApi: y_cable_port 1: attempt=6, GET http://10.64.246.133:8080/mux/vms21-3/0 for physical_port 1 failed with TimeoutError('timed out')
2025 Jan 13 11:40:13.073700 str2-7050cx3-acs-10 WARNING pmon#CCmisApi: y_cable_port 1: Retry GET http://10.64.246.133:8080/mux/vms21-3/0 for physical port 1 timeout after 30 seconds, attempted=6
2025 Jan 13 11:40:13.073700 str2-7050cx3-acs-10 WARNING pmon#CCmisApi: Error: Could not get active side for cli command show mux hwmode muxdirection logical port Ethernet0 and physical port 1

Mux simulator cannot handle new requests to to the listen backlog is full; it appeared that there are too many connections in CLOSE_WAIT; so mux simulator cannot accept new connections due to this socket fd leak.

# ss -lnt | grep 8080
LISTEN   1025         1024                0.0.0.0:8080             0.0.0.0:*
# netstat -s | grep -i listen
    472273 times the listen queue of a socket overflowed
    472277 SYNs to LISTEN sockets dropped
# netstat -ant | grep 8080 | grep CLOSE_WAIT | wc -l
1313

how to reproduce?

The CLOSE_WAIT sockets are created by the HTTP requests from the following code:

def _post(server_url, data):
"""
Helper function for posting data to y_cable server.
Args:
server_url: a str, the full address of mux server, like http://10.0.0.64:8080/mux/vms17-8[/1/drop|output]
data: data to post {"out_sides": ["nic", "upper_tor", "lower_tor"]}
Returns:
True if succeed. False otherwise
"""
try:
session = Session()
if "allowed_methods" in inspect.signature(Retry).parameters:
retry = Retry(total=3, connect=3, backoff_factor=1,
allowed_methods=frozenset(['GET', 'POST']),
status_forcelist=[x for x in requests.status_codes._codes if x != 200])
else:
retry = Retry(total=3, connect=3, backoff_factor=1,
method_whitelist=frozenset(['GET', 'POST']),
status_forcelist=[x for x in requests.status_codes._codes if x != 200])
session.mount('http://', HTTPAdapter(max_retries=retry))
server_url = '{}?reqId={}'.format(server_url, uuid.uuid4()) # Add query string param reqId for debugging
logger.debug('POST {} with {}'.format(server_url, data))
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
resp = session.post(server_url, json=data, headers=headers, timeout=10)
logger.debug('Received response {}/{} with content {}'.format(resp.status_code, resp.reason, resp.text))
return resp.status_code == 200
except Exception as e:
logger.warn("POST {} with data {} failed, err: {}".format(server_url, data, repr(e)))
return False

It appeared that, the toggle-all command that calls this function can take more than 10 seconds to finish, which causes the post request timeout and close the tcp connection on the client side. The TCP socket on the server side (mux simulator) transits into CLOSE_WAIT and mux simulator doesn't close them, which results in a socket fd leak.

Results you see

As the description.

Results you expected to see

As the description.

Is it platform specific

generic

Relevant log output

As the description.

Output of show version

As the description.

Attach files (if any)

No response

yxieca pushed a commit that referenced this issue Jan 24, 2025
This is to fix #16495.
The root cause is that, on dualtor testbed, the post request to mux_simulator to toggle all mux ports can take up to 15+ seconds to finish; but the post timeout is 10s, so the post request to toggle all mux ports always timeout.

Here is the workflow for a HTTP requests to mux_simulator:

#
# how does a HTTP request to mux_simulator work?
#         +------------+       +------------+      +-------------+       +--------------+
# ------->|listen queue+------>|accept queue+----->|mux simulator+------>| mux simulator|
#         +------------+       +------------+      |  dispatcher |       |handler worker|
#                                                  +-------------+       +--------------+
The HTTP requests are handled by TCP listen/accept queue first; after the connection is established, mux_simulator will consume the request and dispatch the request to workers to handle the request.
Back to the toggle all mux ports post request, the client times out and disconnects, but the request has already been dispatched by the mux_simulator, its handler worker is still busy handling the request. If our test code send more timeout requests to mux_simulator, those inundated requests could overload the mux_simulator handler workers and a kind of denial-of-service occurs. Even worse, with more HTTP requests coming, the listen/accept queue will overflow.

This PR is to stop creating timeout requests to mux_simulator.

Signed-off-by: Longxiang Lyu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant