
[BUG] Failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded #389

Closed
pallavi-mandole opened this issue Nov 7, 2023 · 14 comments · Fixed by #480

Comments

@pallavi-mandole

Describe the bug
A reconciler failure was reported when we tried to scale pods in/out.
The reconciler job is scheduled to run every 5 minutes but fails with the following errors:
[error] failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded.
[error] failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
[verbose] reconciler failure: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded.

Current Behavior
We deploy 500 pod replicas, which creates the same number of IPAM pod references. When we scale in to 1 replica, the pods are removed successfully but 130 pod references are left behind. I ran two series of scale in/out followed by a release uninstall and redeployment, and the same issue occurred every time: 130 pod references are left undeleted after scaling in.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy 500 pod replicas; the same number of IPAM pod references are created.
  2. Scale in to 1 replica; the pods scale down successfully but 130 pod references are left behind.
  3. Repeat this scale in/out sequence twice, then uninstall and redeploy the release. The same issue occurs every time: 130 pod references are left undeleted after scaling in.

Environment:

  • Whereabouts version : 0.6.2
  • Kubernetes version (use kubectl version): N/A
  • Network-attachment-definition: N/A
  • Whereabouts configuration (on the host): N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Others: N/A

Additional info / context
Add any other information / context about the problem here.

@adilGhaffarDev

@dougbtv kindly check this issue.

@smoshiur1237

smoshiur1237 commented Dec 5, 2023

Got a response from @dougbtv with a suggestion to disable the overlapping IP addresses feature. We are checking whether this issue can be fixed by disabling it.
https://github.com/k8snetworkplumbingwg/whereabouts/tree/master#overlapping-ranges

@smoshiur1237

@dougbtv @andreaskaris
We are using the overlapping IP ranges feature with the k8s storage backend, so disabling it will not serve our goal. I think this is a genuine bug that needs your attention and help to overcome.

@smoshiur1237

smoshiur1237 commented Dec 7, 2023

I would like to explain the issue in steps for a better view:

  1. IPs are stored in the whereabouts ippools CRD. In addition, an overlapping IP ranges CRD is used to store all IPs (which creates one object for each IP).
  2. The issue occurs when the reconciler code tries to fetch all the objects under overlappingipranges. As the number of IPs grows, the number of overlappingipranges objects grows (one for each IP).
  3. On initialization, the reconciler tries to list all objects under the overlappingipranges CRD and hits a context deadline.

Link to the code where we get the error: code

@adilGhaffarDev

If it's related to a timeout, we can fix it in 2 ways:

@smoshiur1237

/cc @manuelbuil
Hi, we have been facing this issue for a long time and couldn't find a fix or workaround. Would you please take a look at it?

@marseel

marseel commented May 24, 2024

Hi all, I am coming from k8s sig-scalability to help you with fixing this issue.

The error client rate limiter Wait returned an error: context deadline exceeded indicates that many requests were issued at the same time and were throttled on the client side (nothing to do with the k8s control plane itself).

This particular PR, #438, won't really help; it will probably even make things worse, as you will be issuing more calls.

So how it works:

  • client-go rate-limits the QPS of requests - the default is 5 QPS.
  • I am guessing you are issuing a batch of requests (probably 1 per pod in this case?), which are queued in client-go. So with 500 requests and 5 QPS, it will take 100 seconds for them to finish.
  • You then also issue a List request, which is added to the end of the request queue with a timeout on its context. Because there are already 100 seconds' worth of requests waiting in the queue, it fails with client rate limiter Wait returned an error: context deadline exceeded. Essentially, that request never even reaches the k8s control plane; it times out on the client side, just as the error states.

The timeout you are now setting in the context is not just the request timeout; it covers the sum of "waiting in the queue" and "request time".
If you want a timeout for the request only, you can specify it by setting TimeoutSeconds in metav1.ListOptions instead.
For List operations, I would recommend setting the timeout mentioned above to 1 minute, as the official SLO for List requests states 30s for 99% of requests: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md#definition
(this SLO covers only the request itself, not waiting in the queue).
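To make the distinction concrete, here is a minimal Go sketch (illustrative only, not the whereabouts code) contrasting the two kinds of timeout; the client handle and the use of the Pods resource are assumptions for the example:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listTwoWays(client kubernetes.Interface) error {
	// Variant 1: a context deadline. This bounds the *total* time, including
	// the time the request spends waiting in client-go's rate-limiter queue,
	// so with a long queue it can fail with "client rate limiter Wait
	// returned an error: context deadline exceeded" before the request is
	// ever sent to the API server.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if _, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{}); err != nil {
		return err
	}

	// Variant 2: TimeoutSeconds bounds only the server-side request duration;
	// time spent waiting for the client-side rate limiter is not counted.
	timeout := int64(60)
	_, err := client.CoreV1().Pods("").List(context.Background(),
		metav1.ListOptions{TimeoutSeconds: &timeout})
	return err
}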

At a high level, I would recommend not using List at all and using an informer instead.
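A rough sketch of that informer-based alternative, again assuming pods are the resource being listed (whereabouts would apply the same idea to its reservation objects); the startPodLister helper is a hypothetical name:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startPodLister starts a shared informer for pods and returns a function
// that lists them from the informer's local cache instead of issuing a
// fresh List request to the API server on every reconcile run.
func startPodLister(client kubernetes.Interface, stopCh <-chan struct{}) (func() ([]*corev1.Pod, error), error) {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods()

	factory.Start(stopCh)
	if !cache.WaitForCacheSync(stopCh, podInformer.Informer().HasSynced) {
		return nil, fmt.Errorf("pod informer cache failed to sync")
	}

	return func() ([]*corev1.Pod, error) {
		// Served from the in-memory cache; no API request, no rate limiting.
		return podInformer.Lister().List(labels.Everything())
	}, nil
}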

Hope this helps 🤞

@mlguerrero12

Hi @smoshiur1237, I'll work on this.

@smoshiur1237

@mlguerrero12 Thanks, I proposed a fix that increases the RequestTimeout. Would you please take a look? It should fix the issue.
PR

@mlguerrero12

mlguerrero12 commented May 30, 2024

It might fix the issue for 500 pods, but as I mentioned in your PR, we have a customer reporting this issue with 100 nodes and 30k pods. I'll explore other options and let you know.

@mlguerrero12

@smoshiur1237, I don't believe this issue can be solved by increasing the request timeout. Also, the reconciler job doesn't make a batch of requests before listing the cluster-wide reservations. What it does is list the pods and IP pools.

The root of the issue is that this reconciler job is expected to finish within 30 seconds. A context is created with this duration and used as the parent for the contexts of all requests. So, in large clusters, this parent context expires by the time the cluster-wide reservations have to be listed. If you check the logic for listing pods, it uses the same duration that was set for the parent (30s).

What I'm going to do is remove this parent context and use 30 seconds for all listing operations (supported by the SLO @marseel mentioned above). All other types of requests will continue using the RequestTimeout value (10s). I will also use pagination for listing pods and cluster-wide reservations, as sketched below.
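A minimal sketch of that plan, under the same assumptions as the earlier examples: each List gets its own 30-second context rather than sharing a parent, and results are fetched in pages via Limit/Continue. The listAllPods helper and the 500-item page size are illustrative, not the actual whereabouts implementation:

package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const listTimeout = 30 * time.Second

// listAllPods fetches pods page by page. Each page gets its own 30s context,
// so no shared parent context can expire halfway through the reconcile run.
func listAllPods(client kubernetes.Interface) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		ctx, cancel := context.WithTimeout(context.Background(), listTimeout)
		page, err := client.CoreV1().Pods("").List(ctx, opts)
		cancel()
		if err != nil {
			return nil, err
		}
		pods = append(pods, page.Items...)
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
	return pods, nil
}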

@mlguerrero12

@smoshiur1237, @adilGhaffarDev, could you please share the complete reconciler logs from when this issue happens?

@smoshiur1237

@mlguerrero12 here are the original logs for this issue from a running whereabouts pod, where I have trimmed similar repeated entries for readability:

2023-10-27T11:50:38Z [debug] the IP reservation: IP: x:y::1:2c is reserved for pod: bat-t1/cnf-complex-t1-2-stor-rwo-0
2023-10-27T11:50:38Z [debug] pod reference bat-t1/cnf-complex-t1-2-stor-rwo-0 matches allocation; Allocation IP: x:y::1:2c; PodIPs: map[x.y.1.44:{} x:y::1:2c:{}]
2023-10-27T11:50:38Z [error] failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
2023-10-27T11:50:38Z [error] failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
2023-10-27T11:50:38Z [verbose] reconciler failure: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
2023-10-27T11:55:00Z [verbose] starting reconciler run
2023-10-27T11:55:00Z [debug] NewReconcileLooper - inferred connection data
2023-10-27T11:55:00Z [debug] listing IP pools
2023-10-27T11:55:37Z [debug] Added IP x.y.1.130 for pod bat-t1/cnf-complex-t1-1-dpdk-0
2023-10-27T11:55:37Z [debug] Added IP x.y.1.112 for pod bat-t1/cnf-complex-t1-1-dpdk-0
2023-10-27T11:55:37Z [debug] the IP reservation: IP: x:y::1:25 is reserved for pod: bat-t1/cnf-complex-t1-1-stor-rwo-1
2023-10-27T11:55:37Z [debug] pod reference bat-t1/cnf-complex-t1-1-stor-rwo-1 matches allocation; Allocation IP: x:y::1:25; PodIPs: map[x.y.1.37:{} x:y::1:25:{}]
2023-10-27T11:55:37Z [error] failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
2023-10-27T11:55:37Z [error] failed to create the reconcile looper: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded
2023-10-27T11:55:37Z [verbose] reconciler failure: failed to list all OverLappingIPs: client rate limiter Wait returned an error: context deadline exceeded

mlguerrero12 added a commit to mlguerrero12/whereabouts that referenced this issue Jun 14, 2024
mlguerrero12 added a commit to mlguerrero12/whereabouts that referenced this issue Jun 19, 2024
The parent timeout context of 30s was removed. All listing operations
used by the cronjob reconciler have 30s as their timeout.

Fixes k8snetworkplumbingwg#389

Signed-off-by: Marcelo Guerrero <[email protected]>
@smoshiur1237

Opened a new issue to track the pod reference problem:
#483

mlguerrero12 added a commit to mlguerrero12/whereabouts-cni that referenced this issue Jul 30, 2024
mlguerrero12 added a commit to mlguerrero12/whereabouts-cni that referenced this issue Aug 5, 2024