k8s: check and possibly optimise the launch of pending pods #681

Open

tiborsimko opened this issue Dec 4, 2022 · 2 comments

@tiborsimko (Member)
There was a situation in a cluster running many concurrent workflows, which generated many jobs: a large number of those jobs were stuck in the Pending state because the cluster did not have enough memory resources to run them all.

For example, here is one snapshot in time:

$ kgp | grep reana-run-j | grep -c Running
110

$ kgp | grep reana-run-j | grep -c Pending
71

This means that only about 60% of the jobs were running; the remaining 40% were pending, some of them for many hours.
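
(For reference, kgp in the listings here is presumably a local shell alias along the following lines; this is an assumption, but the wide output columns in the listings below match it.)

$ alias kgp='kubectl get pods -o wide'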

Some nodes were really busy, for example:

$ kubectl top nodes -l reana.io/system=runtimejobs
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mycluster-node-12   4175m        52%    10308Mi         73%

$ kgp | grep node-12
reana-run-job-bda3d5d0-23da-494d-9973-5992707fed3f-hg8js          1/1     Running     0          11h     10.100.164.56    mycluster-node-12   <none>           <none>
reana-run-job-cd68d874-5bf5-44ba-a9bf-d4425ed5f466-prslv          1/1     Running     0          4h      10.100.164.12    mycluster-node-12   <none>           <none>
reana-run-job-ea877fa3-e346-4e3b-92d0-d7dabf2ce66b-sm695          1/1     Running     0          15h     10.100.164.63    mycluster-node-12   <none>           <none>
reana-run-job-fdac7a57-7e53-434b-a7e9-0231803bbcfa-8gnd2          1/1     Running     0          11h     10.100.164.14    mycluster-node-12   <none>           <none>

However, other nodes were less so, for example:

$ kubectl top nodes -l reana.io/system=runtimejobs
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mycluster-node-33   3156m        39%    8596Mi          61%
mycluster-node-35   2073m        25%    5263Mi          37%

$ kgp | grep node-33
reana-run-job-4f95d722-7a41-456c-9437-4c7842f416aa-fh428          1/1     Running       0          21m     10.100.65.74     mycluster-node-33   <none>           <none>
reana-run-job-751198a7-3596-473a-bb29-7ab9e97456ad-2jscw          1/1     Running       0          63m     10.100.65.66     mycluster-node-33   <none>           <none>
reana-run-job-9ab284e3-2e9e-47f2-be86-37664109fcb4-p2z6p          1/1     Running       0          63m     10.100.65.116    mycluster-node-33   <none>           <none>

$ kgp | grep node-35
reana-run-job-9659c2d7-403a-4392-8b7b-1d1525be3bee-lh9bq          1/1     Running     0          5m13s   10.100.12.239    mycluster-node-35   <none>           <none>
reana-run-job-9c71d885-3e6c-423d-a4ad-d3da2e9df5d8-rb69x          1/1     Running     0          3m42s   10.100.12.244    mycluster-node-35   <none>           <none>

It seems that our pending pods aren't being scheduled as rapidly as they could be in theory (e.g. node-33 and node-35 above had free capacity).

Here is one such Pending pod described:

$ kubectl describe pod reana-run-job-ea984ce5-6f04-4b92-893e-6d271d2a5454-22gnp | tail -7
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  81m    default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 39 Insufficient memory, 8 node(s) were unschedulable.
  Warning  FailedScheduling  11m    default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 38 Insufficient memory, 8 node(s) were unschedulable.
  Warning  FailedScheduling  5m44s  default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 39 Insufficient memory, 8 node(s) were unschedulable.

Let's verify our Kubernetes cluster settings related to the behaviour of Pending pods, and let's see whether we could make the memory checks and the scheduling of these pending pods faster.
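
As a starting point for that verification, a couple of generic commands can quantify the backlog and show which scheduling predicate dominates over time (a sketch; it assumes cluster-wide read access and the standard kube-scheduler events):

$ kubectl get pods --field-selector=status.phase=Pending \
    --sort-by=.metadata.creationTimestamp | grep -c reana-run-job

$ kubectl get events --all-namespaces --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp | tail -20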

@tiborsimko (Member, Author)

More information about situations of this kind.

Here is a node which appears memory-busy:

$ kubectl top nodes | grep 'node-3 '
mycluster-node-3     4521m        56%    10261Mi         73%

and it is running four user jobs:

$ kgp | grep 'node-3 '
reana-run-job-232b1162-c04e-46d6-9018-843ee84a072c-g5mzz          2/2     Running     0          14h     10.0.0.0    mycluster-node-3    <none>           <none>
reana-run-job-6a86b3ae-c0a1-4b14-ae53-e8b8fb15d97a-5thvb          2/2     Running     0          4h6m    10.0.0.0    mycluster-node-3    <none>           <none>
reana-run-job-a3137e4c-48c7-4d3a-a073-335b18761231-bjpns          2/2     Running     0          4m5s    10.0.0.0    mycluster-node-3    <none>           <none>
reana-run-job-ee000c25-32f1-4375-900c-e4ae8a9fb30c-wrmxt          2/2     Running     0          11m     10.0.0.0    mycluster-node-3    <none>           <none>

Each job requests 3Gi of memory, i.e. about 12Gi in total, but the jobs actually use much less, only about 5Gi:

$ kgp | grep 'node-3 ' | awk '{print $1}' | xargs -n1 kubectl top pod --no-headers
reana-run-job-232b1162-c04e-46d6-9018-843ee84a072c-g5mzz   1012m   1300Mi
reana-run-job-6a86b3ae-c0a1-4b14-ae53-e8b8fb15d97a-5thvb   965m   1239Mi
reana-run-job-a3137e4c-48c7-4d3a-a073-335b18761231-bjpns   941m   1167Mi
reana-run-job-ee000c25-32f1-4375-900c-e4ae8a9fb30c-wrmxt   956m   1114Mi
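
As a cross-check of this requested-versus-used gap, the node-side view is available in the "Allocated resources" section of the node description (node name as in the example above):

$ kubectl describe node mycluster-node-3 | grep -A 8 'Allocated resources'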

This is because jobs are assigned a default memory request of 3 Gi if the user does not specify any other value.

If, instead of silently adding a 3 Gi memory request to each job, we let the jobs consume as much memory as they wish, and run a parallel "memory watcher" DaemonSet on the nodes that monitors user job pods and kills any that start to consume too much memory, then in this very example we would be able to pack twice as many jobs onto the nodes as we do now.

(And if a user does request, say, 4 Gi, we would simply respect that. This would only change the default behaviour when a user does not ask for any specific memory limit.)
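
For illustration only, here is roughly what a lower default request combined with the current 3 Gi cap would look like in a pod spec. This is not REANA's actual job template (pod name, image and values are hypothetical), but it shows the built-in Kubernetes mechanism closest to the "memory watcher" idea: the scheduler packs pods by their request, while the limit still gets runaway containers OOM-killed.

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: reana-run-job-example
spec:
  restartPolicy: Never
  containers:
    - name: job
      image: busybox            # hypothetical; real jobs run the user's image
      command: ["sh", "-c", "sleep 3600"]
      resources:
        requests:
          memory: "1Gi"         # what the scheduler reserves on the node
        limits:
          memory: "3Gi"         # hard cap; exceeding it OOM-kills the container
EOF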

@mrceyhun
Sorry to jump in!

Monitoring the average memory and the (max) PeakRSS metric might help to define a minimum requests.memory and a maximum limits.memory. In the node-3 example, let's assume this scenario (I don't know whether this is the worst case or a common one):

  • The minimum default memory is defined as, say, 2Gi, and a new job starts running with 2Gi of memory usage.
  • The node's total memory usage becomes 10 + 2 = 12Gi, while its maximum capacity is 14Gi.
  • One of the previously running jobs "momentarily" needs 2+Gi of memory on top of its ~1Gi usage.
  • So that job will be killed/restarted because there is not enough memory.
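
For the average/PeakRSS monitoring mentioned above, one possible data source (assuming the kubelet's cAdvisor metrics are reachable through the API server proxy) is the container_memory_max_usage_bytes high-watermark metric, for example:

$ kubectl get --raw /api/v1/nodes/mycluster-node-3/proxy/metrics/cadvisor \
    | grep '^container_memory_max_usage_bytes' \
    | grep reana-run-job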
