
Web hook pod (frr-k8s-webhook-server) is restarting at least 3 times before healthy #494

Open
karampok opened this issue Sep 19, 2024 · 2 comments

@karampok

When running the E2E test that checks frr-k8s (https://github.com/metallb/metallb-operator/blob/main/test/e2e/functional/tests/e2e.go#L282), the test is green, but the pod restarts several times before it becomes healthy/ready (see the sketch after the watch output below):

kubectl -n metallb-system get pods -l component=frr-k8s-webhook-server -o wide -w
NAME                                      READY   STATUS             RESTARTS     AGE   IP           NODE          NOMINATED NODE   READINESS GATES
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     CrashLoopBackOff   2 (6s ago)   29s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Running            3 (22s ago)   45s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   1/1     Running            3 (38s ago)   61s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   1/1     Terminating        3 (69s ago)   92s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   <none>       kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Pending            0             0s    <none>       <none>        <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Pending            0             0s    <none>       kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     ContainerCreating   0             0s    <none>       kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             0             1s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Completed           0             2s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             1 (2s ago)    3s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Error               1 (4s ago)    5s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     CrashLoopBackOff    1 (6s ago)    10s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             2 (20s ago)   24s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Completed           2 (21s ago)   25s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     CrashLoopBackOff    2 (2s ago)    26s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             3 (33s ago)   57s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   1/1     Running             3 (46s ago)   70s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   1/1     Terminating         3 (78s ago)   102s   10.244.1.5   kind-worker2   <none>           <none>
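
Since the E2E test only waits for readiness, a restart burst like the one above goes unnoticed. Below is a minimal sketch of a stricter check, assuming a client-go clientset is available in the test; the function name and its wiring into the test's assertions are illustrative, not part of the linked test:

// Sketch only: count container restarts across the webhook pods so the e2e
// test can assert the count is zero once the pod reports ready.
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// webhookPodRestarts sums RestartCount over all containers of the
// frr-k8s-webhook-server pods in metallb-system.
func webhookPodRestarts(ctx context.Context, cs kubernetes.Interface) (int32, error) {
	pods, err := cs.CoreV1().Pods("metallb-system").List(ctx, metav1.ListOptions{
		LabelSelector: "component=frr-k8s-webhook-server",
	})
	if err != nil {
		return 0, fmt.Errorf("listing webhook pods: %w", err)
	}
	var restarts int32
	for _, pod := range pods.Items {
		for _, status := range pod.Status.ContainerStatuses {
			restarts += status.RestartCount
		}
	}
	return restarts, nil
}
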
@DanielOsypenko commented Nov 5, 2024

With the latest 4.16 MetalLB we get an ImagePullBackOff on the controller, speaker, and frr-k8s pods; the pods are partially deployed, with 1/2 or 4/6 containers ready. It might be a related issue, but the outcome is worse.

oc get csv
NAME                                         DISPLAY                          VERSION               REPLACES                                     PHASE
ingress-node-firewall.v4.16.0-202409051837   Ingress Node Firewall Operator   4.16.0-202409051837   ingress-node-firewall.v4.16.0-202410011135   Succeeded
metallb-operator.v4.16.0-202410292005        MetalLB Operator                 4.16.0-202410292005   metallb-operator.v4.16.0-202410251707        Succeeded 

The webhook pod's logs show TLS handshake errors (see the diagnostic sketch after the excerpt):

(*runnableGroup).reconcile.func1\n\t/metallb/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223"}
2024/10/30 05:54:02 http: TLS handshake error from 10.130.0.41:48190: remote error: tls: bad certificate
2024/10/30 05:54:03 http: TLS handshake error from 10.130.0.41:48200: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48206: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48208: remote error: tls: bad certificate
2024/10/30 05:54:06 http: TLS handshake error from 10.130.0.41:48218: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58904: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58916: remote error: tls: bad certificate
2024/10/30 05:54:09 http: TLS handshake error from 10.130.0.41:58918: remote error: tls: bad certificate
2024/10/30 05:54:11 http: TLS handshake error from 10.130.0.41:58928: remote error: tls: bad certificate
2024/10/30 05:54:14 http: TLS handshake error from 10.130.0.41:58940: remote error: tls: bad certificate
2024/10/30 05:54:15 http: TLS handshake error from 10.130.0.41:58954: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58964: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58978: remote error: tls: bad certificate
2024/10/30 05:54:18 http: TLS handshake error from 10.130.0.41:44000: remote error: tls: bad certificate
2024/10/30 05:54:20 http: TLS handshake error from 10.130.0.41:44014: remote error: tls: bad certificate
2024/10/30 05:54:23 http: TLS handshake error from 10.130.0.41:44024: remote error: tls: bad certificate
2024/10/30 05:54:24 http: TLS handshake error from 10.130.0.41:44038: remote error: tls: bad certificate
2024/10/30 05:54:26 http: TLS handshake error from 10.130.0.41:44052: remote error: tls: bad certificate
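
"remote error: tls: bad certificate" in the server's logs usually means the peer (here, most likely the API server calling the webhook) rejected the certificate the webhook presented, e.g. because the caBundle in the webhook configuration no longer matches the serving certificate. A minimal diagnostic sketch, assuming the service DNS name below (frr-k8s-webhook-service is named later in this comment; the namespace and port are assumptions) and run from a pod inside the cluster; it only inspects the presented certificate:

// Diagnostic sketch (not from the issue): print the certificate the webhook
// server actually presents, so its issuer and validity window can be compared
// with the caBundle injected into the webhook configuration.
package main

import (
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	// Assumed in-cluster DNS name and port; adjust to match your cluster.
	addr := "frr-k8s-webhook-service.metallb-system.svc:443"

	// InsecureSkipVerify is acceptable here: we only read the presented
	// certificate and do not trust the connection with anything.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatalf("TLS dial failed: %v", err)
	}
	defer conn.Close()

	for _, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Printf("subject=%s issuer=%s notBefore=%s notAfter=%s\n",
			cert.Subject, cert.Issuer, cert.NotBefore, cert.NotAfter)
	}
}
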

The hosting cluster lacks these services:

frr-k8s-monitor-service 
frr-k8s-webhook-service 

Hosted KubeVirt clusters fail to pull images and deploy operators, showing a DeadlineExceeded error.


Another cluster, which uses the latest 4.17 version, has the same ImagePullBackOff errors on controller, speaker, and frr-k8s, but it seems to be working as expected.

oc get csv
NAME                                         DISPLAY                          VERSION               REPLACES                                     PHASE
ingress-node-firewall.v4.17.0-202410011205   Ingress Node Firewall Operator   4.17.0-202410011205   ingress-node-firewall.v4.17.0-202410211206   Succeeded
metallb-operator.v4.17.0-202410241236        MetalLB Operator                 4.17.0-202410241236   

@fedepaol (Member) commented Nov 5, 2024

@DanielOsypenko there's no 4.16 version here; this is the community version of the operator. If this is happening on OpenShift, I suggest following up through Red Hat channels.
