On 2024-11-28 at 02:33 UTC, Uptime Robot triggered our storage-hosted monitors: Library, mirrorbrain, ZF drive, master-mirror, openzim-dl, offspot-drive.
All services came back between 6 and 12 minutes later.
Uptime Robot reported a Connection Timeout, which can mean several things. Here's a recap from the logs:
At 02:29:57, nginx-ingress received its last working UptimeRobot request.
At 02:30:51, nginx-ingress on storage could not contact the Control Plane (client connection lost).
At 02:30:53, it reported no route to the CP host.
At 02:32:04, kube-proxy reported NodeNotReady.
At 02:32:08, nginx-ingress reported missing endpoints for all the services it talks to.
At 02:37:12, k8s started requesting Pod deletion via TaintManagerEviction.
At 02:43:36, k8s received a NodeReady event from storage.
At 02:43:37, k8s started scheduling, pulling, creating and starting all pods.
At 02:44:xx, the last requests to library-backend were still failing (connection refused), indicating the pod was there but kiwix-serve was not ready yet (resulting in 502s).
At 02:44:18, nginx-ingress was still unaware that library-frontend was up.
At 02:44:19, mirrorbrain and library were serving content OK.
Other services came back up as well.
1732760997366 216.144.248.24 - - [28/Nov/2024:02:29:57 +0000] "HEAD /catalog/v2/entries?count=1 HTTP/1.1" 200 0 "https://library.kiwix.org/catalog/v2/entries?count=1" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 418 0.001 [zim-library-frontend-service-http] [] 100.64.0.142:80 0 0.001 200 3b59f5a660c399e75ba97c4512fca687
...
1732761051568 E1128 02:30:51.568227 7 leaderelection.go:347] error retrieving resource lock ingress-nginx/ingress-nginx-leader: Get "https://10.32.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-nginx/leases/ingress-nginx-leader": http2: client connection lost
...
1732761053197 E1128 02:30:53.197078 7 queue.go:131] "requeuing" err="Get \"https://10.32.0.1:443/api/v1/namespaces/ingress-nginx/services/ingress-nginx-controller\": dial tcp 10.32.0.1:443: connect: no route to host"
...
1732761128060 W1128 02:32:08.059888 7 controller.go:1216] Service "zimfarm/drive-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.059931 7 controller.go:1216] Service "offspot/drive-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.059960 7 controller.go:1216] Service "zim/mirror-download-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.059976 7 controller.go:1216] Service "zim/download-openzim-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.059990 7 controller.go:1216] Service "zim/tmp-kiwix-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060002 7 controller.go:1216] Service "zim/mirrorbrain-web-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060335 7 controller.go:1216] Service "metrics/metrics-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060364 7 controller.go:1216] Service "zim/dev-drive-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060381 7 controller.go:1216] Service "zim/dev-library-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060406 7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060521 7 controller.go:1216] Service "zim/demo-library-webcache-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060666 7 controller.go:1216] Service "zimit/cm-acme-http-solver-5flmn" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060766 7 controller.go:1216] Service "zimit/cm-acme-http-solver-xh2nl" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060648 7 controller.go:1216] Service "zimfarm/drive-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060722 7 controller.go:1216] Service "offspot/drive-service" does not have any active Endpoint.
1732761128060 W1128 02:32:08.060796 7 controller.go:1216] Service "zim/mirror-download-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.060848 7 controller.go:1216] Service "zim/download-openzim-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.060904 7 controller.go:1216] Service "zim/tmp-kiwix-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.060944 7 controller.go:1216] Service "zim/mirrorbrain-web-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.061495 7 controller.go:1216] Service "metrics/metrics-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.061586 7 controller.go:1216] Service "zim/dev-drive-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.061650 7 controller.go:1216] Service "zim/dev-library-service" does not have any active Endpoint.
1732761128061 W1128 02:32:08.061708 7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
1732761128062 W1128 02:32:08.061929 7 controller.go:1216] Service "zim/demo-library-webcache-service" does not have any active Endpoint.
1732761128062 W1128 02:32:08.062158 7 controller.go:1216] Service "zimit/cm-acme-http-solver-5flmn" does not have any active Endpoint.
1732761128062 W1128 02:32:08.062245 7 controller.go:1216] Service "zimit/cm-acme-http-solver-xh2nl" does not have any active Endpoint.
...
1732761440268 E1128 02:37:20.268602 7 leaderelection.go:347] error retrieving resource lock ingress-nginx/ingress-nginx-leader: Get "https://10.32.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-nginx/leases/ingress-nginx-leader": dial tcp 10.32.0.1:443: connect: no route to host
...
1732761816000 name=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a kind=Node eventRV=2063657 reportinginstance=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a reportingcontroller=kubelet sourcecomponent=kubelet sourcehost=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a reason=NodeReady type=Normal count=1 msg="Node scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a status is now: NodeReady"
...
1732761817070 name=library-frontend-deployment-85f4d99874-ndw67 kind=Pod action=Binding objectAPIversion=v1 objectRV=24923410571 eventRV=2063527 reportinginstance=default-scheduler-control-plane-5f4f8f8cb9-6rfsz reportingcontroller=default-scheduler reason=Scheduled type=Normal msg="Successfully assigned zim/library-frontend-deployment-85f4d99874-ndw67 to scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a"
...
1732761818000 name=library-frontend-deployment-85f4d99874-ndw67 kind=Pod objectAPIversion=v1 eventRV=2063561 reportingcontroller=taint-eviction-controller sourcecomponent=taint-eviction-controller reason=TaintManagerEviction type=Normal count=1 msg="Cancelling deletion of Pod zim/library-frontend-deployment-85f4d99874-ndw67"
...
1732761852139 2024/11/28 02:44:12 [error] 105742#105742: *527129157 connect() failed (111: Connection refused) while connecting to upstream, client: 111.119.213.240, server: library.kiwix.org, request: "GET /content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete HTTP/2.0", upstream: "http://100.64.0.212:80/content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete", host: "library.kiwix.org"
1732761852139 111.119.213.240 - - [28/Nov/2024:02:44:12 +0000] "GET /content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete HTTP/2.0" 502 552 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 417 0.000 [zim-library-frontend-service-http] [] 100.64.0.212:80, 100.64.0.212:80, 100.64.0.212:80 0, 0, 0 0.000, 0.000, 0.000 502, 502, 502 da4c7c2614100b90147df0d40a0cc08d
...
W1128 02:44:18.289015 7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
...
1732761859695 138.207.227.14 - - [28/Nov/2024:02:44:19 +0000] "GET /release/kiwix-macos/kiwix-macos_3.5.2.dmg HTTP/1.1" 302 296 "-" "Homebrew/4.4.8 (Macintosh; arm64 Mac OS X 15.1.1) curl/8.7.1" 193 2.490 [zim-mirrorbrain-web-service-80] [] 100.64.0.199:80 296 2.490 302 fbeaa76c0a316fc7f58e3437680238a3
1732761859738 129.222.87.171 - - [28/Nov/2024:02:44:19 +0000] "GET /zim/zimit/tonedear.com_en_2024-09.zim HTTP/1.1" 302 292 "-" "Wget/1.21" 218 1.253 [zim-mirrorbrain-web-service-80] [] 100.64.0.199:80 292 1.253 302 b6a83185563a1bc2a1854dd9f5708047
1732761859772 111.119.236.224 - - [28/Nov/2024:02:44:19 +0000] "GET /content/stackoverflow.com_en_bard_2021-11/questions/10235011/how-are-interrupts-handled-on-smp HTTP/2.0" 200 9209 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 403 1.815 [zim-dev-library-service-80] [] 100.64.0.214:80 9209 1.815 200 b47dbccdea3dd3eeb2484bc982e70e0e
[Graph: CPU and memory of the node]
[Graph: all pods started at the same time]
[Graph: Hetzner traffic]
Based on this, my initial hypothesis is that there was a network issue at Hetzner. We are indeed in the network maintenance window, but this doesn't look like maintenance.
In any case: a network issue (maybe just on the route to Scaleway) meant the node could not contact the CP, so its pods were terminated. Traffic went down because we did not reply. Then the node could reach the CP again, all the pods were restarted, and traffic came back.
This needs to be debated of course, but if confirmed, a mitigation could be to prevent the node.kubernetes.io/not-ready taint from deleting pods (the TaintManagerEviction feature, apparently); see the sketch below.
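For illustration, a minimal sketch of what that could look like at the Deployment/pod-spec level, assuming we simply raise the toleration window for the two NoExecute taints involved (the 86400s value is arbitrary, and which Deployments to patch is still to be decided):

```yaml
# Hypothetical pod-template excerpt: tolerate the NoExecute taints that
# trigger TaintManagerEviction so pods stay put during a CP connectivity loss.
spec:
  template:
    spec:
      tolerations:
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 86400   # default injected value is 300s
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 86400   # default injected value is 300s
```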
It happened again from 9:05 PM (still on 28-11), but this time none of the pods restarted (probably only because the connectivity loss was shorter / it was more connectivity instability than a total loss).
Luckily I was at my keyboard hacking on some personal stuff, so I briefly connected to observe a bit, and this is what I saw:
SSH connection to the storage machine was possible but slow / laggy
even ping was unstable (not always responding)
when requesting a web property in my browser (e.g. https://library.kiwix.org), the nginx ingress very often responded but returned a 503 (meaning it had no idea where the pod was located)
a few jobs (e.g. dev library generation) were pending because the node was tainted
I had the same difficulties connecting to my own personal machine, which is also at Hetzner Helsinki (but in a different DC)
I did not try to connect to the mirrors-qa machine (it could have been useful to confirm where the connectivity issue lies)
That being said, we should be more resilient to this kind of connectivity loss, especially given our very simple architecture (it is not as if the pods not running on one node could be restarted by the CP on another node, which is the main reason for the current default behavior).
In addition to preventing the taint from deleting pods (which I agree is an interesting mitigation), I think we should also take action so that the nginx ingress does not stop serving traffic when the CP is not reachable. In our architecture, losing connectivity between the kubelet and the CP should not prevent normal node operation. So far I have found no way to make this work with "standard" k8s, which assumes very good connectivity between the CP and the nodes and favors moving workloads to new nodes when connectivity to a node is lost. I'm pretty sure something less centralized like a k3s distribution would do a better job here. Disclaimer: I have no strong expertise with k3s at all; probably something to investigate gradually. And there may be other, better solutions; this is just a "popular" one.
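Side note on timings: the ~5 minutes between NodeNotReady (02:32) and the TaintManagerEviction requests (02:37) seems to match the 300s toleration that the apiserver's DefaultTolerationSeconds admission plugin injects into every pod. If we controlled the apiserver (probably not the case with a managed CP), that default could in principle be raised cluster-wide instead of patching each Deployment; a rough, hypothetical sketch (values for illustration only):

```yaml
# Hypothetical kube-apiserver static-pod manifest excerpt; only applicable
# where we run the apiserver ourselves.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --default-not-ready-toleration-seconds=86400
        - --default-unreachable-toleration-seconds=86400
        # ... existing flags unchanged ...
```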