2024-11-28 Storage outage (12 min) #314

Open
rgaudin opened this issue Nov 28, 2024 · 1 comment

rgaudin commented Nov 28, 2024

On 2024-11-28 at 02:33 UTC, Uptime Robot triggered alerts for our storage-hosted monitors: Library, mirrorbrain, ZF drive, master-mirror, openzim-dl, offspot-drive.

All services came back between 6 and 12 minutes later.

Uptime Robot reports a Connection Timeout, which can mean several things. Here's a recap from the logs:

  • At 02:29:57, nginx-ingress received its last working UptimeRobot request.
  • At 02:30:51, nginx-ingress on storage could not contact the Control Plane (client connection lost).
  • At 02:30:53, it reported no route to the CP host.
  • At 02:32:04, kube-proxy reported NodeNotReady.
  • At 02:32:08, nginx-ingress reported that the services it routes to had no active Endpoints for any pods.
  • At 02:37:12, k8s started requesting Pod deletions via TaintManagerEviction.
  • At 02:43:36, k8s received the NodeReady event from storage.
  • At 02:43:37, k8s started scheduling, pulling, creating and starting all pods.
  • At 02:44:xx, the last requests to library-backend were failing (connection refused), indicating the pod was there but kiwix-serve was not ready yet (resulting in 502s).
  • At 02:44:18, nginx-ingress was still unaware of library-frontend being up.
  • At 02:44:19, mirrorbrain and library were serving content OK.
  • The other services came up as well.
1732760997366	216.144.248.24 - - [28/Nov/2024:02:29:57 +0000] "HEAD /catalog/v2/entries?count=1 HTTP/1.1" 200 0 "https://library.kiwix.org/catalog/v2/entries?count=1" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 418 0.001 [zim-library-frontend-service-http] [] 100.64.0.142:80 0 0.001 200 3b59f5a660c399e75ba97c4512fca687
...
1732761051568	E1128 02:30:51.568227       7 leaderelection.go:347] error retrieving resource lock ingress-nginx/ingress-nginx-leader: Get "https://10.32.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-nginx/leases/ingress-nginx-leader": http2: client connection lost
...
1732761053197	E1128 02:30:53.197078       7 queue.go:131] "requeuing" err="Get \"https://10.32.0.1:443/api/v1/namespaces/ingress-nginx/services/ingress-nginx-controller\": dial tcp 10.32.0.1:443: connect: no route to host"
...
1732761128060	W1128 02:32:08.059888       7 controller.go:1216] Service "zimfarm/drive-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.059931       7 controller.go:1216] Service "offspot/drive-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.059960       7 controller.go:1216] Service "zim/mirror-download-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.059976       7 controller.go:1216] Service "zim/download-openzim-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.059990       7 controller.go:1216] Service "zim/tmp-kiwix-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060002       7 controller.go:1216] Service "zim/mirrorbrain-web-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060335       7 controller.go:1216] Service "metrics/metrics-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060364       7 controller.go:1216] Service "zim/dev-drive-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060381       7 controller.go:1216] Service "zim/dev-library-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060406       7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060521       7 controller.go:1216] Service "zim/demo-library-webcache-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060666       7 controller.go:1216] Service "zimit/cm-acme-http-solver-5flmn" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060766       7 controller.go:1216] Service "zimit/cm-acme-http-solver-xh2nl" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060648       7 controller.go:1216] Service "zimfarm/drive-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060722       7 controller.go:1216] Service "offspot/drive-service" does not have any active Endpoint.
1732761128060	W1128 02:32:08.060796       7 controller.go:1216] Service "zim/mirror-download-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.060848       7 controller.go:1216] Service "zim/download-openzim-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.060904       7 controller.go:1216] Service "zim/tmp-kiwix-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.060944       7 controller.go:1216] Service "zim/mirrorbrain-web-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.061495       7 controller.go:1216] Service "metrics/metrics-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.061586       7 controller.go:1216] Service "zim/dev-drive-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.061650       7 controller.go:1216] Service "zim/dev-library-service" does not have any active Endpoint.
1732761128061	W1128 02:32:08.061708       7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
1732761128062	W1128 02:32:08.061929       7 controller.go:1216] Service "zim/demo-library-webcache-service" does not have any active Endpoint.
1732761128062	W1128 02:32:08.062158       7 controller.go:1216] Service "zimit/cm-acme-http-solver-5flmn" does not have any active Endpoint.
1732761128062	W1128 02:32:08.062245       7 controller.go:1216] Service "zimit/cm-acme-http-solver-xh2nl" does not have any active Endpoint.
...
1732761440268	E1128 02:37:20.268602       7 leaderelection.go:347] error retrieving resource lock ingress-nginx/ingress-nginx-leader: Get "https://10.32.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-nginx/leases/ingress-nginx-leader": dial tcp 10.32.0.1:443: connect: no route to host
...
1732761816000	name=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a kind=Node eventRV=2063657 reportinginstance=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a reportingcontroller=kubelet sourcecomponent=kubelet sourcehost=scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a reason=NodeReady type=Normal count=1 msg="Node scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a status is now: NodeReady" 
...
1732761817070	name=library-frontend-deployment-85f4d99874-ndw67 kind=Pod action=Binding objectAPIversion=v1 objectRV=24923410571 eventRV=2063527 reportinginstance=default-scheduler-control-plane-5f4f8f8cb9-6rfsz reportingcontroller=default-scheduler reason=Scheduled type=Normal msg="Successfully assigned zim/library-frontend-deployment-85f4d99874-ndw67 to scw-kiwix-prod-foreign-e61f0a58d1cf43adafde59a" 
...
1732761818000	name=library-frontend-deployment-85f4d99874-ndw67 kind=Pod objectAPIversion=v1 eventRV=2063561 reportingcontroller=taint-eviction-controller sourcecomponent=taint-eviction-controller reason=TaintManagerEviction type=Normal count=1 msg="Cancelling deletion of Pod zim/library-frontend-deployment-85f4d99874-ndw67" 
...
1732761852139	2024/11/28 02:44:12 [error] 105742#105742: *527129157 connect() failed (111: Connection refused) while connecting to upstream, client: 111.119.213.240, server: library.kiwix.org, request: "GET /content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete HTTP/2.0", upstream: "http://100.64.0.212:80/content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete", host: "library.kiwix.org"
1732761852139	111.119.213.240 - - [28/Nov/2024:02:44:12 +0000] "GET /content/stackoverflow.com_en_all_nopic_2022-07/questions/37101359/how-to-create-task-in-loop-and-wait-all-task-complete HTTP/2.0" 502 552 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 417 0.000 [zim-library-frontend-service-http] [] 100.64.0.212:80, 100.64.0.212:80, 100.64.0.212:80 0, 0, 0 0.000, 0.000, 0.000 502, 502, 502 da4c7c2614100b90147df0d40a0cc08d
...
W1128 02:44:18.289015       7 controller.go:1216] Service "zim/library-frontend-service" does not have any active Endpoint.
...
1732761859695	138.207.227.14 - - [28/Nov/2024:02:44:19 +0000] "GET /release/kiwix-macos/kiwix-macos_3.5.2.dmg HTTP/1.1" 302 296 "-" "Homebrew/4.4.8 (Macintosh; arm64 Mac OS X 15.1.1) curl/8.7.1" 193 2.490 [zim-mirrorbrain-web-service-80] [] 100.64.0.199:80 296 2.490 302 fbeaa76c0a316fc7f58e3437680238a3
1732761859738	129.222.87.171 - - [28/Nov/2024:02:44:19 +0000] "GET /zim/zimit/tonedear.com_en_2024-09.zim HTTP/1.1" 302 292 "-" "Wget/1.21" 218 1.253 [zim-mirrorbrain-web-service-80] [] 100.64.0.199:80 292 1.253 302 b6a83185563a1bc2a1854dd9f5708047
1732761859772	111.119.236.224 - - [28/Nov/2024:02:44:19 +0000] "GET /content/stackoverflow.com_en_bard_2021-11/questions/10235011/how-are-interrupts-handled-on-smp HTTP/2.0" 200 9209 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 403 1.815 [zim-dev-library-service-80] [] 100.64.0.214:80 9209 1.815 200 b47dbccdea3dd3eeb2484bc982e70e0e

[screenshot: CPU and memory of the node]

[screenshot: all pods started at the same time]

[screenshot: Hetzner traffic graph]


Based on this, my initial hypothesis is that there has been a network issue at Hetzner. We are indeed in the network maintenance window, but this doesn't look like maintenance.
In any case: a network issue (maybe just on the route to Scaleway), then the node cannot contact the CP, so its pods get terminated. Traffic goes down because we don't reply. Then the node can contact the CP again, all the pods are restarted, and traffic comes back.

This needs to be debated of course, but if confirmed, a mitigation could be to prevent the node.kubernetes.io/not-ready taint from deleting pods (apparently the TaintManagerEviction feature).
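For reference, a minimal sketch of that mitigation, assuming we apply it per workload (the 3600 s value is an arbitrary placeholder to debate, not a recommendation). Kubernetes injects `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` tolerations with `tolerationSeconds: 300` into every pod by default, which matches the ~5 min gap between NodeNotReady (02:32:04) and the first TaintManagerEviction (02:37:12) above; declaring the tolerations explicitly overrides that window.

```yaml
# Sketch only: add under spec.template.spec of the Deployments we want to keep
# on the node while it is flagged not-ready/unreachable, instead of having the
# taint-eviction-controller delete their pods after the default 300 s.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 3600   # placeholder; omit tolerationSeconds to tolerate the taint indefinitely
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 3600   # placeholder
```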

@benoit74 (Collaborator) commented:

It happened again from 9:05 PM (still on 28-11). But this time none of the pods were restarted (probably only because the connectivity loss was shorter / it was more connectivity instability than a total loss).

Luckily I was behind my keyboard hacking on some personal stuff, so I briefly connected to have a look, and this is what I observed:

  • SSH connection to the storage machine was possible but slow / laggy
  • even ping was unstable (not always responding)
  • when requesting a web property in my browser (e.g. https://library.kiwix.org), the nginx ingress very often responded but returned a 503 (meaning it had no idea where the pod was located)
  • we had a few jobs (e.g. dev library generation) that were pending because the node was tainted
  • I had the same difficulties connecting to my own personal machine, which is also at Hetzner Helsinki (but in a different DC)
  • I did not try to connect to the mirrors-qa machine (it could have been useful to confirm where the connectivity issue lies)

I hence confirm the hypothesis of networking issues. To me, this is more likely related to https://status.hetzner.com/incident/ec8a2f28-e964-46cb-94fa-bb4629c19777 and the international situation between Finland and Germany in general (see e.g. https://www.reuters.com/business/media-telecom/telecoms-cable-linking-finland-germany-likely-severed-owner-says-2024-11-18/).

That being said, we should be more resilient to this kind of connectivity loss, especially given our very simple architecture (it is not as if pods that are not running on one node could be restarted by the CP on another node, which is the main reason for the current behavior).

In addition to preventing the taint from deleting pods (which I agree is an interesting mitigation), I think we should also take action so that the nginx ingress does not stop serving traffic when the CP is not reachable. In our architecture, losing connectivity between the kubelet and the CP should not prevent normal node operation. I have found no way so far to make this work with "standard" k8s, which assumes very good connectivity between the CP and nodes and favors moving workloads to new nodes when connectivity to a node is lost. I'm pretty sure something less centralized like a k3s distribution would do a better job here. Disclaimer: I have no strong expertise with k3s at all; it's probably something to investigate gradually. And there may be other, better solutions; this is just a "popular" one.
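On the eviction-delay angle, and purely for reference: the 300 s default comes from the kube-apiserver flags `--default-not-ready-toleration-seconds` and `--default-unreachable-toleration-seconds`, which control the tolerations injected into every pod. Raising them cluster-wide would be an alternative to per-workload tolerations, but it assumes we can change the API server configuration, which is probably not possible with the managed Scaleway control plane; the sketch below is of a self-managed kube-apiserver static pod manifest, assumptions only.

```yaml
# Only applicable if the kube-apiserver flags are under our control
# (likely not the case on a managed control plane). Excerpt of a
# kube-apiserver static pod manifest; values are placeholders to debate.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --default-not-ready-toleration-seconds=3600
        - --default-unreachable-toleration-seconds=3600
        # ... other existing flags unchanged
```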
