Kafka operator fails to create new Kafka and ZK pods #8516

bderrly · 2023-05-15T22:17:47Z

Over the past few weeks we have noticed a worrying situation in which the strimzi-operator does not create a new pod to replace a missing one. This just caused both ZooKeeper and, as a result, Kafka to crash in our staging cluster. If we delete the strimzi-operator pod, the missing pods are created and the cluster is quickly brought back to health. In this case, the missing pods were all three ZooKeeper pods and all three Kafka broker pods.

We recently upgraded from 0.27.1 to 0.34.0 and did not explicitly disable the new StrimziPodSet feature during the upgrade. I am curious whether this new feature is playing a part. That said, I have found precious little in the logs to indicate the failure. As you can see below, the general flow is that a reconciliation begins and a pod is restarted (in this case due to a k8s node pool upgrade), but the pod is never scheduled.

Listing the pods shows the pod missing from the list entirely, rather than listed as waiting to be scheduled.
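
The "missing from the listing" symptom can be checked mechanically. This is a minimal sketch, assuming pod names follow the `<cluster>-zookeeper-<n>` and `<cluster>-kafka-<n>` pattern seen in the log snippet, and that the current pod names were collected separately (for example with `kubectl get pods -n humio -o name`). The function name `missing_pods` is hypothetical and not part of Strimzi.

```python
def missing_pods(cluster, replicas, listed):
    """Return the expected pod names that are absent from a pod listing.

    cluster  -- name of the Kafka custom resource, e.g. "humio"
    replicas -- mapping of component ("zookeeper", "kafka") to replica count
    listed   -- pod names from the listing; a "pod/" prefix is tolerated
    """
    # Normalise "pod/humio-kafka-0" and "humio-kafka-0" to the bare name.
    present = {name.strip().split("/", 1)[-1] for name in listed}
    # Assumed naming scheme: <cluster>-<component>-<index>.
    expected = [
        f"{cluster}-{component}-{index}"
        for component, count in replicas.items()
        for index in range(count)
    ]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    listing = ["pod/humio-zookeeper-1", "pod/humio-zookeeper-2", "pod/humio-kafka-0"]
    for name in missing_pods("humio", {"zookeeper": 3, "kafka": 3}, listing):
        print("missing:", name)
```

A pod that is Pending would still appear in the listing, so a non-empty result here points at the pod never having been created rather than at a scheduling problem.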

Log snippet follows:

2023-05-15 17:27:47 INFO  AbstractOperator:239 - Reconciliation #2094(timer) Kafka(humio/humio): Kafka humio will be checked for creation or modification
2023-05-15 17:27:48 INFO  ZookeeperLeaderFinder:220 - Reconciliation #2094(timer) Kafka(humio/humio): Pod humio-zookeeper-2 is not a leader
2023-05-15 17:27:48 INFO  ZookeeperLeaderFinder:220 - Reconciliation #2094(timer) Kafka(humio/humio): Pod humio-zookeeper-0 is not a leader
2023-05-15 17:27:48 INFO  ZookeeperLeaderFinder:217 - Reconciliation #2094(timer) Kafka(humio/humio): Pod humio-zookeeper-1 is leader
2023-05-15 17:27:48 INFO  ZooKeeperRoller:142 - Reconciliation #2094(timer) Kafka(humio/humio): Rolling pod humio-zookeeper-0 due to [manual rolling update annotation on a pod]
2023-05-15 17:27:48 INFO  PodOperator:54 - Reconciliation #2094(timer) Kafka(humio/humio): Rolling pod humio-zookeeper-0
2023-05-15 17:28:47 INFO  AbstractOperator:380 - Reconciliation #2094(timer) Kafka(humio/humio): Reconciliation is in progress
2023-05-15 17:29:47 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace kafka-operator
2023-05-15 17:29:47 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace humio
2023-05-15 17:29:47 INFO  AbstractOperator:380 - Reconciliation #2094(timer) Kafka(humio/humio): Reconciliation is in progress
2023-05-15 17:30:47 INFO  AbstractOperator:380 - Reconciliation #2094(timer) Kafka(humio/humio): Reconciliation is in progress
2023-05-15 17:31:47 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace kafka-operator
2023-05-15 17:31:47 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace humio
2023-05-15 17:31:47 INFO  AbstractOperator:380 - Reconciliation #2094(timer) Kafka(humio/humio): Reconciliation is in progress
2023-05-15 17:32:47 INFO  AbstractOperator:380 - Reconciliation #2094(timer) Kafka(humio/humio): Reconciliation is in progress
2023-05-15 17:32:51 ERROR Util:166 - Reconciliation #2094(timer) Kafka(humio/humio): Exceeded timeout of 300000ms while waiting for Pods resource humio-zookeeper-0 in namespace humio to be ready
2023-05-15 17:32:51 ERROR AbstractOperator:260 - Reconciliation #2094(timer) Kafka(humio/humio): createOrUpdate failed
io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource humio-zookeeper-0 in namespace humio to be ready
	at io.strimzi.operator.common.Util$1.lambda$handle$1(Util.java:167) ~[io.strimzi.operator-common-0.34.0.jar:0.34.0]
	at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
	at io.vertx.core.impl.future.FutureBase.lambda$emitFailure$1(FutureBase.java:69) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-05-15 17:32:52 INFO  CrdOperator:133 - Reconciliation #2094(timer) Kafka(humio/humio): Status of Kafka humio in namespace humio has been updated
2023-05-15 17:32:52 WARN  AbstractOperator:525 - Reconciliation #2094(timer) Kafka(humio/humio): Failed to reconcile io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource humio-zookeeper-0 in namespace humio to be ready
	at io.strimzi.operator.common.Util$1.lambda$handle$1(Util.java:167) ~[io.strimzi.operator-common-0.34.0.jar:0.34.0]
	at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
	at io.vertx.core.impl.future.FutureBase.lambda$emitFailure$1(FutureBase.java:69) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
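
To spot this pattern in a longer operator log, the lines belonging to one reconciliation can be grouped by their `Reconciliation #N` tag. This is a rough sketch, assuming the line layout shown in the snippet above (timestamp, level, `Class:line`, then the message). The function name `summarize_reconciliations` and the summary keys are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2023-05-15 17:27:47 INFO  AbstractOperator:239 - Reconciliation #2094(timer) Kafka(humio/humio): ...
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<level>[A-Z]+)\s+\S+ - "
    r"Reconciliation #(?P<id>\d+)\((?P<trigger>[^)]*)\) (?P<resource>\S+): (?P<msg>.*)$"
)


def summarize_reconciliations(lines):
    """Group log lines by reconciliation id; report duration and ERROR messages."""
    summary = {}
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # stack-trace frames and untagged lines are skipped
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        entry = summary.setdefault(
            match["id"],
            {"first": ts, "last": ts, "in_progress": 0, "errors": []},
        )
        entry["last"] = ts
        if match["msg"] == "Reconciliation is in progress":
            entry["in_progress"] += 1
        if match["level"] == "ERROR":
            entry["errors"].append(match["msg"])
    # Seconds between the first and last tagged line of each reconciliation.
    for entry in summary.values():
        entry["seconds"] = (entry["last"] - entry["first"]).total_seconds()
    return summary


if __name__ == "__main__":
    import sys

    for rec_id, entry in summarize_reconciliations(sys.stdin).items():
        print(rec_id, entry["seconds"], entry["in_progress"], entry["errors"])
```

A reconciliation with several "in progress" entries that ends in an "Exceeded timeout" error, as in the snippet above, is the signature to look for.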

bderrly · 2023-05-15T23:00:12Z

Author

After further investigation, I can see that there are zero POST requests to the Kubernetes API server until the strimzi-operator pod has been deleted and rescheduled. Immediately after it starts up there are six POST requests in quick succession (three for Kafka, three for ZooKeeper), and the pods are soon scheduled and running.

scholzj · 2023-05-15T23:01:31Z

Maintainer

Some other users who provided full logs appeared to hit an issue where the informer that notifies the clients about Pod events had died. Without full logs it is not clear whether that is your case, but it sounds the same. You should try upgrading to 0.35, which updates the Kubernetes client (the previous version had some bugs) and improves the handling around it.

5 replies

bderrly May 15, 2023
Author

Were there some other issues or discussion threads about this? I searched but I didn't find anything that looked the same.

scholzj May 15, 2023
Maintainer

I think there were some discussions here as well as on Slack.

celvin Oct 16, 2024

It's 2024 and this is still happening :'(

Triggering periodic reconciliation for namespace strimzi
2024-10-16 02:06:15 INFO AbstractWatchManager:442 - Watch connection error recieved 34 times without progress, will reconnect if possible
java.net.SocketException: Connection reset

scholzj Oct 16, 2024
Maintainer

Yes, and I expect that connection and networking issues will be happening even in 2025 ... and probably even after that. You shared a completely generic error, which could mean almost anything.

celvin Oct 17, 2024

Sorry for being so vague. I did a clean installation of the operator, with nothing else installed, and that error message starts to appear the next day or after a few hours, even though all the containers are running properly. When I then install the Debezium plugin, it doesn't work. The message I pasted is only a small portion of the log message that was originally posted in this thread.
