[Bug]: Kafka pods are crashing and zookeeper reports unresolved address exception after machine restarts #10181
-
I mistakenly tagged this issue as a Bug. Could it please be changed to a Question?
-
The issue is not reproducible with 3 nodes, but it does occur when a single ZooKeeper instance runs on a single-node or two-node machine setup. Can this issue be investigated for a single ZooKeeper replica?
-
This looks like it could be helpful: apache/zookeeper#2040. I think it is not included in the ZooKeeper versions that Strimzi runs?
-
Bug Description
After a machine/VM restart, the Kafka pods crash and the ZooKeeper pod reports an Unresolved address exception.
This started occurring after we moved from Kafka 3.4.0 to 3.7.0, which is supported by Strimzi Operator 0.40 (0.40.0-kafka-3.7.0). In the Strimzi operator log we can see a Session lost/Expired exception.
The issue is intermittent: it happens roughly 6 out of 10 times after a machine restart. We have seen it only since moving Kafka from 3.4.0 to a higher version, an upgrade we had to make because Strimzi operator 0.40.0-kafka-3.7.0 does not support the previous Kafka version 3.4.0. As the issue is more prominent with newer versions of Strimzi/Kafka, could this please be looked into?
The workaround we used was to restart the ZooKeeper pods and then the Kafka pods (if they did not come up automatically). We had to restart the ZooKeeper pod multiple times. Another workaround was to uninstall and reinstall Kafka.
Steps to reproduce
Expected behavior
Kafka pods should not crash, and ZooKeeper should not report an Unresolved address exception, after a machine/VM restart.
Strimzi version
0.40
Kubernetes version
v1.29.1+rke2r1
Installation method
Helm Chart
Infrastructure
AWS EC2
Configuration files and logs
zookeeper-log.txt
strimzi-operator.txt
kafka-log.txt
kafka-entity-operator.txt
kafka-kafka-exporter.txt
Additional context
Zookeeper Exception:
2024-05-31 06:22:46,870 INFO Created server with tickTime 500 ms minSessionTimeout 1000 ms maxSessionTimeout 10000 ms clientPortListenBacklog -1 datadir /var/lib/zookeeper/data/version-2 snapdir /var/lib/zookeeper/data/version-2 (org.apache.zookeeper.server.ZooKeeperServer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
2024-05-31 06:22:46,870 ERROR Couldn't bind to kafka-cluster-zookeeper-0.kafka-cluster-zookeeper-nodes.foundation-env-default.svc/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.SocketException: Unresolved address
at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
at org.apache.zookeeper.server.quorum.Leader.createServerSocket(Leader.java:322)
at org.apache.zookeeper.server.quorum.Leader.lambda$new$0(Leader.java:301)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:304)
at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1340)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551)
2024-05-31 06:22:46,870 WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
java.io.IOException: Leader failed to initialize any of the following sockets: [kafka-cluster-zookeeper-0.kafka-cluster-zookeeper-nodes.foundation-env-default.svc/<unresolved>:2888]
at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:307)
at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1340)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551)
2024-05-31 06:22:46,870 INFO Peer state changed: looking (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeermyid=1(secure=[0:0:0:0:0:0:0:0]:2181)]
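For anyone trying to understand the trace above: the `java.net.SocketException: Unresolved address` comes from `ServerSocket.bind` being handed an `InetSocketAddress` whose hostname was never resolved to an IP. The snippet below is only a minimal sketch of that JDK behaviour, not ZooKeeper's or Strimzi's actual code. The class name `UnresolvedBindDemo`, the helper `bindMessage`, and the hostname `zk-0.example.invalid` are made up for illustration, and `createUnresolved` is used here to simulate a DNS lookup that failed at the moment the address object was built.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketException;

// Hypothetical demo class; it is not part of ZooKeeper or Strimzi.
class UnresolvedBindDemo {

    /**
     * Tries to bind a server socket to an address that is deliberately left
     * unresolved and returns the exception message, or "bound" on success.
     */
    static String bindMessage(String host, int port) {
        // createUnresolved() skips the DNS lookup, which mimics a hostname
        // that could not be resolved when the address object was created.
        InetSocketAddress address = InetSocketAddress.createUnresolved(host, port);
        try (ServerSocket socket = new ServerSocket()) {
            // bind() rejects unresolved addresses before touching the network.
            socket.bind(address);
            return "bound";
        } catch (SocketException e) {
            return e.getMessage();
        } catch (IOException e) {
            return "other I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder hostname; the pod's real DNS name is not needed to see the error.
        System.out.println(bindMessage("zk-0.example.invalid", 2888));
    }
}
```

If that is what happens here, the address object stays unresolved for as long as it is reused, which would be consistent with (though not proof of) a pod restart eventually helping once the pod's DNS name becomes resolvable again.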