Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[system test] make Recovery tests run only on KRaft mode only #10637

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

see-quick
Copy link
Member

@see-quick see-quick commented Sep 24, 2024

Type of change

  • Enhancement / new feature
  • Refactoring

Description

This PR changes our recovery tests so they can run with KRaft mode.

Checklist

  • Write tests
  • Make sure all tests pass

@see-quick see-quick added this to the 0.44.0 milestone Sep 24, 2024
@see-quick see-quick requested a review from a team September 24, 2024 14:33
@see-quick see-quick self-assigned this Sep 24, 2024
@scholzj
Copy link
Member

scholzj commented Sep 24, 2024

Why would you want to run them in ZooKeeper mode only?

@see-quick
Copy link
Member Author

Why would you want to run them in ZooKeeper mode only?

Well, some of the test cases do not work with KRaft here (e.g., testTopicNotAvailable, testTopicAvailable and more...). I don't have any strong opinion on this but for now, I think having these tests run only at least ZK mode is okay I think. Maybe @henryZrncik or @im-konge knows more about this.

@scholzj
Copy link
Member

scholzj commented Sep 25, 2024

Why would you want to run them in ZooKeeper mode only?

Well, some of the test cases do not work with KRaft here (e.g., testTopicNotAvailable, testTopicAvailable and more...). I don't have any strong opinion on this but for now, I think having these tests run only at least ZK mode is okay I think. Maybe @henryZrncik or @im-konge knows more about this.

So, should it be fixed? Extended? Deleted? ZooKeeper will be gone soon and any ZooKeeper only tests will be deleted with it.

@see-quick
Copy link
Member Author

Why would you want to run them in ZooKeeper mode only?

Well, some of the test cases do not work with KRaft here (e.g., testTopicNotAvailable, testTopicAvailable and more...). I don't have any strong opinion on this but for now, I think having these tests run only at least ZK mode is okay I think. Maybe @henryZrncik or @im-konge knows more about this.

So, should it be fixed? Extended? Deleted? ZooKeeper will be gone soon and any ZooKeeper only tests will be deleted with it.

I mean if those tests could run also on KRaft maybe we should update them...I think @henryZrncik and @fvaleri were trying to somehow fix it but I am not sure if that's possible (it was something related to UTO).

@scholzj
Copy link
Member

scholzj commented Sep 25, 2024

Why would you want to run them in ZooKeeper mode only?

Well, some of the test cases do not work with KRaft here (e.g., testTopicNotAvailable, testTopicAvailable and more...). I don't have any strong opinion on this but for now, I think having these tests run only at least ZK mode is okay I think. Maybe @henryZrncik or @im-konge knows more about this.

So, should it be fixed? Extended? Deleted? ZooKeeper will be gone soon and any ZooKeeper only tests will be deleted with it.

I mean if those tests could run also on KRaft maybe we should update them...I think @henryZrncik and @fvaleri were trying to somehow fix it but I am not sure if that's possible (it was something related to UTO).

Right, but we need to understand what exactly the problem is. UTO is now used everywhere. So not sure why would that make a difference Zoo versus KRaft.

Copy link
Contributor

@henryZrncik henryZrncik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for PR!

IMHO Changes in the RecoveryST would need a few more steps to run correctly.

Regarding NamespaceDeletionRecovery as this tests represent steps done in documentation it would be nice to keep these tests (referencing to removing zookeeper related tests) it would be probably nice to find out what is the cause of problem in kraft after you resolved problem with incorrect zookeeper configuration.

@@ -71,7 +66,6 @@ void testRecoveryFromKafkaStrimziPodSetDeletion() {
}

@IsolatedTest("We need for each test case its own Cluster Operator")
@KRaftNotSupported("Zookeeper is not supported by KRaft mode and is used in this test class")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding allowing tests in kraft:

  • tests with services do not make much sense in kraft as these services are no longer present.
  • test of pod sets would make sense but it would still fail as currently we target specifically zookeeper so this needs adjustment to target rather controller by selector. e.g. to target strimzi pod set for controller my-cluster-7-c-9aa820e3 instead of zookeeper it expects something like my-cluster-7-zookeeper.

Comment on lines 199 to 219
resourceManager.createResourceWithWait(
NodePoolsConverter.convertNodePoolsIfNeeded(
KafkaNodePoolTemplates.brokerPoolPersistentStorage(testStorage.getNamespaceName(), testStorage.getBrokerPoolName(), testStorage.getClusterName(), 3)
.editSpec()
.withNewPersistentClaimStorage()
.withSize("1Gi")
.withStorageClass(storageClassName)
.endPersistentClaimStorage()
.endSpec()
.build(),
KafkaNodePoolTemplates.controllerPoolPersistentStorage(testStorage.getNamespaceName(), testStorage.getControllerPoolName(), testStorage.getClusterName(), 3)
.editSpec()
.withNewPersistentClaimStorage()
.withSize("1Gi")
.withStorageClass(storageClassName)
.endPersistentClaimStorage()
.endSpec()
.build()
)
);
resourceManager.createResourceWithWait(KafkaTemplates.kafkaPersistent(testStorage.getNamespaceName(), testStorage.getClusterName(), 3, 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test testTopicNotAvailable actually fails with kraft on a step of recreating Kafka Cluster after namespace is deleted. All to be recreated pods get into crash-loop as they try to connect to incorrect cluster.

For example:

Exception in thread “main” java.lang.RuntimeException: Invalid cluster.id in: /var/lib/kafka/data/kafka-log1/meta.properties. Expected 28c5tB6tSlCjEHep6l3Jww, but read TRrzEInZRgCs_R9iNz2Gkw at org.apache.kafka.metadata.properties.MetaPropertiesEnsemble.verify(MetaPropertiesEnsemble.java:509) 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a bug to me @scholzj @fvaleri ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@see-quick finally I was able to have a look at this.

  1. deploy the Kafka cluster without Topic Operator - otherwise topics will be deleted

This is not needed anymore with the new Topic Operator, so it can be removed from the code.


To make this test work on my local Minikube I had to update the StorageClass.VolumeBindingMode from WaitForFirstConsumer to Immediate. Is there any Minikube configuration you are doing here? Is this documented somewhere?


I was able to reproduce the issue mentioned by @henryZrncik:

Exception in thread "main" java.lang.RuntimeException: Invalid cluster.id in: /var/lib/kafka/data/kafka-log0/meta.properties. Expected kQHv733NQIew9aw9uCXnDA, but read 2DLef4_8TqOVdBjzCORB1Q

I also think this is a bug, so I raised #10722. The attached fix should work with this test. Hope it helps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a bug, it means that this was not set when recovering the Kafka CR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the test when I find some time thanks 💯

@see-quick
Copy link
Member Author

Thanks for PR!

IMHO Changes in the RecoveryST would need a few more steps to run correctly.

Regarding NamespaceDeletionRecovery as this tests represent steps done in documentation it would be nice to keep these tests (referencing to removing zookeeper related tests) it would be probably nice to find out what is the cause of problem in kraft after you resolved problem with incorrect zookeeper configuration.

Thanks for investigating this problem. So let's do the following:

  1. Remove ZK-only tests, which can't be run as KRAft I think we have 2 of them (i.e., testRecoveryFromZookeeperServiceDeletion and testRecoveryFromZookeeperHeadlessServiceDeletion).

And in the case of testTopicNotAvailable, I am unsure if that's a possible bug in KRaft? 🤔 I will double-check this one.

So in summary I think making those tests KRaft-only compatible would be better than just ZK (because of the mentioned issue that ZK will be removed soon).

@see-quick see-quick changed the title [system test] make Recovery tests run only on ZK mode only [system test] make Recovery tests run only on KRaft mode only Oct 11, 2024
@scholzj scholzj modified the milestones: 0.44.0, 0.45.0 Oct 21, 2024
@see-quick
Copy link
Member Author

I have updated the tests to match the recovery procedure. The tests are passing now with such a change.

@see-quick
Copy link
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.17 --install-type=bundle --profile=recovery

@strimzi-ci
Copy link

▶️ Build started - check Jenkins for more info. ▶️

@see-quick
Copy link
Member Author

@strimzi-ci run tests --cluster-type=ocp --install-type=bundle --profile=recovery

@strimzi-ci
Copy link

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci
Copy link

✔️ Test Summary ✔️

TEST_PROFILE: recovery
GROUPS:
TEST_CASE:
TOTAL: 2
PASS: 2
FAIL: 0
SKIP: 0
BUILD_NUMBER: 7
OCP_VERSION: 4.17
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm

Copy link
Contributor

@henryZrncik henryZrncik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking on this. Just one change is needed i guess ?

Signed-off-by: see-quick <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants