Add stop_ongoing_execution flag to rebalance requests for full run #10703

tinaselenge · 2024-10-11T15:52:14Z

Type of change

Select the type of your PR

Bugfix

Description

This adds an option to set the stop_ongoing_execution flag.
The new flag is set when sending full-run rebalance requests, to avoid the "Cannot start a new execution while there is an ongoing execution" error. This error can happen even if the operator calls CC's /stop_proposal_execution endpoint before sending a new rebalance request. The reason for this is documented here as:

Note that Cruise Control does not wait for the ongoing batch to finish when it stops execution, i.e. the in-progress batch may still be running after Cruise Control stops the execution.

Here is the possible flow that causes this issue:

1. The KafkaRebalance CR for removing brokers is currently in the Rebalancing state.
2. The user updates the CR with a refresh annotation and a new set of brokers to remove.
3. The operator sends a request to the /stop_proposal_execution endpoint to stop the current rebalance operation.
4. The request completes successfully, but CC still has an in-progress batch for the current rebalance operation.
5. The operator sends a request for a new proposal for the updated set of brokers to remove.
6. The new proposal is ready, so the KafkaRebalance moves to the ProposalReady state.
7. The operator sends a request to execute the removal of the updated set of brokers.
8. This request fails because the previous rebalance operation still has an in-progress batch.
9. The KafkaRebalance moves to the NotReady state due to this failure.
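
The failure flow above can be modelled with a toy executor. This is only a sketch of the behaviour described in the steps, not Cruise Control's real implementation: the class `ToyExecutor` and all of its methods are hypothetical names, and "time passing" is reduced to an explicit `tick()` call.

```java
/** Toy stand-in for Cruise Control's executor (hypothetical, for illustration only). */
final class ToyExecutor {
    private int inFlightTasks;        // tasks of the batch that is currently running
    private boolean stopRequested;    // set by the stop endpoint, cleared by a new execution

    ToyExecutor(int inFlightTasks) {
        this.inFlightTasks = inFlightTasks;
    }

    /** Models /stop_proposal_execution: returns at once, the running batch is not awaited. */
    void stopProposalExecution() {
        stopRequested = true;
    }

    boolean isStopRequested() {
        return stopRequested;
    }

    /** Models time passing: one task of the in-progress batch finishes. */
    void tick() {
        if (inFlightTasks > 0) {
            inFlightTasks--;
        }
    }

    boolean hasOngoingExecution() {
        return inFlightTasks > 0;
    }

    /** Models a full-run request sent without stop_ongoing_execution (step 7). */
    String execute() {
        if (hasOngoingExecution()) {
            // Step 8: the error quoted in the description.
            throw new IllegalStateException(
                "Cannot start a new execution while there is an ongoing execution");
        }
        stopRequested = false;
        inFlightTasks = 1;
        return "started";
    }
}
```

In this model, calling `stopProposalExecution()` and then `execute()` straight away throws, while calling `execute()` after enough `tick()` calls succeeds, which is why the failure depends on timing.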

The fix is to change step 7 so that the request is sent with the "stop_ongoing_execution" flag set to true. CC then first stops the still in-progress batch of the previous rebalance operation before processing the new rebalance request. The flow would become:

7. The operator sends a request to execute the removal of the updated set of brokers with the "stop_ongoing_execution" flag set to true.
8. CC waits for the still in-progress batch of the previous rebalance operation to stop.
9. CC processes the request to execute the removal of the updated set of brokers.
10. The KafkaRebalance is in the Ready state.
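
A minimal sketch of what the changed request could look like on the wire. The `remove_broker` endpoint, the `/kafkacruisecontrol` prefix, and the `brokerid`, `dryrun`, `json` and `stop_ongoing_execution` parameter names are given as I understand Cruise Control's REST API (note that CC spells the parameter `dryrun`, while this discussion writes dry_run); check them against the deployed CC version. The class `RemoveBrokerRequest` is a hypothetical helper and is not the operator's actual request builder.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that builds the path and query for a remove_broker request. */
final class RemoveBrokerRequest {
    private RemoveBrokerRequest() { }

    static String path(List<Integer> brokerIds, boolean dryRun, boolean stopOngoingExecution) {
        String ids = brokerIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        StringBuilder sb = new StringBuilder("/kafkacruisecontrol/remove_broker")
                .append("?brokerid=").append(ids)
                .append("&dryrun=").append(dryRun)
                .append("&json=true");
        // The flag only matters for a full run; a dry run does not start an execution.
        if (!dryRun && stopOngoingExecution) {
            sb.append("&stop_ongoing_execution=true");
        }
        return sb.toString();
    }
}
```

For example, `RemoveBrokerRequest.path(List.of(3, 4), false, true)` yields `/kafkacruisecontrol/remove_broker?brokerid=3,4&dryrun=false&json=true&stop_ongoing_execution=true`, while the same call with `dryRun` set to true omits the flag.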

There were no test cases for the existing flow, where the operator calls /stop_proposal_execution because the refresh annotation is applied while the KafkaRebalance is in the Rebalancing state. These missing test cases are added; however, they do not test the new flow, because we cannot mock the internal state of CC to have a still in-progress batch.

Closes #10571

Checklist

Please go through this checklist and make sure all applicable tasks have been done

Write tests
Make sure all tests pass
Update documentation
Check RBAC rights for Kubernetes / OpenShift roles
Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
Reference relevant issue(s) and close them after merging
Update CHANGELOG.md
Supply screenshots for visual changes, such as Grafana dashboards

ppatierno

Mostly good, I left one comment.
Other than that, could you please add one more test to the KafkaRebalanceStateMachineTest class?

ppatierno · 2024-10-16T13:37:48Z

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

@@ -862,7 +862,7 @@ private Future<MapAndStatus<ConfigMap, KafkaRebalanceStatus>> onProposalReady(Re
                    return configMapOperator.getAsync(kafkaRebalance.getMetadata().getNamespace(), kafkaRebalance.getMetadata().getName()).compose(loadmap -> Future.succeededFuture(new MapAndStatus<>(loadmap, buildRebalanceStatusFromPreviousStatus(kafkaRebalance.getStatus(), StatusUtils.validate(reconciliation, kafkaRebalance)))));
                case approve:
                    LOGGER.debugCr(reconciliation, "Annotation {}={}", ANNO_STRIMZI_IO_REBALANCE, KafkaRebalanceAnnotation.approve);
-                    return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder);
+                    return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder, true);


You are using the stop ongoing execution flag only when we are approving a proposal... what about refreshing while it's in ProposalPending or Rebalancing? (Of course, refreshing in ProposalReady doesn't make sense because there is nothing going on.)

Why do we need to stop ongoing execution when approving a proposal in the first place?

Good point ... what's the use case you saw @tinaselenge ?

So this is my understanding of the code based on my recreation of the problem; please correct me if I'm wrong. The issue mainly happens when you refresh while the KafkaRebalance is in the Rebalancing state (let's say rebalance_1 is in progress). When the refresh annotation is applied, we send a request to stop the ongoing rebalance operation; however, this does not stop rebalance_1 completely, and that's the problem. Immediately after the stop request completes, we send a request for a new proposal (dry_run=true), and then the state becomes ProposalReady. If auto-approval is set or the user manually approves this new proposal, we send a new rebalance request (dry_run=false); let's call it rebalance_2. However, the request for rebalance_2 fails if rebalance_1 is still in progress. This change makes sure that rebalance_1 is completely stopped before the request for rebalance_2 is processed. I tested this only with the auto-approval annotation; however, I think it can also happen with manual approval, if the user approves quickly and rebalance_1 takes a long time.

I don't think we need this flag set for the ProposalPending or Rebalancing states, because when the refresh annotation is applied in these two states, we only send a request for a new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests.

Thanks @tinaselenge, what you described is what I faced during auto-rebalance, where "auto-approval" is set and people can ask for two consecutive scale up/down operations, which result in two consecutive rebalancing requests.

> I don't think we need this flag set for ProposalPending or Rebalancing states, because when refresh annotation is applied while on these 2 states, we only send a request for new proposal (dry_run=false). We only need this flag set when we send dry_run=true requests.

Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."

Because in ProposalPending we obviously ask for a new ... proposal with dry_run=true, not false.
Anyway, even when we ask for a new proposal while a proposal is still processing, we should be sure that CC doesn't reply the same way, so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking in the CC codebase). The same applies in Rebalancing, where we ask for a new proposal (dry_run=true) if refresh is applied.

Long story short, the question is: is the stop ongoing execution flag valid only when you run an actual rebalance, or also when you ask CC to process a proposal?

> Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."

Yes, I meant the other way around. Sorry!

> Because in ProposalPending we obviously ask for a new ... proposal with dry_run=true, not false.
> Anyway, even when we ask for a new proposal while a proposal is still processing, we should be sure that CC doesn't reply the same way, so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking in the CC codebase). The same applies in Rebalancing, where we ask for a new proposal (dry_run=true) if refresh is applied.

I agree that we should stop the ongoing execution when we ask for a new proposal due to the "refresh" annotation, so that the optimisation calculation is up to date. And we already have this logic; see the "onRebalancing()" function. In this function, the operator first calls the stop endpoint and then requests a new proposal. The issue is that CC can sometimes still have an in-progress batch internally, even though the stop endpoint was called. I can spend some time looking into the CC code.

> Long story short, the question is: is the stop ongoing execution flag valid only when you run an actual rebalance, or also when you ask CC to process a proposal?

It seems to be effective when we run an actual rebalance, not when we ask for a proposal.

> The issue is sometimes CC internally can still have in progress batch, even though stop endpoint was called.

So it means that even if we call stopExecution, we should then call requestRebalance with stop_ongoing_execution=true, even in onRebalancing. At this point, I wonder if we should remove the stopExecution call entirely and just pass the corresponding flag.

Could it still be beneficial to keep the stopExecution call, because it gets called before requesting a new proposal? The flag is set in a later call that requests an actual rebalance operation, but the existing rebalance might already have been stopped by the earlier stopExecution call; in case it has not, the flag helps.
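
The two options discussed in this thread can be compared as request sequences. This is a rough sketch under assumptions: the `stop_proposal_execution` and `rebalance` endpoint names, the `/kafkacruisecontrol` prefix and the query parameters are given as I understand CC's REST API, and `RefreshSequence` is a hypothetical class that only lists the requests instead of sending them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listing of the requests sent for "refresh while Rebalancing, then approve". */
final class RefreshSequence {
    static final String BASE = "/kafkacruisecontrol"; // assumed default CC prefix

    private RefreshSequence() { }

    static List<String> requests(boolean keepStopCall) {
        List<String> out = new ArrayList<>();
        if (keepStopCall) {
            // Best effort: in many cases this has stopped the old rebalance by the time of approval.
            out.add("POST " + BASE + "/stop_proposal_execution?json=true");
        }
        // New proposal for the refreshed spec.
        out.add("POST " + BASE + "/rebalance?dryrun=true&json=true");
        // Safety net: covers the case where a batch of the old rebalance is still in progress.
        out.add("POST " + BASE + "/rebalance?dryrun=false&json=true&stop_ongoing_execution=true");
        return out;
    }
}
```

With `keepStopCall` set to true the sequence has three requests and the flag is only a fallback; with false, the old rebalance keeps running while the new proposal is computed and is stopped only by the final request.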

ppatierno · 2024-10-16T15:58:22Z

@tinaselenge could you rebase against main? I merged another PR fixing some reconciliation issues on KafkaRebalance.

Signed-off-by: Gantigmaa Selenge <[email protected]>

tinaselenge · 2024-10-17T10:55:24Z

Thanks for the review!
@ppatierno I have added a test case to KafkaRebalanceStateMachineTest and then rebased against main.

tinaselenge · 2024-10-17T11:30:35Z

I also updated the PR description with more details.

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

Signed-off-by: Gantigmaa Selenge <[email protected]>

ppatierno · 2024-10-22T10:38:28Z

/azp run regression

azure-pipelines · 2024-10-22T10:38:41Z

Azure Pipelines successfully started running 1 pipeline(s).

tinaselenge · 2024-10-25T10:35:32Z

This might not be the best way to solve this issue, as this change would result in stopping other unrelated executions, such as topic reconfiguration, whenever we request a full-run rebalance (regardless of the refresh annotation). It doesn't seem to be as straightforward as adding this flag. Ideally, we would add this flag only when a full rebalance request is sent after the refresh annotation is applied; however, we currently send only a dry-run rebalance request after the refresh annotation is applied. Once the dry-run rebalance is completed, we remove the refresh annotation and start a new reconciliation. In this new reconciliation, we request a full-run rebalance, but at this point we don't know whether there was a refresh annotation, so we can't set the flag based on it.
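
The limitation described above can be sketched as two separate decisions that share no state. Everything here is hypothetical and heavily simplified: `RefreshVisibility`, its enums and its methods are illustrative names, not the operator's code, and auto-approval is left out.

```java
/** Hypothetical model of what each reconciliation can see on the KafkaRebalance CR. */
final class RefreshVisibility {
    enum State { PROPOSAL_READY, REBALANCING }
    enum Annotation { NONE, APPROVE, REFRESH }

    private RefreshVisibility() { }

    /** First reconciliation: once the dry run completes, the refresh annotation is removed. */
    static Annotation annotationAfterDryRun(Annotation current) {
        return current == Annotation.REFRESH ? Annotation.NONE : current;
    }

    /** Later reconciliation: the flag can depend only on what is still visible on the CR. */
    static boolean stopOngoingExecution(State state, Annotation annotation) {
        return state == State.PROPOSAL_READY && annotation == Annotation.APPROVE;
    }
}
```

A CR that was refreshed and a CR that never was both arrive at `stopOngoingExecution(PROPOSAL_READY, APPROVE)`, so the function cannot return different values for them. Setting the flag only after a refresh would need some extra marker carried across reconciliations.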

tinaselenge · 2024-11-05T11:02:45Z

Closing the PR due to #10571 (comment). I will open a separate PR to add tests for the current behaviour of stopping execution via the stop endpoint when the refresh annotation is applied.

tinaselenge force-pushed the fix-rebalancing-bug branch from c0dfd10 to a574286 Compare October 16, 2024 12:34

tinaselenge marked this pull request as ready for review October 16, 2024 12:35

tinaselenge requested review from ppatierno and ShubhamRwt October 16, 2024 12:39

ppatierno reviewed Oct 16, 2024

ppatierno requested a review from a team October 16, 2024 13:41

ppatierno added this to the 0.44.0 milestone Oct 16, 2024

tinaselenge added 2 commits October 17, 2024 11:53

Add stop_ongoing_execution flag to rebalance requests for full run

c8b8da4

Signed-off-by: Gantigmaa Selenge <[email protected]>

Add a testcase to KafkaRebalanceStateMachineTest

424bafd

Signed-off-by: Gantigmaa Selenge <[email protected]>

tinaselenge force-pushed the fix-rebalancing-bug branch from a574286 to 424bafd Compare October 17, 2024 10:54

ppatierno reviewed Oct 21, 2024

.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java

scholzj modified the milestones: 0.44.0, 0.45.0 Oct 21, 2024

Address review comment

c591f9d

Signed-off-by: Gantigmaa Selenge <[email protected]>

tinaselenge marked this pull request as draft October 25, 2024 10:36

tinaselenge closed this Nov 5, 2024
