Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stop_ongoing_execution flag to rebalance requests for full run #10703

Closed
wants to merge 3 commits into from

Conversation

tinaselenge
Copy link
Contributor

@tinaselenge tinaselenge commented Oct 11, 2024

Type of change

Select the type of your PR

  • Bugfix

Description

This adds an option to set stop_ongoing_execution flag.
The new flag is set when sending rebalancing requests for full run to avoid "Cannot start a new execution while there is an ongoing execution" error. This error can happen even if the operator calls CC's endpoint /stop_proposal_execution before sending a new rebalance request. The reason for this is documented here as:

Note that Cruise Control does not wait for the ongoing batch to finish when it stops execution, i.e. the in-progress batch may still be running after Cruise Control stops the execution.

Here is the possible flow that causes this issue:

  1. Currently KafkaRebalance CR for removing brokers is in Rebalancing state
  2. User updates the CR with a refresh annotation and with a new set of brokers to remove
  3. The operator sends a request to /stop_proposal_execution endpoint to stop the current rebalance operation
  4. The request completes successfully, however there are still in-progress batch for the current balance operation in CC.
  5. The operator sends a request for a new proposal for the updated set of brokers to remove.
  6. The new proposal is ready, therefore the KafkaRebalance is in ProposalReady state
  7. The operator sends a request to execute the removal of the updated set of brokers
  8. This request fails because there are still in-progress batch of the previous rebalance operation.
  9. KafkaRebalance is in NotReady state due to this failure.

The fix is to update step 7, to send the request with "stop_ongoing_execution" flag set to true, so that CC first stops the still in-progress batch of the previous rebalance operation before processing the new rebalance request. The flow would become:

  1. The operator sends a request to execute the removal of the updated set of brokers with "stop_ongoing_execution" flag set to true
  2. CC waits for the still in-progress batch of the previous rebalance operation to stop
  3. CC processes the request to execute the removal of the updated set of brokers
  4. The KafkaRebalance is in Ready.

There were no test cases for the existing flow where operator calls /stop_proposal_execution when KafkaRebalance is in Rebalancing state and then refresh annotation is applied. These missing tests cases are added , however they do not test the new flow because we cannot mock the internal state of CC to have still in-progress batch.

Closes #10571

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@tinaselenge tinaselenge marked this pull request as ready for review October 16, 2024 12:35
Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good, I left one comment.
Other than that, could you please add one more test in the KafkaRebalanceStateMachineTest class please?

@@ -862,7 +862,7 @@ private Future<MapAndStatus<ConfigMap, KafkaRebalanceStatus>> onProposalReady(Re
return configMapOperator.getAsync(kafkaRebalance.getMetadata().getNamespace(), kafkaRebalance.getMetadata().getName()).compose(loadmap -> Future.succeededFuture(new MapAndStatus<>(loadmap, buildRebalanceStatusFromPreviousStatus(kafkaRebalance.getStatus(), StatusUtils.validate(reconciliation, kafkaRebalance)))));
case approve:
LOGGER.debugCr(reconciliation, "Annotation {}={}", ANNO_STRIMZI_IO_REBALANCE, KafkaRebalanceAnnotation.approve);
return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder);
return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are using the stop ongoing execution only when we are approving a proposal ... what about refreshing while it's in ProposalPending or Rebalancing? (of course refreshing in ProposalReady doesn't make sense because there is nothing going on).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to stop ongoing execution when approving a proposal in the first place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point ... what's the use case you saw @tinaselenge ?

Copy link
Contributor Author

@tinaselenge tinaselenge Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is my understanding of the code based on my recreation of the problem, please correct me if I'm wrong. The issue mainly happens when you refresh, while the KafkaRabalance is in Rebalancing state (let's say rebalance_1 is in progress). When refresh annotation applied, we send a request to stop the ongoing rebalance operation, however this does not stop rebalance_1 completely, and that's the problem. Immediately after the stop request completes, we send a request for a new proposal (dry_run=true), then the state becomes ProposalReady. If auto-approval is set or user manually approved this new proposal, we send a new rebalance request (dry_run=false), let's call it rebalance_2. However the request for rebalance_2 fails if the rebalance_1 is still in progress. This change makes sure that rebalance_1 is completely stopped and then the request for rebalance_2 is processed. I tested this only with auto-approval annotation, however I think it is possible to happen with manual approval by user, if they approved it quickly and rebalance_1 was taking a long time.

I don't think we need this flag set for ProposalPending or Rebalancing states, because when refresh annotation is applied while on these 2 states, we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @tinaselenge what you described is what I faced during auto-rebalance where "auto-approval" is set and people can ask two consecutive scale up/down which result in two consecutive rebalancing requests.

I don't think we need this flag set for ProposalPending or Rebalancing states, because when refresh annotation is applied while on these 2 states, we only send a request for new proposal (dry_run=false). We only need this flag set when we send dry_run=true requests.

Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."

Because in PendingProposal we obviously ask for a new ... proposal with dry_run=true not false.
Anyway even when we ask for a new proposal, while a proposal is still processing, we should be sure that CC doesn't reply the same way so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking on CC codebase). The same applied in Rebalancing where we ask for a new proposal (dry_run=true) is refresh is applied.

Long story short, the question is, is stop going execution flag valid only when you run an actual rebalance or even when you asked to CC to process a proposal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."

Yes, I meant the other way around. Sorry!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in PendingProposal we obviously ask for a new ... proposal with dry_run=true not false.
Anyway even when we ask for a new proposal, while a proposal is still processing, we should be sure that CC doesn't reply the same way so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking on CC codebase). The same applied in Rebalancing where we ask for a new proposal (dry_run=true) is refresh is applied.

I agree that we should stop the ongoing execution when we ask for a new proposal due to "refresh" annotation so the optimisation calculation is up to date. And we do already have this logic, if you see the "Rebalancing()" function. In this function, the operator first calls the stop endpoint and then requests a new proposal. The issue is sometimes CC internally can still have in progress batch, even though stop endpoint was called. I can spend sometime looking into CC code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long story short, the question is, is stop going execution flag valid only when you run an actual rebalance or even when you asked to CC to process a proposal?

It seems to be effective when we run an actual rebalance, not when we asked for a proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is sometimes CC internally can still have in progress batch, even though stop endpoint was called.

So it means that even if we call the stopExecution we should then call the requestRebalance with the stop_ongoing_execution=true even in the onRebalancing. At this point I was wondering if we should actually totally remove the stopExecution call and just pass the corresponding flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be still beneficial to have the stopExecution call because it gets called before requesting a new proposal?The flag is set in a later call to request an actual rebalance operation but the existing rebalance might be already stopped by the earlier stopExecution, and in the case it's not, the flag helps.

@ppatierno ppatierno requested a review from a team October 16, 2024 13:41
@ppatierno ppatierno added this to the 0.44.0 milestone Oct 16, 2024
@ppatierno
Copy link
Member

@tinaselenge could you rebase against main, I merged another PR fixing some reconciliation issues on KafkaRebalance.

@tinaselenge
Copy link
Contributor Author

Thanks for the review!
@ppatierno I have added a test case to KafkaRebalanceStateMachineTest and then rebased against main.

@tinaselenge
Copy link
Contributor Author

I also updated the PR description with more details.

@scholzj scholzj modified the milestones: 0.44.0, 0.45.0 Oct 21, 2024
Signed-off-by: Gantigmaa Selenge <[email protected]>
@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tinaselenge
Copy link
Contributor Author

This might not be the best way to solve this issue, as this change would result in stopping other unrelated executions such as topic reconfiguration whenever we request a full run rebalance (regardless of the refresh annotation). It doesn't seem to be straight forward as adding this flag. Ideally, we would add this flag only when full rebalance request being sent after refresh annotation is applied, however, currently we only send dry run rebalance request after refresh annotation is applied. Once the dry run rebalance is completed, we remove the refresh annotation and start a new reconciliation. In this new reconciliation, we request a full run rebalance but at this point, we don't know if there was a refresh annotation or not, so we can't set the flag based on it.

@tinaselenge tinaselenge marked this pull request as draft October 25, 2024 10:36
@tinaselenge
Copy link
Contributor Author

Closing the PR due to #10571 (comment). I will open a separate PR for adding tests for the current behaviour of stopping execution via stop endpoint on renew annotation.

@tinaselenge tinaselenge closed this Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants