-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add stop_ongoing_execution flag to rebalance requests for full run #10703
Conversation
c0dfd10
to
a574286
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly good, I left one comment.
Other than that, could you please add one more test in the KafkaRebalanceStateMachineTest
class please?
@@ -862,7 +862,7 @@ private Future<MapAndStatus<ConfigMap, KafkaRebalanceStatus>> onProposalReady(Re | |||
return configMapOperator.getAsync(kafkaRebalance.getMetadata().getNamespace(), kafkaRebalance.getMetadata().getName()).compose(loadmap -> Future.succeededFuture(new MapAndStatus<>(loadmap, buildRebalanceStatusFromPreviousStatus(kafkaRebalance.getStatus(), StatusUtils.validate(reconciliation, kafkaRebalance))))); | |||
case approve: | |||
LOGGER.debugCr(reconciliation, "Annotation {}={}", ANNO_STRIMZI_IO_REBALANCE, KafkaRebalanceAnnotation.approve); | |||
return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder); | |||
return requestRebalance(reconciliation, host, apiClient, kafkaRebalance, false, rebalanceOptionsBuilder, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are using the stop ongoing execution only when we are approving a proposal ... what about refreshing while it's in ProposalPending or Rebalancing? (of course refreshing in ProposalReady doesn't make sense because there is nothing going on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to stop ongoing execution when approving a proposal in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point ... what's the use case you saw @tinaselenge ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is my understanding of the code based on my recreation of the problem, please correct me if I'm wrong. The issue mainly happens when you refresh, while the KafkaRabalance is in Rebalancing state (let's say rebalance_1 is in progress). When refresh annotation applied, we send a request to stop the ongoing rebalance operation, however this does not stop rebalance_1 completely, and that's the problem. Immediately after the stop request completes, we send a request for a new proposal (dry_run=true), then the state becomes ProposalReady. If auto-approval is set or user manually approved this new proposal, we send a new rebalance request (dry_run=false), let's call it rebalance_2. However the request for rebalance_2 fails if the rebalance_1 is still in progress. This change makes sure that rebalance_1 is completely stopped and then the request for rebalance_2 is processed. I tested this only with auto-approval annotation, however I think it is possible to happen with manual approval by user, if they approved it quickly and rebalance_1 was taking a long time.
I don't think we need this flag set for ProposalPending or Rebalancing states, because when refresh annotation is applied while on these 2 states, we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @tinaselenge what you described is what I faced during auto-rebalance where "auto-approval" is set and people can ask two consecutive scale up/down which result in two consecutive rebalancing requests.
I don't think we need this flag set for ProposalPending or Rebalancing states, because when refresh annotation is applied while on these 2 states, we only send a request for new proposal (dry_run=false). We only need this flag set when we send dry_run=true requests.
Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."
Because in PendingProposal
we obviously ask for a new ... proposal with dry_run=true not false.
Anyway even when we ask for a new proposal, while a proposal is still processing, we should be sure that CC doesn't reply the same way so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking on CC codebase). The same applied in Rebalancing
where we ask for a new proposal (dry_run=true) is refresh is applied.
Long story short, the question is, is stop going execution flag valid only when you run an actual rebalance or even when you asked to CC to process a proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean the other way around? I mean "we only send a request for new proposal (dry_run=true). We only need this flag set when we send dry_run=false requests."
Yes, I meant the other way around. Sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in PendingProposal we obviously ask for a new ... proposal with dry_run=true not false.
Anyway even when we ask for a new proposal, while a proposal is still processing, we should be sure that CC doesn't reply the same way so we should stop execution anyway (maybe not, because it's different from an actual rebalance, but it's worth checking on CC codebase). The same applied in Rebalancing where we ask for a new proposal (dry_run=true) is refresh is applied.
I agree that we should stop the ongoing execution when we ask for a new proposal due to "refresh" annotation so the optimisation calculation is up to date. And we do already have this logic, if you see the "Rebalancing()" function. In this function, the operator first calls the stop endpoint and then requests a new proposal. The issue is sometimes CC internally can still have in progress batch, even though stop endpoint was called. I can spend sometime looking into CC code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long story short, the question is, is stop going execution flag valid only when you run an actual rebalance or even when you asked to CC to process a proposal?
It seems to be effective when we run an actual rebalance, not when we asked for a proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is sometimes CC internally can still have in progress batch, even though stop endpoint was called.
So it means that even if we call the stopExecution
we should then call the requestRebalance
with the stop_ongoing_execution=true even in the onRebalancing
. At this point I was wondering if we should actually totally remove the stopExecution
call and just pass the corresponding flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be still beneficial to have the stopExecution call because it gets called before requesting a new proposal?The flag is set in a later call to request an actual rebalance operation but the existing rebalance might be already stopped by the earlier stopExecution, and in the case it's not, the flag helps.
@tinaselenge could you rebase against main, I merged another PR fixing some reconciliation issues on KafkaRebalance. |
Signed-off-by: Gantigmaa Selenge <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
a574286
to
424bafd
Compare
Thanks for the review! |
I also updated the PR description with more details. |
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Show resolved
Hide resolved
Signed-off-by: Gantigmaa Selenge <[email protected]>
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
This might not be the best way to solve this issue, as this change would result in stopping other unrelated executions such as topic reconfiguration whenever we request a full run rebalance (regardless of the refresh annotation). It doesn't seem to be straight forward as adding this flag. Ideally, we would add this flag only when full rebalance request being sent after refresh annotation is applied, however, currently we only send dry run rebalance request after refresh annotation is applied. Once the dry run rebalance is completed, we remove the refresh annotation and start a new reconciliation. In this new reconciliation, we request a full run rebalance but at this point, we don't know if there was a refresh annotation or not, so we can't set the flag based on it. |
Closing the PR due to #10571 (comment). I will open a separate PR for adding tests for the current behaviour of stopping execution via stop endpoint on renew annotation. |
Type of change
Select the type of your PR
Description
This adds an option to set stop_ongoing_execution flag.
The new flag is set when sending rebalancing requests for full run to avoid "Cannot start a new execution while there is an ongoing execution" error. This error can happen even if the operator calls CC's endpoint
/stop_proposal_execution
before sending a new rebalance request. The reason for this is documented here as:Here is the possible flow that causes this issue:
/stop_proposal_execution
endpoint to stop the current rebalance operationThe fix is to update step 7, to send the request with "stop_ongoing_execution" flag set to true, so that CC first stops the still in-progress batch of the previous rebalance operation before processing the new rebalance request. The flow would become:
There were no test cases for the existing flow where operator calls
/stop_proposal_execution
when KafkaRebalance is in Rebalancing state and then refresh annotation is applied. These missing tests cases are added , however they do not test the new flow because we cannot mock the internal state of CC to have still in-progress batch.Closes #10571
Checklist
Please go through this checklist and make sure all applicable tasks have been done