Use patch instead of update for GroupSnapshots, VolumeSnapshots, PVCs #1019
Signed-off-by: Tiger Kaovilai <[email protected]>
remove debugging code
Signed-off-by: Tiger Kaovilai <[email protected]>
remove more update calls
Signed-off-by: Tiger Kaovilai <[email protected]>
Fix patch json unmarshal unitTests comparison failures
Signed-off-by: Tiger Kaovilai <[email protected]>
Fix tests in reactor by not modifying original for patch
Signed-off-by: Tiger Kaovilai <[email protected]>
pkg/utils/patch.go
Outdated
    for _, i := range indexes {
        patches = append(patches, PatchOp{
            Op:   "remove",
            Path: "/metadata/finalizers/" + fmt.Sprint(i),
        })
    }
This is racy. Consider another controller or user that adds a finalizer to the object while this loop runs. Since the PATCH removes entries by index, it will remove the wrong item.
Is there a way to remove a value, rather than an index, via Patch()? If not, then please stick to Update().
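To make the concern concrete, here is a minimal sketch of what an index-based removal does when the list changes between observation and patch. It only simulates the JSON Patch "remove" semantics on a Go slice; the finalizer names and the helpers `indexOf` and `removeAt` are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// indexOf returns the position of name in list, or -1 if it is absent.
func indexOf(list []string, name string) int {
	for i, f := range list {
		if f == name {
			return i
		}
	}
	return -1
}

// removeAt mimics a JSON Patch "remove" on /metadata/finalizers/<idx>:
// it deletes whatever currently sits at idx without checking its value.
func removeAt(list []string, idx int) []string {
	if idx < 0 || idx >= len(list) {
		return list
	}
	out := make([]string, 0, len(list)-1)
	out = append(out, list[:idx]...)
	return append(out, list[idx+1:]...)
}

func main() {
	// What the controller saw when it computed the patch (hypothetical names).
	observed := []string{"example.com/first", "example.com/ours", "example.com/theirs"}
	idx := indexOf(observed, "example.com/ours")

	// Before the patch is applied, another actor removes its own finalizer,
	// so every later entry shifts down by one position.
	live := []string{"example.com/ours", "example.com/theirs"}

	// The stale index now points at someone else's finalizer.
	fmt.Println(removeAt(live, idx))
}
```

In this simulated run the controller's own finalizer survives and the other actor's finalizer is the one that disappears, which is the "wrong item" case described above.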
I believe there is.
One way is to remove all finalizers and then add the remaining ones back in a single call. Let me try that.
FYI, I already have it working. I'm just adding patch support for PVCs to the framework_test reactor.
If this is still considered racy, then we will have to wait for json-patch/json-patch2#18, and I will move back to update. Let me know.
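One option that stays within standard JSON Patch (RFC 6902) is to put a "test" operation in front of the replacement, so that the whole patch is rejected if the finalizer list no longer matches what the controller observed. The sketch below only builds such a patch body; it is not what this PR implements. The `Value` field on `PatchOp` and the function `guardedFinalizerPatch` are assumptions for illustration, and the caller would still need to re-read the object and retry when the test fails.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PatchOp is one RFC 6902 operation. The Value field is an assumption:
// the snippet under review only shows Op and Path.
type PatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// guardedFinalizerPatch (hypothetical helper) builds a patch that replaces
// the finalizer list only if it still equals the observed list.
func guardedFinalizerPatch(observed []string, remove string) ([]byte, error) {
	kept := make([]string, 0, len(observed))
	for _, f := range observed {
		if f != remove {
			kept = append(kept, f)
		}
	}
	ops := []PatchOp{
		// Fails the whole patch if anyone changed the list in the meantime.
		{Op: "test", Path: "/metadata/finalizers", Value: observed},
		{Op: "replace", Path: "/metadata/finalizers", Value: kept},
	}
	return json.Marshal(ops)
}

func main() {
	patch, err := guardedFinalizerPatch(
		[]string{"example.com/other", "example.com/ours"},
		"example.com/ours",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

A failed "test" leaves the caller in the same position as an update conflict: the data was stale, and the work has to be recomputed from a fresh read.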
It probably is still racy, hmm. But at least external-snapshotter won't hang. A user could find that the finalizer they added isn't there, but at least this won't have the "wrong index" issue.
Another option would be to use update, but test that it can recover from the "out of date, please apply again" error.
If I understand it correctly, the current version does not fix the race. It blindly removes all finalizers and adds back those that were known at the time the controller processed the VolumeSnapshot / VolumeSnapshotContent.
It will again erase any finalizers added concurrently with the controller's work.
"While it is a bandaid, it is better than nothing"
No. You are breaking someone else's finalizers. This is not good behavior.
I think the whole fear of the "the object has been modified" error is unjustified. It tells the controller that it has been working with stale data. The controller should check what has changed and try again if the work is still applicable, or it may discover that the work is no longer needed. In this case, the VolumeSnapshot / Content should be re-queued with exponential backoff.
"there needs to be tests that ensures update fail calls in external-snapshotter are recoverable within few seconds, not over 10 minutes which is what we have been seeing."
If that's true, then this is the part that needs to be fixed, rather than working around it by using patch blindly everywhere. Can you reliably reproduce the issue, e.g. in a unit test? It should then be easy to debug what's causing the delay.
Yes, we've been hitting this issue everywhere in our CI, and other Velero users have been hitting it as well.
    {
        name:             "2-4 - successful remove Snapshot finalizer after update conflict",
        initialSnapshots: newSnapshotArray("snap2-4", "snapuid2-4", "claim2-4", "", classSilver, "", &False, nil, nil, nil, false, true, nil),
        initialClaims:    newClaimArray("claim2-4", "pvc-uid2-4", "1Gi", "volume2-4", v1.ClaimBound, &classEmpty),
        test:             testRemoveSnapshotFinalizerAfterUpdateConflict,
        expectSuccess:    true,
        errors: []reactorError{
            {"update", "volumesnapshots", errors.NewConflict(crdv1.Resource("volumesnapshots"), "snap2-4", nil)},
        },
    },
Added this case here.
#1023 (review)
The requeue with new data shouldn't take more than a minute. We've seen the external-snapshotter controller stuck for 10+ minutes. Maybe the timeout or backoff being used needs changing.
"reliably reproduce the issue"
50% of the time prior to #876.
After #876 it's much improved, but OCP QE noted that update calls added since then still cause issues sometimes.
Thank you @kaovilai !!
Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Tiger Kaovilai [email protected]
What type of PR is this?
What this PR does / why we need it:
This PR cleans up some of the unit test reactor code and eliminates the update calls that I can see, fixing unit tests to accommodate the changes.
Previously, an update call was required simply because the unit tests were broken: patch calls were modifying the original input, causing comparison failures.
Which issue(s) this PR fixes:
Fixes #748
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
This PR extends the work in #876 to update calls that were added later.