Wait for namespace to finish terminating on restore. #691
👍 to detecting that a namespace is terminating when we're about to restore it. In terms of notifications, we don't yet have any way to provide updates to the user while a backup or restore is in flight. We have #20 and #21 to track these enhancements, but we haven't started working on them. Please feel free to comment on either issue with suggestions. One possible approach is to use Kubernetes. We will need to figure out timings/timeouts. I don't have a solid answer right now.
Sounds good. Thanks.
Separate from, but related to, #469
Expanding this a bit to include the situation where the user restores without first removing the namespace. When we reach the hard cap timeout, let the user know whether the namespace was not empty or was terminating. The enhancement for pre-existing namespace functionality is requested in #469
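To illustrate the detection idea discussed above, here is a minimal client-go sketch (not Ark's actual implementation; the helper name and error handling are assumptions) for checking whether a namespace is still terminating before restoring into it:

```go
import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// namespaceIsTerminating reports whether the namespace still exists and has a
// pending delete (Terminating phase). A NotFound error means it is already gone.
func namespaceIsTerminating(client kubernetes.Interface, name string) (bool, error) {
	ns, err := client.CoreV1().Namespaces().Get(name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return ns.Status.Phase == corev1.NamespaceTerminating, nil
}
```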
testing porter
Options:
Should the timeout be a parameter on the
I think I'd probably start with just a server-level timeout that's configurable by a server flag, with a sane default. Individual restore-level overrides probably aren't necessary.
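A rough idea of what that server-level setting could look like, assuming a cobra/pflag-style flag (the flag name, default, and struct are placeholders, not the shipped option):

```go
import (
	"time"

	"github.com/spf13/pflag"
)

// serverConfig holds a hypothetical server-wide timeout for waiting on
// terminating namespaces/PVs before a restore gives up.
type serverConfig struct {
	terminatingResourceTimeout time.Duration
}

func (c *serverConfig) bindFlags(flags *pflag.FlagSet) {
	flags.DurationVar(&c.terminatingResourceTimeout, "terminating-resource-timeout", 10*time.Minute,
		"how long to wait for terminating namespaces/persistent volumes to be deleted before failing the restore")
}
```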
Fixes vmware-tanzu#691 Signed-off-by: Nolan Brubaker <[email protected]>
This change also waits for persistent volumes that are or had been bound to pods/persistent volume claims in a terminating namespace to be fully deleted before trying to restore them from a snapshot. Fixes vmware-tanzu#691 Signed-off-by: Nolan Brubaker <[email protected]>
Updated milestone to reflect that #826 is likely to land in 0.10.0
@jhamilton1 So as we've been iterating on this, we've tried just waiting on the namespace itself. That leads to the following scenario:
We've explored looking at PVs and waiting on them too, as well as walking up the tree to find the namespace associated with a given PV, but haven't really come up with a solution we like. My question is this: is the scenario I described ok, or is the expectation in this issue that the PV would also be restored?
@ncdc going to take over from Jesse.
@rbankston, @ncdc, and I had a call, and this is what we're going to move forward with:
Our guarantee will be that we wait for the PV and/or the NS as long as the top-level delete was issued before the restore. The 'top level' could be a namespace, or a selector that matches the relevant PVC. If the restore has already started and a delete request is issued during it, we make no guarantees.
Also check the PV to see if its reclaim policy is Delete and if it has been released.
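As a sketch of that PV check (the helper name is illustrative, not the actual Ark code), one way to decide whether a released PV is expected to disappear:

```go
import (
	corev1 "k8s.io/api/core/v1"
)

// pvPendingDeletion reports whether a PV is expected to be deleted by the
// cluster: its reclaim policy is Delete and it has been released from its claim.
func pvPendingDeletion(pv *corev1.PersistentVolume) bool {
	return pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimDelete &&
		pv.Status.Phase == corev1.VolumeReleased
}
```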
Thanks for verifying how this should look and feel for users going forward. Can I let the customer know that v0.10.1 is the version this should land in?
@rbankston yes, after we have sufficiently tested it.
Thanks @ncdc for the verification.
v0.10.1 was a bugfix - I think my unit tests are covering most cases and have asked @skriss to review today. We may need help testing the changes to be sure the rest of the restore flow wasn't adversely affected.
@rbankston would you or someone from the CRE team be available to help test the changes to ensure they meet the customer's requirements?
@rosskukulinski sure, I can take a look. Waiting to hear back if this resolves everything.
@rbankston Are you waiting on our team, or the customer? @skriss double-checked this and it appeared to work for him.
@nrb how would Ralph test your PR? Can we get him a container & client build to test with?
My process was this:
@nrb can you build a container image and push it somewhere for @rbankston to test with? Even just on Docker Hub somewhere?
@rosskukulinski got the container built thanks to the great docs, and verified that I'm able to restore something that's still in the process of deleting, without errors, as expected. Thanks for the steps @nrb.
Awesome! That's great to hear @rbankston. Thanks for lending a hand with the testing.
Describe the solution you'd like
If the user needs to restore an entire namespace, they will first need to delete the namespace. It would be helpful if ark would wait for the namespace to finish terminating and then execute the restore. It would also be nice if ark informed the user that it is waiting for the termination to complete, so the user has some idea of how long the restore will take.
Anything else you would like to add:
Does there need to be a hard cap on how long ark will wait before giving up and notifying the user that there is some issue?
Depending on the user-defined settings for a particular pod, deleting a namespace can take a considerable amount of time. Parameters such as `terminationGracePeriodSeconds` can have a significant impact on the termination time. Should we remind the user that certain conditions will prolong the restore time?
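To make the wait-plus-hard-cap idea concrete, here is a minimal polling sketch using client-go's wait package (the interval, timeout handling, and function name are assumptions, not the actual implementation):

```go
import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNamespaceDeletion polls until the namespace is fully gone, or fails
// once the hard-cap timeout is reached so the restore never hangs forever.
func waitForNamespaceDeletion(client kubernetes.Interface, name string, timeout time.Duration) error {
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Namespaces().Get(name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace deleted; safe to restore
		}
		if err != nil {
			return false, err
		}
		return false, nil // still terminating; keep waiting
	})
	if err != nil {
		return fmt.Errorf("namespace %s still terminating after %s: %v", name, timeout, err)
	}
	return nil
}
```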
Environment:

- `ark version`: v0.9.0
- `kubectl version`: v1.10.3
- `/etc/os-release`: Ubuntu 16.04.4 LTS

Impact on the customer:
The customer was unable to restore a namespace.
Zendesk Ticket ID:
ZD ticket 255
Zendesk Ticket Priority/Urgency:
Normal