
Wait for namespace to finish terminating on restore. #691

Closed
jhamilton1 opened this issue Jul 20, 2018 · 26 comments

@jhamilton1
Contributor

jhamilton1 commented Jul 20, 2018

Describe the solution you'd like
If the user needs to restore an entire namespace, they will first need to delete that namespace. It would be helpful if ark waited for the namespace to finish terminating and then executed the restore. It would also be nice if ark informed the user that it is waiting for the termination to complete, so that the user understands why time is passing before the restore completes.
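For illustration only (this is not something ark does today), such a wait-with-logging step might look roughly like the client-go polling loop below; the helper name, poll interval, and timeout parameter are invented for this sketch.

```go
// Package restorewait contains illustrative sketches only.
package restorewait

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNamespaceGone polls until the namespace no longer exists or the
// timeout expires, logging so the user knows why the restore is delayed.
func waitForNamespaceGone(ctx context.Context, client kubernetes.Interface, ns string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace fully terminated; safe to restore into it
		}
		if err != nil {
			return false, err
		}
		fmt.Printf("waiting for namespace %q to finish terminating before restoring...\n", ns)
		return false, nil
	})
}
```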

Anything else you would like to add:

Does there need to be a hard cap on how long ark will wait before giving up and notifying the user that something is wrong?

Depending on the user-defined settings for a particular pod, deleting a namespace can take a considerable amount of time. Parameters such as terminationGracePeriodSeconds can have a significant impact on the termination time. Should we remind the user that certain conditions will prolong the restore time?
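For example (an illustrative pod only, not from the customer's environment), a spec like the following can keep its namespace in Terminating for up to five minutes after deletion is requested:

```go
package restorewait

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// slowPod is an illustrative pod whose five-minute grace period can keep its
// namespace in the Terminating phase well after `kubectl delete namespace`,
// if the container does not exit promptly on SIGTERM.
func slowPod() corev1.Pod {
	grace := int64(300) // equivalent to terminationGracePeriodSeconds: 300 in YAML
	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "slow-shutdown", Namespace: "mynamespace"},
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers:                    []corev1.Container{{Name: "app", Image: "nginx"}},
		},
	}
}
```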

Environment:

  • Ark version (use ark version): v0.9.0
  • Kubernetes version (use kubectl version): v1.10.3
  • Kubernetes installer & version: Heptio quickstart
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS

Impact on the customer:
The customer was unable to restore a namespace.

Zendesk Ticket ID:
ZD ticket 255

Zendesk Ticket Priority/ Urgency
Normal

@ncdc
Contributor

ncdc commented Jul 20, 2018

👍 to detecting that a namespace is terminating when we're about to restore it.

In terms of notifications, we don't yet have any way to provide updates to the user while a backup or restore is in flight. We have #20 and #21 to track these enhancements, but we haven't started working on them. Please feel free to comment on either issue with suggestions. One possible approach is to use Kubernetes events.
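For what it's worth, the events approach could be wired up with client-go's standard event recorder, roughly like the sketch below; the component name, reason, and helper names are placeholders, and the object passed in would need to be registered in the recorder's scheme.

```go
package restorewait

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newEventRecorder wires up a standard client-go event recorder; the
// component name "ark-restore" is a placeholder.
func newEventRecorder(client kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "ark-restore"})
}

// reportWaiting records a Normal event on the object being restored so that
// `kubectl describe` surfaces why the restore has not completed yet.
func reportWaiting(recorder record.EventRecorder, obj runtime.Object, namespace string) {
	recorder.Eventf(obj, corev1.EventTypeNormal, "WaitingForNamespaceTermination",
		"waiting for namespace %s to finish terminating before restoring it", namespace)
}
```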

We will need to figure out timings/timeouts. I don't have a solid answer right now.

@jhamilton1
Contributor Author

Sounds good. Thanks.

@ncdc ncdc mentioned this issue Jul 20, 2018
@rosskukulinski
Contributor

Separate from, but related to, #469

@jhamilton1
Contributor Author

jhamilton1 commented Jul 20, 2018

Expanding this a bit to include the situation where the user restores without first removing the namespace. When we reach the hard cap timeout, let the user know whether the namespace was not empty or was still terminating. The enhancement for pre-existing namespace functionality is requested in #469.

@jhamilton1 jhamilton1 changed the title Wait for namespace to finish terminating on restore. ZD Ticket #255 Wait for namespace to finish terminating on restore. Jul 23, 2018
@jhamilton1 jhamilton1 changed the title ZD Ticket #255 Wait for namespace to finish terminating on restore. ZD Ticket #255: Wait for namespace to finish terminating on restore. Jul 23, 2018
@rosskukulinski rosskukulinski added P1 - Important Enhancement/User End-User Enhancement to Velero labels Jul 24, 2018
@rosskukulinski rosskukulinski added this to the v1.0.0 milestone Jul 24, 2018
@jhamilton1 jhamilton1 changed the title ZD Ticket #255: Wait for namespace to finish terminating on restore. Ticket 255: Wait for namespace to finish terminating on restore. Jul 25, 2018
@jhamilton1
Contributor Author

testing porter

@rosskukulinski rosskukulinski added the Needs Product Blocked needing input or feedback from Product label Aug 6, 2018
@rosskukulinski
Contributor

Options:

@nrb
Contributor

nrb commented Aug 22, 2018

Should the timeout be a parameter on the ark restore create command? Perhaps a default set on ark server that can be overridden per restore? Or a combination of the two?

@skriss
Contributor

skriss commented Aug 22, 2018

I think I'd start with just a server-level timeout, configurable by a server flag, with a sane default. Individual restore-level overrides probably aren't necessary.
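Something like the following pflag registration would express that; the flag name, default, and struct are placeholders, not necessarily what eventually shipped.

```go
package server

import (
	"time"

	"github.com/spf13/pflag"
)

// serverConfig holds a subset of server settings; the field and flag name
// below are placeholders.
type serverConfig struct {
	terminatingResourceTimeout time.Duration
}

// BindFlags registers a single server-wide timeout with a sane default;
// per-restore overrides are deliberately omitted.
func (c *serverConfig) BindFlags(flags *pflag.FlagSet) {
	flags.DurationVar(&c.terminatingResourceTimeout, "terminating-resource-timeout", 10*time.Minute,
		"how long to wait for terminating namespaces and persistent volumes to be fully deleted before restoring them")
}
```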

@nrb nrb self-assigned this Sep 6, 2018
nrb pushed a commit to nrb/velero that referenced this issue Sep 7, 2018
nrb pushed a commit to nrb/velero that referenced this issue Sep 25, 2018
nrb pushed a commit to nrb/velero that referenced this issue Oct 2, 2018
This change also waits for persistent volumes that are or had been bound
to pods/persistent volume claims in a terminating namespace to be fully
deleted before trying to restore them from a snapshot.

Fixes vmware-tanzu#691

Signed-off-by: Nolan Brubaker <[email protected]>
nrb pushed a commit to nrb/velero that referenced this issue Oct 4, 2018
@rosskukulinski rosskukulinski modified the milestones: v1.0.0, v0.10.0 Oct 17, 2018
@rosskukulinski
Contributor

Updated milestone to reflect that #826 is likely to land in 0.10.0

nrb pushed a commit to nrb/velero that referenced this issue Nov 8, 2018
@nrb nrb modified the milestones: v0.10.0, v0.10.1 Nov 14, 2018
@nrb
Contributor

nrb commented Dec 5, 2018

@jhamilton1 As we've been iterating on this, we've tried just waiting on the namespace itself. That leads to the following scenario:

  1. User issues kubectl delete namespace mynamespace --wait=false
  2. User issues ark restore create --from-backup backup-with-mynamespace
  3. Ark walks down its prioritized resources and encounters PVs first.
  4. The PV associated with a PVC in mynamespace is not yet deleted, so Ark sees it and decides not to restore it.
  5. Ark gets to PVCs, which are namespaced, and waits for mynamespace to terminate.
  6. mynamespace disappears, and everything in the namespace is restored.
  7. Concurrently, the PV is deleted.
  8. End state: everything in the namespace is restored, but the PV was lost, because it still existed at the time Ark looked at it.

We've explored looking at PVs and waiting on them, too, as well as walking up the tree to look at a namespace associated with a given PV, but haven't really come up with a solution we like.

My question is this: Is the scenario I described ok, or is the expectation in this issue that the PV would also be restored?

@rbankston rbankston added the ZD255 label Dec 5, 2018
@rbankston rbankston changed the title Ticket 255: Wait for namespace to finish terminating on restore. Wait for namespace to finish terminating on restore. Dec 5, 2018
@rbankston

@ncdc I'm going to take over from Jesse.

@nrb
Contributor

nrb commented Dec 5, 2018

@rbankston, @ncdc, and I had a call, and this is what we're going to move forward with:

  • Wait up to a timeout for namespaces to delete, if they have a deletion timestamp.
  • For PVs:
    • Check if there's a deletion timestamp on the PV.
    • If not, check the associated PVC for a deletion timestamp.
    • If the PVC exists and has no deletion timestamp, check the PVC's namespace for a deletion timestamp.
    • If the namespace exists and has no deletion timestamp, we're done and will not wait for the PV.
    • If any of the PV, PVC, or namespace exist and have a deletion timestamp, wait up to the timeout before restoring the PV (sketched below).

Our guarantee will be that we wait for the PV and/or the namespace as long as the top-level delete was issued before the restore started. The 'top-level' delete could be of a namespace, or via a selector that matches the relevant PVC. If a delete request is issued after the restore has already started, we make no guarantees.
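Roughly, the PV check above could look like the following illustrative helper (written against a recent client-go, not the actual PR; the handling of an already-deleted PVC or namespace is one reading of the plan, which doesn't spell those cases out):

```go
package restorewait

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldWaitForPV reports whether the restore should wait (up to the timeout)
// for this PV to be fully deleted before restoring it from a snapshot.
func shouldWaitForPV(ctx context.Context, client kubernetes.Interface, pv *corev1.PersistentVolume) (bool, error) {
	if pv.DeletionTimestamp != nil {
		return true, nil
	}
	claim := pv.Spec.ClaimRef
	if claim == nil {
		return false, nil // unbound PV with no deletion timestamp: nothing to wait for
	}
	pvc, err := client.CoreV1().PersistentVolumeClaims(claim.Namespace).Get(ctx, claim.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// PVC already gone but the PV remains; treating this as "wait" is a judgment call.
		return true, nil
	}
	if err != nil {
		return false, err
	}
	if pvc.DeletionTimestamp != nil {
		return true, nil
	}
	ns, err := client.CoreV1().Namespaces().Get(ctx, claim.Namespace, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil // namespace already gone; cascading deletes may still be in flight
	}
	if err != nil {
		return false, err
	}
	return ns.DeletionTimestamp != nil, nil
}
```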

@ncdc
Contributor

ncdc commented Dec 5, 2018

Also check the PV to see if its reclaim policy is Delete and whether it has been released.

@rbankston

Thanks for verifying how this should look and feel for the users going forward. Can I let the customer know that v0.10.1 is the version this should land in?

@ncdc
Contributor

ncdc commented Dec 7, 2018

@rbankston yes, after we have sufficiently tested it.

@rbankston

Thanks @ncdc for the verification.

@rbankston

Hello @ncdc and @nrb, did this fix make it into the v0.10.1 release, or did it get pushed back? I'm not seeing any mention of it in the changelog.

@nrb
Contributor

nrb commented Jan 11, 2019

v0.10.1 was a bugfix-only release, so this didn't make it in. I think my unit tests cover most cases, and I've asked @skriss to review today. We may need help testing the changes to make sure the rest of the restore flow isn't adversely affected.

@rosskukulinski
Contributor

@rbankston would you or someone from the CRE team be available to help test the changes to ensure they meet the customer's requirements?

@rbankston

@rosskukulinski sure, I can take a look. Waiting to hear back on whether this resolves everything.

@nrb
Contributor

nrb commented Jan 15, 2019

@rbankston Are you waiting on our team, or the customer? @skriss double checked this and it appeared to work for him.

@rosskukulinski rosskukulinski removed the Needs Product Blocked needing input or feedback from Product label Jan 15, 2019
@rosskukulinski
Contributor

@nrb how would Ralph test your PR? Can we get him a container & client build to test with?

@nrb
Contributor

nrb commented Jan 15, 2019

My process was this:

  1. Create an nginx example application from our included YAML.
  2. Back up the nginx-example namespace.
  3. Delete the nginx-example namespace and immediately issue a restore command (kubectl delete namespace nginx-example --wait=false; ark restore create <mybackup>).
  4. The namespace, PVC, and PV should all be restored successfully. Previously, these objects would be marked as duplicates in the restore, since they still existed when Ark checked, but they would then be deleted soon after.

@rosskukulinski
Contributor

@nrb can you build a container image and push it somewhere for @rbankston to test with? Even just on Docker Hub?

@rbankston

@rosskukulinski I got the container built, thanks to the great docs, and verified that I can restore something that's still in the process of deleting, without errors, as expected. Thanks for the steps @nrb.

@rosskukulinski
Contributor

Awesome! That's great to hear @rbankston. Thanks for lending a hand with the testing.

@nrb nrb modified the milestones: v0.10.1, v0.11.0 Feb 7, 2019