
Wait for namespace to finish terminating on restore. #691

Closed
jhamilton1 opened this issue Jul 20, 2018 · 26 comments

@jhamilton1
Contributor

jhamilton1 commented Jul 20, 2018

Describe the solution you'd like
If the user needs to restore an entire namespace, they will first need to delete that namespace. It would be helpful if ark waited for the namespace to finish terminating and then executed the restore. It would also be nice if ark informed the user that it is waiting for the termination to complete, so that the user understands why time is passing before the restore completes.
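For illustration only (this is not something ark does today), such a wait-with-logging step might look roughly like the client-go polling loop below; the helper name, poll interval, and timeout parameter are invented for this sketch.

```go
// Package restorewait contains illustrative sketches only.
package restorewait

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNamespaceGone polls until the namespace no longer exists or the
// timeout expires, logging so the user knows why the restore is delayed.
func waitForNamespaceGone(ctx context.Context, client kubernetes.Interface, ns string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace fully terminated; safe to restore into it
		}
		if err != nil {
			return false, err
		}
		fmt.Printf("waiting for namespace %q to finish terminating before restoring...\n", ns)
		return false, nil
	})
}
```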

Anything else you would like to add:

Does there need to be a hard cap on how long ark will wait before giving up and notifying the user that something is wrong?

Depending on the user-defined settings for a particular pod, deleting a namespace can take a considerable amount of time. Parameters such as terminationGracePeriodSeconds can have a significant impact on the termination time. Should we remind the user that certain conditions will prolong the restore time?
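For example (an illustrative pod only, not from the customer's environment), a spec like the following can keep its namespace in Terminating for up to five minutes after deletion is requested:

```go
package restorewait

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// slowPod is an illustrative pod whose five-minute grace period can keep its
// namespace in the Terminating phase well after `kubectl delete namespace`,
// if the container does not exit promptly on SIGTERM.
func slowPod() corev1.Pod {
	grace := int64(300) // equivalent to terminationGracePeriodSeconds: 300 in YAML
	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "slow-shutdown", Namespace: "mynamespace"},
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers:                    []corev1.Container{{Name: "app", Image: "nginx"}},
		},
	}
}
```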

Environment:

  • Ark version (use ark version): v0.9.0
  • Kubernetes version (use kubectl version): v1.10.3
  • Kubernetes installer & version: Heptio quickstart
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS

Impact on the customer:
The customer was unable to restore a namespace.

Zendesk Ticket ID:
ZD ticket 255

Zendesk Ticket Priority/ Urgency
Normal

@ncdc
Contributor

ncdc commented Jul 20, 2018

👍 to detecting that a namespace is terminating when we're about to restore it.

In terms of notifications, we don't yet have any way to provide updates to the user while a backup or restore is in flight. We have #20 and #21 to track these enhancements, but we haven't started working on them. Please feel free to comment on either issue with suggestions. One possible approach is to use Kubernetes events.
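For what it's worth, the events approach could be wired up with client-go's standard event recorder, roughly like the sketch below; the component name, reason, and helper names are placeholders, and the object passed in would need to be registered in the recorder's scheme.

```go
package restorewait

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newEventRecorder wires up a standard client-go event recorder; the
// component name "ark-restore" is a placeholder.
func newEventRecorder(client kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "ark-restore"})
}

// reportWaiting records a Normal event on the object being restored so that
// `kubectl describe` surfaces why the restore has not completed yet.
func reportWaiting(recorder record.EventRecorder, obj runtime.Object, namespace string) {
	recorder.Eventf(obj, corev1.EventTypeNormal, "WaitingForNamespaceTermination",
		"waiting for namespace %s to finish terminating before restoring it", namespace)
}
```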

We will need to figure out timings/timeouts. I don't have a solid answer right now.

@jhamilton1
Contributor Author

Sounds good. Thanks.

@ncdc ncdc mentioned this issue Jul 20, 2018
@rosskukulinski
Contributor

Separate from, but related to, #469

@jhamilton1
Contributor Author

jhamilton1 commented Jul 20, 2018

Expanding this a bit to include the situation where the user restores without first removing the namespace. When we reach the hard cap timeout, let the user know whether the namespace was not empty or was still terminating. The enhancement for pre-existing namespace functionality is requested in #469.

@jhamilton1 jhamilton1 changed the title Wait for namespace to finish terminating on restore. ZD Ticket #255 Wait for namespace to finish terminating on restore. Jul 23, 2018
@jhamilton1 jhamilton1 changed the title ZD Ticket #255 Wait for namespace to finish terminating on restore. ZD Ticket #255: Wait for namespace to finish terminating on restore. Jul 23, 2018
@rosskukulinski rosskukulinski added P1 - Important Enhancement/User End-User Enhancement to Velero labels Jul 24, 2018
@rosskukulinski rosskukulinski added this to the v1.0.0 milestone Jul 24, 2018
@jhamilton1 jhamilton1 changed the title ZD Ticket #255: Wait for namespace to finish terminating on restore. Ticket 255: Wait for namespace to finish terminating on restore. Jul 25, 2018
@jhamilton1
Contributor Author

testing porter

@rosskukulinski rosskukulinski added the Needs Product Blocked needing input or feedback from Product label Aug 6, 2018
@rosskukulinski
Contributor

Options:

@nrb
Contributor

nrb commented Aug 22, 2018

Should the timeout be a parameter on the ark restore create command? Perhaps a default set on ark server that can be overridden per restore? Or a combination of the two?

@skriss
Contributor

skriss commented Aug 22, 2018

I think I'd start with just a server-level timeout, configurable by a server flag, with a sane default. Individual restore-level overrides probably aren't necessary.
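Something like the following pflag registration would express that; the flag name, default, and struct are placeholders, not necessarily what eventually shipped.

```go
package server

import (
	"time"

	"github.com/spf13/pflag"
)

// serverConfig holds a subset of server settings; the field and flag name
// below are placeholders.
type serverConfig struct {
	terminatingResourceTimeout time.Duration
}

// BindFlags registers a single server-wide timeout with a sane default;
// per-restore overrides are deliberately omitted.
func (c *serverConfig) BindFlags(flags *pflag.FlagSet) {
	flags.DurationVar(&c.terminatingResourceTimeout, "terminating-resource-timeout", 10*time.Minute,
		"how long to wait for terminating namespaces and persistent volumes to be fully deleted before restoring them")
}
```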

@nrb nrb self-assigned this Sep 6, 2018
nrb pushed a commit to nrb/velero that referenced this issue Sep 7, 2018
nrb pushed a commit to nrb/velero that referenced this issue Sep 25, 2018
nrb pushed a commit to nrb/velero that referenced this issue Oct 2, 2018
This change also waits for persistent volumes that are or had been bound
to pods/persistent volume claims in a terminating namespace to be fully
deleted before trying to restore them from a snapshot.

Fixes vmware-tanzu#691

Signed-off-by: Nolan Brubaker <[email protected]>
nrb pushed a commit to nrb/velero that referenced this issue Oct 4, 2018
@rosskukulinski rosskukulinski modified the milestones: v1.0.0, v0.10.0 Oct 17, 2018
@rosskukulinski
Contributor

Updated milestone to reflect that #826 is likely to land in 0.10.0

nrb pushed a commit to nrb/velero that referenced this issue Nov 8, 2018
@nrb nrb modified the milestones: v0.10.0, v0.10.1 Nov 14, 2018
@nrb
Contributor

nrb commented Dec 5, 2018

@jhamilton1 As we've been iterating on this, we've tried just waiting on the namespace itself. That leads to the following scenario:

  1. User issues kubectl delete namespace mynamespace --wait=false
  2. User issues ark restore create --from-backup backup-with-mynamespace
  3. Ark walks down its prioritized resources and encounters PVs first.
  4. The PV associated with a PVC in mynamespace is not yet deleted, so Ark sees it and decides not to restore it.
  5. Ark gets to PVCs, which are namespaced, and waits for mynamespace to terminate.
  6. mynamespace disappears, and everything in the namespace is restored.
  7. Concurrently, the PV is deleted.
  8. End state: everything in the namespace is restored, but the PV was lost, because it still existed at the time Ark looked at it.

We've explored looking at PVs and waiting on them, too, as well as walking up the tree to look at a namespace associated with a given PV, but haven't really come up with a solution we like.

My question is this: Is the scenario I described ok, or is the expectation in this issue that the PV would also be restored?

@rbankston rbankston added the ZD255 label Dec 5, 2018
@rbankston rbankston changed the title Ticket 255: Wait for namespace to finish terminating on restore. Wait for namespace to finish terminating on restore. Dec 5, 2018
@rbankston

@ncdc I'm going to take over from Jesse.

@nrb
Contributor

nrb commented Dec 5, 2018

@rbankston, @ncdc, and I had a call, and this is what we're going to move forward with:

  • Wait up to a timeout for namespaces to delete, if they have a deletion timestamp.
  • For PVs:
    • Check if there's a deletion timestamp on the PV.
    • If not, check the associated PVC for a deletion timestamp.
    • If the PVC exists and has no deletion timestamp, check the PVC's namespace for a deletion timestamp.
    • If the namespace exists and has no deletion timestamp, we're done and will not wait for the PV.
    • If any of the PV, PVC, or namespace exist and have a deletion timestamp, wait up to the timeout before restoring the PV (sketched below).

Our guarantee will be that we wait for the PV and/or the namespace as long as the top-level delete was issued before the restore started. The 'top-level' delete could be of a namespace, or via a selector that matches the relevant PVC. If a delete request is issued after the restore has already started, we make no guarantees.
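Roughly, the PV check above could look like the following illustrative helper (written against a recent client-go, not the actual PR; the handling of an already-deleted PVC or namespace is one reading of the plan, which doesn't spell those cases out):

```go
package restorewait

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldWaitForPV reports whether the restore should wait (up to the timeout)
// for this PV to be fully deleted before restoring it from a snapshot.
func shouldWaitForPV(ctx context.Context, client kubernetes.Interface, pv *corev1.PersistentVolume) (bool, error) {
	if pv.DeletionTimestamp != nil {
		return true, nil
	}
	claim := pv.Spec.ClaimRef
	if claim == nil {
		return false, nil // unbound PV with no deletion timestamp: nothing to wait for
	}
	pvc, err := client.CoreV1().PersistentVolumeClaims(claim.Namespace).Get(ctx, claim.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// PVC already gone but the PV remains; treating this as "wait" is a judgment call.
		return true, nil
	}
	if err != nil {
		return false, err
	}
	if pvc.DeletionTimestamp != nil {
		return true, nil
	}
	ns, err := client.CoreV1().Namespaces().Get(ctx, claim.Namespace, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil // namespace already gone; cascading deletes may still be in flight
	}
	if err != nil {
		return false, err
	}
	return ns.DeletionTimestamp != nil, nil
}
```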

@ncdc
Contributor

ncdc commented Dec 5, 2018

Also check the PV to see if its reclaim policy is Delete and whether it has been released.

@rbankston

Thanks for verifying how this should look and feel for the users going forward. Can I let the customer know that v0.10.1 is the version this should land in?

@ncdc
Contributor

ncdc commented Dec 7, 2018

@rbankston yes, after we have sufficiently tested it.

@rbankston

Thanks @ncdc for the verification.

@rbankston

Hello @ncdc and @nrb, did this fix make it into the v0.10.1 release, or did it get pushed back? I'm not seeing any mention of it in the changelog.

@nrb
Contributor

nrb commented Jan 11, 2019

v0.10.1 was a bugfix-only release, so this didn't make it in. I think my unit tests cover most cases, and I've asked @skriss to review today. We may need help testing the changes to make sure the rest of the restore flow isn't adversely affected.

@rosskukulinski
Contributor

@rbankston would you or someone from the CRE team be available to help test the changes to ensure they meet the customer's requirements?

@rbankston

@rosskukulinski sure, I can take a look. Waiting to hear back on whether this resolves everything.

@nrb
Contributor

nrb commented Jan 15, 2019

@rbankston Are you waiting on our team, or the customer? @skriss double checked this and it appeared to work for him.

@rosskukulinski rosskukulinski removed the Needs Product Blocked needing input or feedback from Product label Jan 15, 2019
@rosskukulinski
Contributor

@nrb how would Ralph test your PR? Can we get him a container & client build to test with?

@nrb
Contributor

nrb commented Jan 15, 2019

My process was this:

  1. Create an nginx example application from our included YAML.
  2. Back up the nginx-example namespace.
  3. Delete the nginx-example namespace and immediately issue a restore command (kubectl delete namespace nginx-example --wait=false; ark restore create <mybackup>).
  4. The namespace, PVC, and PV should all be restored successfully. Previously, these objects would be marked as duplicates in the restore, since they still existed when Ark checked, but they would then be deleted soon after.

@rosskukulinski
Contributor

@nrb can you build a container image and push it somewhere for @rbankston to test with? Even just on Docker Hub?

@rbankston

@rosskukulinski I got the container built, thanks to the great docs, and verified that I can restore something that's still in the process of deleting, without errors, as expected. Thanks for the steps @nrb.

@rosskukulinski
Contributor

Awesome! That's great to hear @rbankston. Thanks for lending a hand with the testing.

@nrb nrb modified the milestones: v0.10.1, v0.11.0 Feb 7, 2019