
Add support for delete and recreate option to ExistingResourcePolicy feature #6142

Open
shubham-pampattiwar opened this issue Apr 12, 2023 · 17 comments

Comments

@shubham-pampattiwar
Collaborator

Describe the problem/challenge you have

The ExistingResourcePolicy feature was implemented in Velero 1.9 with only two options, none and update. We would like to complete phase 2 of this feature's implementation, i.e. also provide the delete and recreate option.

Design link: #4613
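
For context, a Restore using the policy as it exists today looks roughly like this (field names per the current Restore API; only none and update are accepted, and the backup name is just an example value):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-with-update-policy
  namespace: velero
spec:
  backupName: my-backup
  # "none": resources that already exist in the cluster are left untouched.
  # "update": Velero attempts to patch existing resources to match the backup.
  existingResourcePolicy: update
```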

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@sseago
Collaborator

sseago commented Apr 12, 2023

Current location of the design is here: https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/existing-resource-policy_design.md

Note that there are two aspects of the second implementation phase:

  1. add the recreate option (in addition to none and update)
  2. implement the "approach 3" from the design -- add a second spec field, existingResourcePolicyOverrides, to allow per-resource-type overrides of the otherwise-defined existingResourcePolicy

The second one is important here -- since recreate is potentially destructive (it's possible that Velero will delete a cluster resource and be unable to recreate it), users may want to limit its use to resource types they know are safe. They may, for example, set existingResourcePolicy to update but set a recreate override for pods.
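
A minimal sketch of what that could look like on a Restore spec, assuming the existingResourcePolicyOverrides field name from the design and an illustrative (not finalized) schema for the override entries:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-with-recreate-override
  namespace: velero
spec:
  backupName: my-backup
  # Default policy for resources that already exist in the cluster.
  existingResourcePolicy: update
  # Hypothetical per-resource-type overrides ("approach 3" in the design);
  # recreate is limited to types the user knows are safe to delete and recreate.
  existingResourcePolicyOverrides:
    - resourceType: pods
      policy: recreate
```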

Also, as an implementation comment, I think we should only delete/recreate in cases where update fails. So even if policy is recreate, we first attempt update, and if that fails, we recreate.

@kaovilai
Contributor

kaovilai commented May 4, 2023

👋

@reasonerjt
Contributor

Per the comments in the PR, it seems there's no consensus on the implementation; we may move this out of v1.12.

@reasonerjt removed this from the v1.12 milestone on Aug 2, 2023
@reasonerjt added the 1.13-candidate label and removed the defer-candidate label on Aug 2, 2023
@reasonerjt added the Needs triage, Reviewed Q3 2023, and Needs Design labels and removed the Needs triage label on Aug 23, 2023
@kaovilai
Contributor

So I do not think we have agreement yet on how to approach this. I don't think we'll have a design in by feature freeze, so let's move it out to 1.14?

@kaovilai
Contributor

I can commit to having an implementation in before the current feature-complete date if we have design agreement by the first week of November.

@anshulahuja98
Collaborator

@kaovilai it would be great if we could have this feature shipped. Really looking forward to it.

My suggestion would be to do it via a namespace (NS) override route, for simplicity, rather than at a resource level.

@kaovilai
Contributor

do it in a NS override route

So the user would specify which namespace this policy applies to? Does it apply to all resources in that namespace?

@sseago
Collaborator

sseago commented Oct 19, 2023

@anshulahuja98 I think we need resource-level control, since this is something that you'd probably only want for specific resources that:

  1. you expect update/patch to be problematic (immutable fields, etc.)
  2. you know deleting and recreating won't break anything

Whether you also need namespace-level control is another question. We already have design work done in terms of proposing an API for resource-level control. I'm not sure namespace-scoped is any simpler than resource-scoped from an implementation/API point of view; it's just a different filter. Implementing both, however, would be more complicated.

Thinking through the sorts of use cases where you'd need this feature, scoping by namespace seems less useful to me than scoping by resource.

@anshulahuja98
Collaborator

The issue with specifying this at the resource level is the complexity of specifying the correlation between various resources and the order in which to delete them. It is very hard for the end user to figure out which resources will face issues.

Think of it this way: even if we recommend an order without customer input, it's pretty hard to say with certainty that it will work even for core API resources. The investigation and testing required would be complex. Think of this in terms of what the default option for delete and recreate will be, where a basic customer can just proceed.

On the other hand, simply giving a way to delete the namespace is much more likely to succeed, IMO. For example, in a basic scenario such as PVCs, a PVC getting deleted along with the NS will lead to the PV being deleted automatically.

@anshulahuja98
Collaborator

Or, if we want to do this at the resource level, maybe we should do it in two passes: first trigger deletion for all resources we will restore, and then do the actual restore.

If we delete and restore one by one, it is likely resources will get stuck in deletion.

If we do a full delete pass beforehand, it is likely that all dependent resources will have finished deleting.

If we take an approach such as this, I am okay with going with it as compared to the NS approach.

Maybe @kaovilai can evaluate the feasibility of the two-pass approach in the current setup.

@anshulahuja98
Collaborator

I'll be on vacation for some time; I will follow up on Wednesday if there are any further discussions.

@sseago
Collaborator

sseago commented Oct 19, 2023

@anshulahuja98 The problem with the first pass is that we only want to delete resources that fail the patch. So for resources that don't have immutable fields, we'd just patch as with Update -- Delete was really just intended for resources that fail to patch, so that's the biggest problem with the two-pass approach. The second problem is that the two-pass approach would require a significant change to restore logic. We'd essentially have to run the restore logic twice, calling all RIAs (RestoreItemActions), etc., once without creating (only deleting), then again with creating. Calling RIAs would be required before the delete pass, because it's possible for an RIA to discard a resource -- if we didn't call the RIA, we wouldn't know that; we'd delete the resource, but then the second pass with RIAs would discard it and not restore it.

@sseago
Collaborator

sseago commented Oct 19, 2023

But even if we're just doing "every item per namespace," we still have the problem of dependent resources, since again we're only deleting things that we can't patch. That means there's a relatively small set of resources that will be deleted and recreated, and things could get stuck on deletion depending on how we implement this -- do we remove finalizers or not, for example. But this is where the resource-specific approach could actually reduce these errors, since we can avoid applying this new policy to resource types that are known to be problematic to delete and recreate (due to dependencies, etc.).

@kaovilai
Contributor

kaovilai commented Oct 31, 2023

If we delete and restore one by one, it is likely resources will get stuck in deletion.

If we do a full delete pass beforehand, it is likely that all dependent resources will have finished deleting.

If we take an approach such as this, I am okay with going with it as compared to the NS approach.

Maybe @kaovilai can evaluate the feasibility of the two-pass approach in the current setup.

I see where you're coming from with this. I may have been thinking of a very small subset of resource kinds with immutable fields, such as pods. The expectation is that you only specify the kinds you know will delete/create cleanly. Perhaps a backup used with this restore policy would contain only pods, for instance.

But thinking in terms of a backup with multiple resource types, scoping by NS is more likely to succeed.

What if this option is only available when you restore a limited number of namespaces and you specify which resources it applies to? We can give an explicit example of pods being a kind that we know needs this, as sketched below.
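
For illustration, that combination might look something like this, reusing the hypothetical override field sketched earlier (includedNamespaces and existingResourcePolicy are existing Restore fields; existingResourcePolicyOverrides and the recreate value are not yet implemented):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-scoped-recreate
  namespace: velero
spec:
  backupName: my-backup
  # Restrict the restore to a small, known set of namespaces...
  includedNamespaces:
    - my-app
  existingResourcePolicy: update
  # ...and only allow recreate for kinds known to delete/create cleanly, e.g. pods.
  existingResourcePolicyOverrides:
    - resourceType: pods
      policy: recreate
```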

@reasonerjt added the 2024 Q1 reviewed label and removed the 1.13-candidate label on Jan 24, 2024
@reasonerjt added this to the v1.14 milestone on Feb 28, 2024
@kaovilai
Contributor

kaovilai commented Mar 4, 2024

This issue might need re-titling, since our discussions concluded that the original ask may not be that useful.

https://kubernetes.slack.com/archives/C021GPR1L3S/p1709576872952979?thread_ts=1709329227.490779&cid=C021GPR1L3S

https://hackmd.io/KKWcslVSR3yysufvC76MHg#Solutions-for-usecase

@Elyytscha

Hello, just wanted to note that we ran into this while developing and testing our backup/restore strategy.

We had to first delete the persistent volume claims from the namespace where we wanted to restore them in order to get them restored with the data we wanted from the backup.

So when the PVCs still exist and their data is broken, and you run a velero restore from the backup, you will not get your PVC data restored; we had to delete the PVCs first to achieve this.

@kaovilai
Contributor

you will not get your PVC data restored; we had to delete the PVCs first to achieve this

Expected. Your use case should be covered by #7481.
