Velero schedule is replaced on every helm upgrade #298

Closed
branttaylor opened this issue Sep 8, 2021 · 25 comments · Fixed by #490
Labels: bug (Something isn't working), velero

@branttaylor

What steps did you take and what happened:

Every time we run a helm upgrade in CI, the schedule is replaced due to this:
https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/templates/schedule.yaml#L12
This causes Velero to run a new backup every time our CI pipeline runs.
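For reference, a rough sketch of what that template renders, with the hook annotations that cause the delete-and-recreate behavior (illustrative only, not the chart file verbatim; names and cron expression are made up):

# Sketch of a Schedule rendered from charts/velero/templates/schedule.yaml (not verbatim)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup                       # illustrative schedule name
  annotations:
    # These hooks make helm treat the Schedule as a hook resource: it is
    # deleted and recreated on every install/upgrade/rollback, and Velero
    # then sees a brand-new Schedule and immediately runs a backup.
    "helm.sh/hook": post-install,post-upgrade,post-rollback
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  schedule: "0 2 * * *"                    # illustrative cron expression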

What did you expect to happen:

The schedule should not be replaced every time, or there should at least be a way to prevent a new backup running when a new schedule resource is created.

Environment:

  • helm version (use helm version): 3.3.0
  • helm chart version and app version (use helm list -n <YOUR NAMESPACE>): chart: velero-2.23.6 app version: 1.6.3
  • Kubernetes version (use kubectl version): 1.20.9
  • Kubernetes installer & version: RKE 1.2.11
  • Cloud provider or hardware configuration: vSphere 6.7
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS
@jenting added the velero label Sep 9, 2021
@jenting (Collaborator) commented Sep 9, 2021

From my perspective, this should be fixed in Velero core; see vmware-tanzu/velero#1980.
Once that lands, Velero won't trigger a new backup immediately whenever a new Schedule CR is applied; a backup will only run when the cron schedule fires.

@jenting added the wontfix (This will not be worked on) label Sep 9, 2021
@branttaylor (Author)

Thank you @jenting, I see the referenced issue is nearly 2 years old. Hopefully it gets resolved soon.

@jenting (Collaborator) commented Sep 9, 2021

> Thank you @jenting, I see the referenced issue is nearly 2 years old. Hopefully it gets resolved soon.

I saw it is scheduled for v1.8.0; hopefully it will be fixed in that release.

@javanthropus

I think the fact that the Schedule resource is deleted and then re-created is actually causing loss of the backup resources tied to it when using the schedule's useOwnerReferencesInBackup setting under ArgoCD. In this case the backup's owner is deleted, and if that alone is not enough to trigger removal of the backup resource, Argo will then identify the backup as something it should manage, see that the chart itself does not create such a resource, and prune it.
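For context, a minimal sketch of how that setting is typically supplied through the chart's schedules values (the schedule name, cron expression, and ttl below are illustrative):

# values.yaml (sketch): each backup created by this schedule carries an
# ownerReference pointing at the Schedule, so recreating the Schedule leaves
# existing backups referencing an owner UID that no longer exists.
schedules:
  daily:
    useOwnerReferencesInBackup: true
    schedule: "0 2 * * *"
    template:
      ttl: "240h"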

In other words, the identified helm hook does not make sense for schedule resources. Why is that being done in the first place?

@jenting (Collaborator) commented Sep 23, 2021

> In other words, the identified helm hook does not make sense for schedule resources. Why is that being done in the first place?

The problem we originally wanted to solve is that helm upgrade does not upgrade the CRD manifests.
That caused a lot of issues where users ran old CRDs against a new Velero core or new CRs.

After digging through the helm docs, the best approach we could come up with was to create a job that upgrades the CRDs,
make sure that job finishes, and only then apply the Backup/Restore/Schedule CRs to the cluster.
We did not notice that this would cause the Backup/Restore/Schedule CRs to be recreated.

As far as I know, ArgoCD has made some enhancements to how it manages helm charts, and we did hit some issues with it before. However, it's really hard to keep the chart compatible with both the native helm command and GitOps tools (ArgoCD, FluxCD, etc.). What we can do is focus on making the native helm command work correctly and verify that in the CI pipeline.

@javanthropus

ArgoCD, unlike helm, will perform CRD upgrades unconditionally. It uses helm template to render all the manifests in order to apply them itself, so it doesn't have helm's CRD upgrade limitation.

Probably the only way to handle this for both native helm and ArgoCD is to offer an additional chart that only supplies the CRDs as regular templates: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#method-2-separate-charts. I think doing this and making that new chart a dependency of this one would solve the CRD upgrade issue without the need to use hooks at all.
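For illustration, a rough sketch of that layout, assuming a hypothetical CRDs-only chart named velero-crds (not something this repo ships today):

# Chart.yaml of the main velero chart (sketch): the CRDs live in their own
# chart, declared as a dependency here, so helm (and ArgoCD) upgrade them as
# regular templates instead of relying on hooks.
apiVersion: v2
name: velero
version: x.y.z
dependencies:
  - name: velero-crds                                       # hypothetical CRDs-only chart
    version: x.y.z
    repository: https://vmware-tanzu.github.io/helm-charts  # assumed location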

@jenting (Collaborator) commented Sep 23, 2021

Thanks for the suggestion; we did consider splitting out a separate chart before, but I can't remember the main reason we didn't choose that solution.

Anyway, for this chart we want to make sure native helm works well; for GitOps tools, we'll try our best to support them.

@javanthropus

Regardless of the method used to operate helm, the current hook will cause backups associated with a schedule to be deleted when the chart is re-applied for any reason if the useOwnerReferencesInBackup setting is included. We need to use that setting to prevent ArgoCD from nuking the backups itself, so we're stuck with building a wrapper chart that provides the schedule resources itself without any of the hooks defined.

Would it be possible to add a feature that allows disabling all the hooks used to work around helm's CRD management issues? That would satisfy our needs for using this chart with ArgoCD and should be pretty simple to implement without risking the current behavior for helm native users.

@jenting (Collaborator) commented Sep 23, 2021

Sure, we did have that before, but after implementing the CRD upgrade job we removed the setting. We could add it back.

@javanthropus

That would be great! Thanks.

@jenting added the enhancement (New feature or request) label and removed the wontfix label Sep 25, 2021
@jenting added the good first issue (Good for newcomers) label Oct 13, 2021
@jsnider-mtu

I've just run into this issue and now have backups that exist in S3, but the owner ref has a different UID than the newly created schedule, causing Velero to continuously find them and attempt to sync. Is there any way to rescue those backups, or is that a lost cause?

@jsnider-mtu

I was able to modify the velero-backup.json file in S3 and change the UID of the owner schedule to match the newly created one. They are now syncing correctly.
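For anyone hitting the same problem, a hedged sketch of the fragment involved (the stored file is JSON; it is shown as YAML here for readability, and the names and UID are illustrative). The uid under metadata.ownerReferences must be changed to the UID of the newly created Schedule, which kubectl get schedule <name> -n velero -o yaml will show:

# velero-backup.json in the backup storage location (sketch of the relevant part)
metadata:
  name: velero-daily-20210908020000                 # illustrative backup name
  ownerReferences:
    - apiVersion: velero.io/v1
      kind: Schedule
      name: velero-daily                            # illustrative schedule name
      uid: 11111111-2222-3333-4444-555555555555     # set this to the new Schedule's UID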

@lucagouty commented Mar 29, 2022

EDIT: OK, now I think I get it: those annotations are just there to make sure these resources are created after the upgrade, right? So the CRDs and other things already exist. But so far, removing them hasn't caused any issue for me.


To be honest I don't see the link with vmware-tanzu/velero#1980

Afaik the recreation of the Schedule on each helm upgrade is because of the two hook annotations, and that fix in Velero core won't change that: as long as those annotations are in the chart, the Schedule will be deleted on every helm upgrade.

When using helm upgrade --debug I can see those two annotations trigger deletions for BackupStorageLocation, VolumeSnapshotLocation and Schedule, but I have no idea why that is necessary:

client.go:299: [debug] Starting delete for "default" BackupStorageLocation
client.go:128: [debug] creating 1 resource(s)
client.go:299: [debug] Starting delete for "default" VolumeSnapshotLocation
client.go:128: [debug] creating 1 resource(s)
client.go:299: [debug] Starting delete for "velero-xxx-daily" Schedule
client.go:128: [debug] creating 1 resource(s)
...

So anyway, here's how I "fixed" it, but as I don't understand why those annotations are there, I'm not sure it's a good idea 😓

I always encapsulate this kind of chart in one of my own:

# /Chart.yaml

apiVersion: v2
name: velero
description: A Helm chart to install a Velero server
version: 1.0.7
appVersion: 1.0.7
dependencies:
  - name: velero
    version: 2.29.1
    repository: https://vmware-tanzu.github.io/helm-charts

Then all I needed to do was copy templates/schedule.yaml into my own templates folder, remove the two hook annotations, and add schedules to the root of my values:

# /values.yaml

schedules:
   # Schedules created by the template without hooks annotations
   some-schedule:
      ...

velero:
   # Velero values, if you set schedules here it will be recreated on each upgrade
   ...

Now it doesn't recreate schedules on each helm upgrade.
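For completeness, a rough sketch of what the copied-and-trimmed template can look like (simplified, not the upstream file verbatim; the fields passed through depend on your values):

# /templates/schedule.yaml -- same shape as the upstream template but without
# the helm.sh/hook annotations, so helm patches the Schedule in place instead
# of deleting and recreating it on every upgrade
{{- range $name, $schedule := .Values.schedules }}
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: {{ $name }}
  namespace: {{ $.Release.Namespace }}
spec:
  schedule: {{ $schedule.schedule | quote }}
  {{- with $schedule.useOwnerReferencesInBackup }}
  useOwnerReferencesInBackup: {{ . }}
  {{- end }}
  template:
    {{- with $schedule.template }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
{{- end }}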

If it's not a problem to remove those annotations, maybe the chart should allow disabling them conditionally? I don't remember seeing hook annotations outside of jobs, tbh.

@javanthropus

We ultimately ended up doing the same thing in a wrapper chart, @lucagouty.

@sotiriougeorge

Wouldn't that be solved if the original chart was modified to allow changing the hooks for the Schedule resource through its values.yaml? Sorry, chart newbie here trying to see if my logic stands.

@javanthropus

Yes. I requested that earlier in this conversation, and @jenting said that the feature could be added back in.

@machine424 (Contributor)

Hello,

We also suffer from this "workaround". As suggested above, another solution for managing CRDs (which we consider simpler and less hacky) is to put them in their own chart, as suggested by Helm (https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#method-2-separate-charts). We have been doing it this way for all our CRDs for almost a year and it works well (we use Helmfile to ease the management of multiple releases).

We host Velero's CRD chart here: https://github.com/wiremind/wiremind-helm-charts/tree/main/charts/velero-crds. If you want, we can move it to this repo.

PS: We're trying to do the same for prometheus-operator's CRDs: prometheus-community/helm-charts#2612

@jenting (Collaborator) commented Nov 24, 2022

@machine424
Thank you for the feedback. I'm curious about the Helmfile setup that manages multiple releases.
Would you mind sharing more details about it?

@machine424 (Contributor)

@jenting

For CRDs, we have a helmfile.yaml file that looks like this:

...
repositories:
  - name: wiremind
    url: https://wiremind.github.io/wiremind-helm-charts

templates:
  common: &common
    namespace: crds
    missingFileHandler: Error

releases:
  - name: velero-crds
    <<: *common
    chart: wiremind/velero-crds
    version: x.y.z
  - name: prometheus-operator-crds
    <<: *common
    chart: wiremind/prometheus-operator-crds
    version: x.y.z
... 

Then a CD Pod (that only has permissions to create/edit CRDs) runs helmfile sync; this is how we manage all our CRDs.

(more about helmfile: https://github.com/helmfile/helmfile)

@qiuming-best (Collaborator) commented Dec 1, 2022

Indeed, BackupStorageLocation, VolumeSnapshotLocation, and Schedule are recreated by the helm install, upgrade, and rollback commands because of the hook "helm.sh/hook": post-install,post-upgrade,post-rollback. I think the intent of these hooks is to enforce an ordering: these resources are recreated after install/upgrade/rollback.

For Velero, it does not matter whether these CRs are created before or after the Velero controller: if the CRs are created before the controller, the controller will list them and handle them; if they are created after the controller, they are still handled by the resource watch mechanism. So the ordering enforced by the hooks does not matter.

We want to separate the CRDs to avoid recreating the CRs; since the recreation is caused by the hooks, in fact we could simply remove them.

@LarsBingBong commented Apr 18, 2023

Any news on this? It would be really awesome to fix. Fixing this would make it possible to avoid a lot of magic, e.g. trying to remove the "helm.sh/hook": post-install,post-upgrade,post-rollback annotation with ytt.

@LarsBingBong

So because of this Helm issue, "Post-renderer not receiving chart hook manifests", this can't even be worked around with e.g. ytt (another VMware-backed project, as you likely know).
So I would have to work around it in a somewhat "sad" way: by disabling the schedule in the Helm values file and then creating the schedule with the Velero CLI. Sad because that means more tools in the air and more logic.
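A minimal sketch of that workaround, assuming the chart's top-level schedules values key (the schedule name and cron expression below are illustrative):

# values.yaml for the velero chart: define no schedules, so the chart renders
# no hook-annotated Schedule resources at all
schedules: {}

# then create the schedule out of band with the Velero CLI, for example:
#   velero schedule create daily-backup --schedule="0 2 * * *"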

What's the holdup with removing the helm.sh hook annotations on Schedule, BackupStorageLocation, and VolumeSnapshotLocation?

Thank you 💯

@jenting added the help wanted (Extra attention is needed) label and removed the good first issue label Aug 6, 2023
@jenting added the bug (Something isn't working) label and removed the enhancement label Aug 6, 2023
@jenting (Collaborator) commented Aug 8, 2023

> Would it be possible to add a feature that allows disabling all the hooks used to work around helm's CRD management issues? That would satisfy our needs for using this chart with ArgoCD and should be pretty simple to implement without risking the current behavior for helm native users.

@javanthropus
Sorry for the late reply. PR #487 adds a toggle to disable all the hooks.

@javanthropus

Thank you, @jenting. I've opened a ticket in my team's issue tracker to take a look at the new chart release and hopefully to start using it.

@jenting self-assigned this Aug 13, 2023
@jenting removed the help wanted (Extra attention is needed) label Aug 13, 2023
@jenting (Collaborator) commented Aug 14, 2023

@LarsBingBong @lucagouty @javanthropus @branttaylor
I created a PR to address it.
Would you mind helping me verify it? Appreciated 🙏
