[BUG] Longhorn Snapshots are not deleted after expired Backups (Velero) #6179
Firstly, there are a couple of points I would like to highlight about your setup:

The value "bak" tells the Longhorn driver to do an actual "backup" when a CSI snapshot is taken. This was the default behavior of the Longhorn CSI driver until version 1.3. Since then, there is a different value you can use called "snap", which causes the CSI driver to take a real "snapshot" without triggering data movement. Just wanted to mention it in case you want to use this feature. See https://longhorn.io/docs/1.4.1/snapshots-and-backups/csi-snapshot-support/csi-volume-snapshot-associated-with-longhorn-snapshot/ for details.

Now, coming to the actual snapshot deletion: if the VolumeSnapshot and VolumeSnapshotContent resources are gone but the storage snapshots remain, the most probable cause is an issue with the CSI driver. You should check the Longhorn CSI driver logs and verify whether there are any messages corresponding to the VolumeSnapshotContent that was deleted. You can also try to reproduce the problem by creating a VolumeSnapshot manually and then deleting it to see what happens.

We, at CloudCasa, have seen snapshot deletion issues with Longhorn, but the driver version was pre-1.3. Are you using 1.4.1? Thanks, Raghu (https://cloudcasa.io).
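To make that manual reproduction concrete, here is a minimal sketch. The snapshot name is a placeholder, the manifest mirrors the one posted later in this thread, and the label selector for the CSI plugin pods is an assumption about the default Longhorn install; the kubectl steps require access to the affected cluster and are shown commented out:

```shell
# Write a minimal VolumeSnapshot manifest to reproduce the deletion issue.
# Names/namespaces are placeholders taken from this thread.
cat > /tmp/snapshot-test.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot-delete-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
EOF

# Against a live cluster you would then run (commented out here):
# kubectl apply -f /tmp/snapshot-test.yaml
# kubectl delete volumesnapshot snapshot-delete-test -n harbor
# kubectl logs -n longhorn-system -l app=longhorn-csi-plugin -c longhorn-csi-plugin

# Sanity-check the generated manifest.
grep -c 'kind: VolumeSnapshot' /tmp/snapshot-test.yaml
```

After deleting the VolumeSnapshot, the CSI driver logs should show a DeleteSnapshot request for the corresponding VolumeSnapshotContent.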
@draghuram Thanks for your tips! 👍🏽 The same issue occurs when I create a VolumeSnapshot manually and then delete it. In the logs I can't find any useful information.

VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak
```

VolumeSnapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
```

Remove the

We use Longhorn v1.4.1 and the velero-plugin-for-csi:v0.5.0.
Interesting. From the logs, it does seem that the deletion logic is kicking in, and I even see the attempt to remove the finalizer. Can you post the VolumeSnapshot YAML after the deletion? I want to see what finalizers are listed there.
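For reference, the finalizers on a VolumeSnapshot that is stuck in deletion can be dumped like this (the snapshot name and namespace are from the manifest posted earlier in this thread; requires kubectl access to the cluster):

```shell
kubectl get volumesnapshot new-snapshot-test -n harbor -o yaml
# Or just the finalizers and the deletion timestamp:
kubectl get volumesnapshot new-snapshot-test -n harbor \
  -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'
```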
@draghuram When I create a backup from a Velero schedule I can't see any

Here is my backup schedule:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: harbor-daily-0200
  namespace: velero # Must be the namespace of the Velero server
spec:
  schedule: 0 0 * * * # IMPORTANT: Velero Pod has UTC time so CH-Time -2h
  template:
    includedNamespaces:
    - 'harbor'
    includedResources:
    - '*'
    snapshotVolumes: true
    storageLocation: minio
    volumeSnapshotLocations:
    - longhorn
    ttl: 168h0m0s # 7 days retention
    defaultVolumesToRestic: false
    hooks:
      resources:
      - name: postgresql
        includedNamespaces:
        - 'harbor'
        includedResources:
        - pods
        excludedResources: []
        labelSelector:
          matchLabels:
            statefulset.kubernetes.io/pod-name: harbor-database-0
        pre:
        - exec:
            container: database
            command:
            - /bin/bash
            - -c
            - "psql -U postgres -c \"CHECKPOINT\";"
            onError: Fail
            timeout: 30s
```
Today I upgraded Longhorn from v1.4.1 to v1.4.2 and the issue still occurs. 😔😔
Today I noticed that I have deployed a snapshot controller as described in this documentation.

VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak
```

VolumeSnapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
```

Remove the

The same issue occurs, but there are some interesting logs:
After I upgraded to the newest RKE2 Helm charts, the "finalizers" error log mentioned above no longer appears, but the issue still occurs. I upgraded the following Helm releases:
Here are the log messages:

The following error looks interesting:
@draghuram do you have any idea what could be the root cause of the issue?
Yes, that error message looks interesting. Following that line are these two lines:

So I guess the driver is getting the request to delete the snapshot, though nothing else seems to happen. I am going to do the same test and see what other logs are produced by the driver. In the meantime, can you use "type: snap" in the volume snapshot class and redo the test?

I must also note that the main problem appears to be in either the CSI driver or Longhorn itself. From Velero's point of view, it is issuing the delete request. So you may have better luck pursuing this in the Longhorn forums by describing how you deleted the VolumeSnapshot and how it didn't delete the Longhorn snapshot.
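For reference, a VolumeSnapshotClass that takes in-cluster snapshots sets `type: snap` instead of `bak`. This is a sketch based on the Longhorn CSI snapshot documentation linked earlier; the class name here is arbitrary:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snap   # arbitrary name for this example
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap            # in-cluster Longhorn snapshot, no backup to S3
```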
@draghuram thanks with
@R-Studio, in our tests, we see that Longhorn snapshots are deleted as expected. You should use "type: snap" and then decide what type of Velero backups you want. The most basic is snapshots, which you are already doing. The second option is to do file system backups, which will transfer the contents of Longhorn PVs to object storage; but they read data from the live PV. A new feature is coming in 1.12 (slated to be released by the end of August) that will take a PV snapshot first and then back up the data in the snapshot to object storage.
@draghuram thanks for your help, but as I said, with type: snap Velero only triggers a Longhorn snapshot and not a Longhorn backup. The difference is that a Longhorn snapshot is only stored locally, while a Longhorn backup is written to an external S3 bucket. If I understand Velero, the CSI plugin, and Longhorn correctly, I would expect Velero to store the Kubernetes manifests (YAMLs), create CSI snapshots through the plugin, and Longhorn to notice the CSI snapshot and create the Longhorn backup. Or have I misunderstood something?
1. With type: bak, there is no "CSI snapshot" of a Longhorn volume as conventionally defined. Longhorn uses this standard request to instead run its own volume backup to S3.
2. With type: snap, the Longhorn CSI driver implements the traditional CSI snapshot, but it isn't running a backup to S3.
3. Velero's own FSB ignores Longhorn CSI mechanisms altogether and backs up to S3.
4. Velero 1.12's new data mover could help by mounting a snapshot created through type: snap and running a backup to S3.
5. Unfortunately, the Velero 1.12 functionality is not ideal either, because Longhorn doesn't support mounting a snapshot without initiating a full data retrieval process behind the scenes.
I updated to Velero v1.12.2, using velero-plugin-for-csi v0.6.2 and velero-plugin-for-aws v1.8.2, but still the same issue.
@Satsank thanks for your inputs. Point 5 caught my attention. I'm having the same issue and tested the Velero 1.12.2 data mover as a workaround. The overall goal is to reduce data consumption on Longhorn by moving snapshot/backup data directly to S3. My tests failed because Velero wasn't able to mount the snapshots; see the error below. My assumption is that this is because Longhorn snapshots are incremental and can therefore not simply be mounted. Is there a way to make this work with Longhorn?
@lucatr PVCs can be created from Longhorn snapshots and mounted on data mover pods. We do something similar in CloudCasa and it works. However, the point is that when a PVC is created from a Longhorn snapshot, Longhorn creates a new volume and "copies" data from the snapshot, which is very inefficient because the PVC is going to be deleted as soon as the backup is done. The best one can do is to configure minimum replicas (1, really) and minimize copying. CloudCasa does provide this option, but I really hope that Longhorn optimizes creation of volumes from snapshots by totally eliminating the copy. Having said that, the error in your case seems to be different, so it may be better to open a separate issue. When exactly are you seeing this error? Attaching a Velero backup bundle may be useful. Feel free to contact us at CloudCasa, as we have a lot of experience with Longhorn PV backups.
@draghuram thanks for the feedback. When I kick off the Velero backup from the schedule [1], a snapshot is created successfully and its status changes to "ready to use". The PVC is created as well; events say "successfully provisioned". The PV looks fine too, and the same is true for the volume in the Longhorn GUI (it says healthy, ready). But the snapshot-exposer pods are stuck in ContainerCreating status. The pod events show the same error about mounting issues that I also see in the longhorn-csi-plugin pod logs. It's stuck like this for about 30 minutes before the pods are killed and the backup is marked as PartiallyFailed in Velero. I'm not sure what the Velero backup bundle is, or what other logs might be interesting in this case. As suggested, I'll go ahead and create a separate issue for this later this week. [1]
Hi everyone! I am from the Longhorn team. It has been a great discussion so far in this thread, and I would like to join the conversation. First of all, as others have already mentioned, a CSI VolumeSnapshot (a Kubernetes upstream CRD) can be associated with either a Longhorn snapshot (which lives inside the cluster) or a Longhorn backup (which lives outside of the cluster in an S3 endpoint). For example, the CSI VolumeSnapshot created by this VolumeSnapshotClass corresponds to a Longhorn snapshot (link):

and the CSI VolumeSnapshot created by this VolumeSnapshotClass corresponds to a Longhorn backup (link):

CSI VolumeSnapshot of
Hi @PhanLe1010, thanks for the detailed information. It is very helpful. I personally think it is better not to use "type: bak" snapshots as a way of backup, because this is Longhorn specific. One may easily have multiple clusters/CSI drivers and, ideally, you need a unified backup strategy (such as the one provided by Velero or CloudCasa) that works across different storage types. In that respect, all you need from the storage is an efficient way to snapshot PVs and also to create PVs from snapshots. I think Longhorn already took a step in this direction by implementing "true" snapshots (starting from 1.3). It would be nice if the copy could be avoided when a PVC is created from a snapshot, but it looks like that is not on the roadmap?
@PhanLe1010 Thanks for your great explanation.
Hi @draghuram, I see your point about the unified backup strategy 👍. This becomes a choice for users between a unified solution and a native solution, with a tradeoff between convenience and performance. Btw, can this unified backup strategy back up/restore a volume in block mode currently?

This will require a big effort from the Longhorn side. Could you create a GitHub ticket at https://github.com/longhorn/longhorn/issues/new/choose so that the Longhorn PM can evaluate whether they want to proceed?
Hi @R-Studio

I think there is a bit of a misunderstanding here. When using the

You can create a GitHub ticket at https://github.com/longhorn/longhorn/issues/new/choose. The idea for the improvement may be:
@PhanLe1010 Anyway, thanks for the hint; now we are able to save more storage space.
Hi @R-Studio

Sorry for the mistake! Yes, it should be the snapshot-delete recurring job instead of the snapshot-cleanup recurring job.

I see the confusion. Yes, the volume detail page (the first picture) only shows Longhorn backups that have an existing Longhorn snapshot. The backup page (the second picture) shows all Longhorn backups.
@PhanLe1010 Yes, Velero does support backups of BLOCK-type PVs (CloudCasa contributed code for that recently). I will open a GitHub request for copy-less creation of PVCs from Longhorn snapshots.
Awesome! Thanks @draghuram!
Hi @draghuram, have you created the ticket on the Longhorn repo yet?
Just opened the feature request: longhorn/longhorn#7794.
Describe the bug (🐛 if you encounter this issue)

We are using Velero to create backups of the Kubernetes manifests and the persistent volumes (in our example we back up Harbor). When we create a backup, Velero saves the K8s manifests to an object storage (MinIO) and creates snapshot resources to trigger Longhorn backups with the velero-plugin-for-csi. Longhorn writes the backups to another MinIO bucket.

If we delete a Velero backup, or the backup expires, the snapshots (snapshots.longhorn.io) are not deleted.

We are using Velero v1.9.4 with the EnableCSI feature and the following plugins:

We have the same issue in Velero v1.11.0 with the EnableCSI feature and the following plugins:

To Reproduce
Steps to reproduce the behavior:
1. Create a backup from the Schedule (see below): velero backup create --from-schedule harbor-daily-0200
2. Delete the backup: velero backup delete <BACKUPNAME>
3. The Longhorn snapshot (snapshots.longhorn.io) is not deleted.

Expected behavior
The snapshot is deleted.
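One way to confirm the leftover Longhorn snapshot objects after the Velero backup has been deleted (a sketch; the CRD name is from this report, and the commands assume kubectl access to the cluster):

```shell
# List Longhorn snapshot CRs that should have been removed with the backup.
kubectl get snapshots.longhorn.io -n longhorn-system
# Cross-check against the CSI-level objects, which are gone in this report:
kubectl get volumesnapshotcontents
```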
Environment
Features (velero client config get features):

Additional context
Velero Backup Schedule for Harbor
VolumeSnapshotClass
VolumeSnapshotClass
In our second cluster, with Velero v1.11.0 installed, we created the following resource (but same issue here):
VolumeSnapshotLocation