[BUG] Longhorn Snapshots are not deleted after expired Backups (Velero) #6179
Firstly, there are a couple of points I would like to highlight about your setup:

The value "bak" tells the Longhorn driver to do an actual "backup" when a CSI snapshot is taken. This was the default behavior of the Longhorn CSI driver until version 1.3. Since then, there is a different value you can use called "snap", which causes the CSI driver to take a real "snapshot" without triggering data movement. Just wanted to mention it in case you want to use this feature. See https://longhorn.io/docs/1.4.1/snapshots-and-backups/csi-snapshot-support/csi-volume-snapshot-associated-with-longhorn-snapshot/ for details.

Now, coming to the actual snapshot deletion: if the VolumeSnapshot and VolumeSnapshotContent resources are gone but the storage snapshots remain, the most probable cause is an issue with the CSI driver. You should check the Longhorn CSI driver logs and verify whether there are any messages corresponding to the VolumeSnapshotContent that was deleted. You can also try to reproduce the problem by creating a VolumeSnapshot manually and then deleting it to see what happens.

We, at CloudCasa, have seen snapshot deletion issues with Longhorn, but the driver version was pre-1.3. Are you using 1.4.1? Thanks, Raghu (https://cloudcasa.io).
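To make that manual reproduction concrete, here is a minimal sketch. The snapshot name is a placeholder, the manifest mirrors the one posted later in this thread, and the label selector for the CSI plugin pods is an assumption about the default Longhorn install; the kubectl steps require access to the affected cluster and are shown commented out:

```shell
# Write a minimal VolumeSnapshot manifest to reproduce the deletion issue.
# Names/namespaces are placeholders taken from this thread.
cat > /tmp/snapshot-test.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot-delete-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
EOF

# Against a live cluster you would then run (commented out here):
# kubectl apply -f /tmp/snapshot-test.yaml
# kubectl delete volumesnapshot snapshot-delete-test -n harbor
# kubectl logs -n longhorn-system -l app=longhorn-csi-plugin -c longhorn-csi-plugin

# Sanity-check the generated manifest.
grep -c 'kind: VolumeSnapshot' /tmp/snapshot-test.yaml
```

After deleting the VolumeSnapshot, the CSI driver logs should show a DeleteSnapshot request for the corresponding VolumeSnapshotContent.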
@draghuram Thanks for your tips! 👍🏽 The same issue occurs when I create a VolumeSnapshot manually and then delete it. In the logs I can't find any useful information.

VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak
```

VolumeSnapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
```

Remove the

We use Longhorn v1.4.1 and the velero-plugin-for-csi:v0.5.0.
Interesting. From the logs, it does seem that the deletion logic is kicking in, and I even see the attempt to remove the finalizer. Can you post the VolumeSnapshot YAML after the deletion? I want to see what finalizers are listed there.
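For reference, the finalizers on a VolumeSnapshot that is stuck in deletion can be dumped like this (the snapshot name and namespace are from the manifest posted earlier in this thread; requires kubectl access to the cluster):

```shell
kubectl get volumesnapshot new-snapshot-test -n harbor -o yaml
# Or just the finalizers and the deletion timestamp:
kubectl get volumesnapshot new-snapshot-test -n harbor \
  -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'
```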
@draghuram When I create a backup from a Velero schedule I can't see any

Here is my backup schedule:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: harbor-daily-0200
  namespace: velero # Must be the namespace of the Velero server
spec:
  schedule: 0 0 * * * # IMPORTANT: Velero Pod has UTC time so CH-Time -2h
  template:
    includedNamespaces:
    - 'harbor'
    includedResources:
    - '*'
    snapshotVolumes: true
    storageLocation: minio
    volumeSnapshotLocations:
    - longhorn
    ttl: 168h0m0s # 7 days retention
    defaultVolumesToRestic: false
    hooks:
      resources:
      - name: postgresql
        includedNamespaces:
        - 'harbor'
        includedResources:
        - pods
        excludedResources: []
        labelSelector:
          matchLabels:
            statefulset.kubernetes.io/pod-name: harbor-database-0
        pre:
        - exec:
            container: database
            command:
            - /bin/bash
            - -c
            - "psql -U postgres -c \"CHECKPOINT\";"
            onError: Fail
            timeout: 30s
```
Today I upgraded Longhorn from v1.4.1 to v1.4.2 and the issue still occurs. 😔😔
Today I noticed that I have deployed a snapshot controller as described in this documentation.

VolumeSnapshotClass:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak
```

VolumeSnapshot:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice
```

Remove the

The same issue occurs, but there are some interesting logs:
After I upgraded to the newest RKE2 Helm charts, the "finalizers" error log mentioned above no longer appears, but the issue still occurs. I upgraded the following Helm releases:
Here are the log messages:

The following error looks interesting:
@draghuram do you have any idea what could be the root cause of the issue?
Yes, that error message looks interesting. Following that line are these two lines:

So I guess the driver is getting the request to delete the snapshot, though nothing else seems to happen. I am going to do the same test and see what other logs are produced by the driver. In the meantime, can you use "type: snap" in the volume snapshot class and redo the test?

I must also note that the main problem appears to be in either the CSI driver or Longhorn itself. From Velero's point of view, it is issuing the delete request. So you may have better luck pursuing this in the Longhorn forums by describing how you deleted the VolumeSnapshot and how it didn't delete the Longhorn snapshot.
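For reference, a VolumeSnapshotClass that takes in-cluster snapshots sets `type: snap` instead of `bak`. This is a sketch based on the Longhorn CSI snapshot documentation linked earlier; the class name here is arbitrary:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snap   # arbitrary name for this example
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap            # in-cluster Longhorn snapshot, no backup to S3
```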
@draghuram thanks with
@R-Studio, in our tests, we see that Longhorn snapshots are deleted as expected. You should use "type: snap" and then decide what type of Velero backups you want. The most basic is snapshots, which you are already doing. The second option is to do file system backups, which will transfer the contents of Longhorn PVs to object storage; but they read data from the live PV. A new feature is coming in 1.12 (slated to be released by the end of August) that will take a PV snapshot first and then back up the data in the snapshot to object storage.
@draghuram thanks for your help, but as I said, with type: snap Velero only triggers a Longhorn snapshot and not a Longhorn backup. The difference is that a Longhorn snapshot is only stored locally, while a Longhorn backup is written to an external S3 bucket. If I understand Velero, the CSI plugin, and Longhorn correctly, I would expect Velero to store the Kubernetes manifests (YAMLs), create CSI snapshots through the plugin, and Longhorn to notice the CSI snapshot and create the Longhorn backup. Or have I misunderstood something?
1. With type: bak, there is no "CSI snapshot" of a Longhorn volume as conventionally defined. Longhorn uses this standard request to instead run its own volume backup to S3.
2. With type: snap, the Longhorn CSI driver implements the traditional CSI snapshot, but it isn't running a backup to S3.
3. Velero's own FSB ignores Longhorn CSI mechanisms altogether and backs up to S3.
4. Velero 1.12's new data mover could help by mounting a snapshot created through type: snap and running a backup to S3.
5. Unfortunately, the Velero 1.12 functionality is not ideal either, because Longhorn doesn't support mounting a snapshot without initiating a full data retrieval process behind the scenes.
I updated to Velero v1.12.2, using velero-plugin-for-csi v0.6.2 and velero-plugin-for-aws v1.8.2, but still the same issue.
@Satsank thanks for your inputs. Point 5 caught my attention. I'm having the same issue and tested the Velero 1.12.2 data mover as a workaround. The overall goal is to reduce data consumption on Longhorn by moving snapshot/backup data directly to S3. My tests failed because Velero wasn't able to mount the snapshots; see the error below. My assumption is that this is because Longhorn snapshots are incremental and can therefore not simply be mounted. Is there a way to make this work with Longhorn?
@lucatr PVCs can be created from Longhorn snapshots and mounted on data mover pods. We do something similar in CloudCasa and it works. However, the point is that when a PVC is created from a Longhorn snapshot, Longhorn creates a new volume and "copies" data from the snapshot, which is very inefficient because the PVC is going to be deleted as soon as the backup is done. The best one can do is to configure minimum replicas (1, really) and minimize copying. CloudCasa does provide this option, but I really hope that Longhorn optimizes creation of volumes from snapshots by totally eliminating the copy. Having said that, the error in your case seems to be different, so it may be better to open a separate issue. When exactly are you seeing this error? Attaching a Velero backup bundle may be useful. Feel free to contact us at CloudCasa, as we have a lot of experience with Longhorn PV backups.
@draghuram thanks for the feedback. When I kick off the Velero backup from the schedule [1], a snapshot is created successfully and its status changes to "ready to use". The PVC is created as well; events say "successfully provisioned". The PV looks fine too, and the same is true for the volume in the Longhorn GUI (it says healthy, ready). But the snapshot-exposer pods are stuck in ContainerCreating status. The pod events show the same error about mounting issues that I also see in the longhorn-csi-plugin pod logs. It's stuck like this for about 30 minutes before the pods are killed and the backup is marked as PartiallyFailed in Velero. I'm not sure what the Velero backup bundle is, or what other logs might be interesting in this case. As suggested, I'll go ahead and create a separate issue for this later this week. [1]
Hi everyone! I am from the Longhorn team. It has been a great discussion so far in this thread, and I would like to join the conversation. First of all, as others have already mentioned, a CSI VolumeSnapshot (a Kubernetes upstream CRD) can be associated with either a Longhorn snapshot (which lives inside the cluster) or a Longhorn backup (which lives outside of the cluster in an S3 endpoint). For example, the CSI VolumeSnapshot created by this VolumeSnapshotClass corresponds to a Longhorn snapshot (link):

and the CSI VolumeSnapshot created by this VolumeSnapshotClass corresponds to a Longhorn backup (link):

CSI VolumeSnapshot of
Hi @PhanLe1010, thanks for the detailed information. It is very helpful. I personally think it is better not to use "type: bak" snapshots as a way of backup, because this is Longhorn specific. One may easily have multiple clusters/CSI drivers and, ideally, you need a unified backup strategy (such as the one provided by Velero or CloudCasa) that works across different storage types. In that respect, all you need from the storage is an efficient way to snapshot PVs and also to create PVs from snapshots. I think Longhorn already took a step in this direction by implementing "true" snapshots (starting from 1.3). It would be nice if the copy could be avoided when a PVC is created from a snapshot, but it looks like that is not on the roadmap?
@PhanLe1010 Thanks for your great explanation.
Hi @draghuram, I see your point about the unified backup strategy 👍. This becomes a choice for users between a unified solution and a native solution, with a tradeoff between convenience and performance. Btw, can this unified backup strategy back up/restore a volume in block mode currently?

This will require a big effort from the Longhorn side. Could you create a GitHub ticket at https://github.com/longhorn/longhorn/issues/new/choose so that the Longhorn PM can evaluate whether they want to proceed?
Hi @R-Studio

I think there is a bit of a misunderstanding here. When using the

You can create a GitHub ticket at https://github.com/longhorn/longhorn/issues/new/choose. The idea for the improvement may be:
@PhanLe1010 Anyway, thanks for the hint; now we are able to save more storage space.
Hi @R-Studio

Sorry for the mistake! Yes, it should be the snapshot-delete recurring job instead of the snapshot-cleanup recurring job.

I see the confusion. Yes, the volume detail page (the first picture) only shows Longhorn backups that have an existing Longhorn snapshot. The backup page (the second picture) shows all Longhorn backups.
@PhanLe1010 Yes, Velero does support backups of BLOCK-type PVs (CloudCasa contributed code for that recently). I will open a GitHub request for copy-less creation of PVCs from Longhorn snapshots.
Awesome! Thanks @draghuram!
Hi @draghuram, have you created the ticket on the Longhorn repo yet?
Just opened the feature request: longhorn/longhorn#7794.
Describe the bug (🐛 if you encounter this issue)

We are using Velero to create backups of the Kubernetes manifests and the persistent volumes (in our example we back up Harbor). When we create a backup, Velero saves the K8s manifests to an object storage (MinIO) and creates snapshot resources to trigger Longhorn backups with the velero-plugin-for-csi. Longhorn writes the backups to another MinIO bucket.

If we delete a Velero backup, or the backup expires, the snapshots (snapshots.longhorn.io) are not deleted.

We are using Velero v1.9.4 with the EnableCSI feature and the following plugins:

We have the same issue in Velero v1.11.0 with the EnableCSI feature and the following plugins:

To Reproduce
Steps to reproduce the behavior:
1. Create a backup from the Schedule (see below): velero backup create --from-schedule harbor-daily-0200
2. Delete the backup: velero backup delete <BACKUPNAME>
3. The Longhorn snapshot (snapshots.longhorn.io) is not deleted.

Expected behavior
The snapshot is deleted.
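One way to confirm the leftover Longhorn snapshot objects after the Velero backup has been deleted (a sketch; the CRD name is from this report, and the commands assume kubectl access to the cluster):

```shell
# List Longhorn snapshot CRs that should have been removed with the backup.
kubectl get snapshots.longhorn.io -n longhorn-system
# Cross-check against the CSI-level objects, which are gone in this report:
kubectl get volumesnapshotcontents
```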
Environment
Features (velero client config get features):

Additional context
Velero Backup Schedule for Harbor
VolumeSnapshotClass
VolumeSnapshotClass
In our second cluster, with Velero v1.11.0 installed, we created the following resource (but same issue here):
VolumeSnapshotLocation