failed to take snapshot of the volume -- failed with error: error parsing volume id -- should at least contain two # #8122

Closed
sivarama-p-raju opened this issue Aug 17, 2024 · 13 comments
Labels
Area/Cloud/Azure Area/CSI Related to Container Storage Interface support Needs info Waiting for information

Comments

@sivarama-p-raju

What steps did you take and what happened:
When a normal scheduled backup is run, the backup completes with state "PartiallyFailed". On reviewing the description of the backup, the below errors were found repeating many times:

  Velero:    message: /VolumeSnapshotContent snapcontent-b1cf790e-dfd6-4689-9881-ab9d329cea16 has error: Failed to check and update snapshot content: failed to take snapshot of the volume <AZ-RG>-main-dev: "rpc error: code = Internal desc = GetFileShareInfo(<AZ-RG>-main-dev) failed with error: error parsing volume id: \"<AZ-RG>-main-dev\", should at least contain two #"

The volume in question ("<AZ-RG>-main-dev") uses the storageclass "azurefile-csi", but is not a dynamically provisioned volume.

There are other volumes using the same storageclass that are dynamically provisioned; the volume handle of those volumes contains at least two "#", so the requirement appears to be met for them.
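For illustration, the handles of the dynamically provisioned volumes look roughly like this (all values are placeholders; the format is just what I observe):

  <azure-resource-group>#<storage-account-name>#<share-name>#<suffix>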

Is this really a hard requirement? And does this requirement apply only to volumes using the "azurefile-csi" storageclass?

What did you expect to happen:

Expect the backup to complete successfully without the errors.

Anything else you would like to add:
This is on an AKS cluster running Kubernetes v1.29.4. We have a similar issue on multiple AKS clusters.

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.29.4
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): NA

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

blackpiglet commented Aug 19, 2024

I'm a little confused about your scenario.
If a PVC is not dynamically provisioned, how is it related to StorageClass?

For the error, I think there is a limitation on the format of the file share ID in the Azure File CSI code:
https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/ed0c596cf08226abce9091b18e82e9261ed99131/pkg/azurefile/azurefile.go#L474-L481
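A sketch of that check, paraphrased from the linked Go source (not the exact driver code):

package main

import (
	"fmt"
	"strings"
)

// The driver splits the volume handle on "#" and rejects IDs with fewer
// than three segments, i.e. fewer than two "#" separators.
func parseAzureFileVolumeID(volumeID string) (rg, account, share string, err error) {
	segments := strings.Split(volumeID, "#")
	if len(segments) < 3 {
		return "", "", "", fmt.Errorf("error parsing volume id: %q, should at least contain two #", volumeID)
	}
	// Segments beyond the first three (disk name, uuid, ...) are optional.
	return segments[0], segments[1], segments[2], nil
}

So a handle like <AZ-RG>-main-dev, with no "#" at all, fails immediately.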

@blackpiglet blackpiglet added Area/Cloud/Azure Area/CSI Related to Container Storage Interface support labels Aug 19, 2024
@reasonerjt reasonerjt added the Needs info Waiting for information label Aug 19, 2024
@reasonerjt
Contributor

It seems Azure assumes that if a PV is provisioned by azurefile-csi, the volume ID MUST contain two #s.

@sivarama-p-raju
Author

sivarama-p-raju commented Aug 19, 2024

@blackpiglet

Thank you for your response.

The PV is provisioned statically using a manifest with a specific name, and the PVC then references the PV.
This is also a valid way of provisioning a volume, as described in the official documentation here.
Here are the manifests used in this use-case:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <pv-name>
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: file.csi.azure.com
    nodeStageSecretRef:
      name: file-share-secret
      namespace: <secret-ns>
    readOnly: false
    volumeAttributes:
      resourceGroup: <azure-resource-group>
      shareName: <share-name>
    volumeHandle: <azure-resource-group>-main-dev
  mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - nosharesock
  - nobrl
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <pvc-ns>
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: azurefile-csi
  volumeName: <pv-name>
---

Thank you for the link to the Azure file CSI code which shows the limitation on the format of the volume handle.
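If I read that code correctly, a statically provisioned PV would need a volumeHandle of the form below to pass the parsing (this is my assumption from the driver code, not something I have verified):

  volumeHandle: <azure-resource-group>#<storage-account-name>#<share-name>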

Could you please let me know your thoughts on what I could do in this case?

@blackpiglet
Contributor

Due to the Azure File CSI snapshotter limitation, and because the volume was not provisioned through CSI, I think we cannot back up this volume via CSI snapshots.
How about using the file system backup?
https://velero.io/docs/v1.14/file-system-backup/

@sivarama-p-raju
Author

@blackpiglet

Thank you for the update. Yes, I plan to test the filesystem backup method for the problematic volumes.

I plan to annotate the deployments that use the said volumes with the below, so that filesystem backup is done only for those volumes:

backup.velero.io/backup-volumes: <volume name>
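For example, something like the below kubectl patch should add the annotation to the pod template (all names are placeholders):

kubectl -n <app-ns> patch deployment <deployment-name> --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"<volume name>"}}}}}'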

@sivarama-p-raju
Author

@blackpiglet

Please note that I annotated the pods with the below annotation:

backup.velero.io/backup-volumes: <volume name>

On running a fresh backup, I see that it completed with the status "PartiallyFailed", with the below errors in velero backup describe:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>

Not sure I understand this error.
The pods are not part of a daemonset but are part of a deployment.

Could you please let me know your thoughts on the same?

@sseago
Collaborator

sseago commented Sep 6, 2024

@sivarama-p-raju Are you running the node agent? You told velero to use fs-backup for those pods, but if you're using fs-backup with kopia (or restic), then you need to run the node agent daemonset. From the error message, either the node agent isn't running at all, or it's not running on the nodes with your pods for some reason.
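A quick way to check, assuming the default names from a standard install:

kubectl -n velero get daemonset node-agent
kubectl -n velero get pods -l name=node-agent -o wide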

@sivarama-p-raju
Author

@sseago Thank you for your reply. We did not have the node agent running on this particular cluster. I enabled deployment of the node agent and triggered a new backup after this.

There are new errors this time:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /failed to wait BackupRepository, errored early: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system

Searching further, I found this issue and tried the workaround described there.

The backups now complete successfully.

However, I notice that the <ns>-default-kopia-<xxxxx>-maintain-job-... jobs have now started failing with the below errors:

time="2024-09-07T15:15:07Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:07Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:08Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:09Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:09Z" level=info msg="Succeeded to open repo for maintenance" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:172"
time="2024-09-07T15:15:09Z" level=error msg="An error occurred when running repo prune" error="failed to prune repo: error to prune backup repo: error to maintain repo: error to run maintenance under mode auto: maintenance must be run by designated user: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib/lib_repo.go:219" error.function="github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib.(*kopiaMaintenance).runMaintenance" logSource="pkg/cmd/cli/repomantenance/maintenance.go:72"

I need your help with the below queries:

  • Is configuring the extraVolumes and extraVolumeMounts the correct solution to the backup error unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system? (A sketch of what I applied is included below.)
    • If yes, how should the problems with the maintain job be fixed?
    • If no, what is the solution?
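
For reference, the change I tried is roughly the below in the Helm values (the emptyDir approach comes from the linked issue; the mount path is taken from the error message, so treat this as a sketch):

extraVolumes:
- name: udmrepo
  emptyDir: {}
extraVolumeMounts:
- name: udmrepo
  mountPath: /home/cnb/udmrepo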

Thanks a lot in advance.

@sivarama-p-raju
Author

@sseago @blackpiglet
Could you please let me know your thoughts about the failing kopia maintain jobs that I mentioned ?

@sseago
Collaborator

sseago commented Oct 14, 2024

@sivarama-p-raju Kopia must be able to write to $HOME -- in the usual default configurations this should be possible. Is the root filesystem mounted read-only?
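You can check with something like the below, assuming the default deployment name:

kubectl -n velero get deploy velero \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'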

@bernardgut

@sivarama-p-raju
I had the same issue. It is either a bug with some CRDs that do not get deleted when you helm uninstall, OR some race condition when node-agent starts before velero-*. I am not sure exactly what is going on and I don't have time to debug.

In any case, here is the fix:

  1. delete the velero ns
  2. helm install... (because you did not label the ns, the node-agent-* daemonset pods should NOT start)
  3. wait for velero-... to start
  4. once velero is started, run k label ns velero pod-security.kubernetes.io/enforce=privileged; the node-agent-* pods start
  5. you should not see any maintenance jobs at all. You can now run velero backup create test --snapshot-move-data --include-namespaces <some-namespace-with-pvc> ... and it should work.
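
As plain commands, the above is roughly (release and chart names are from my setup; adjust as needed):

kubectl delete ns velero
helm install velero vmware-tanzu/velero -n velero --create-namespace -f values.yaml
kubectl -n velero wait --for=condition=Available deploy/velero --timeout=120s
kubectl label ns velero pod-security.kubernetes.io/enforce=privileged
velero backup create test --snapshot-move-data --include-namespaces <some-namespace-with-pvc>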

et voilà.

@sivarama-p-raju I am not sure you understand, but there is a reason why everyone is trying to run these pods with containerSecurityContext and podSecurityContext set to minimum privilege. The velero pod basically has access to all your data. The node-agent-* pods can read/write any application data on your cluster. The attack surface is insane.

@sivarama-p-raju
Author

@bernardgut Thank you for the provided details.

As @sseago mentioned, it should work in the usual default configurations. However, I did not get a chance to work on fixing this, and I have removed the use of Kopia completely in our AKS clusters.

We still have a couple of volumes that are provisioned the way I described earlier (not dynamically provisioned), and we are still trying to figure out a way to back those up.

I will close this issue for now and try out the fix you mentioned.

@bernardgut

bernardgut commented Dec 8, 2024

@sivarama-p-raju np.

Also, here is my current config, in case it helps you or anyone else who needs to run velero successfully without any privileges:

values.yaml

podSecurityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true

# add extra volumes and volume mounts for running with
# read-only root filesystem:
extraVolumes:
- emptyDir: {}
  name: udmrepo
- emptyDir: {}
  name: cache
extraVolumeMounts:
- mountPath: /udmrepo
  name: udmrepo
- mountPath: /.cache
  name: cache

kubectl:
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true


initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:v1.11.0
  volumeMounts:
    - mountPath: /target
      name: plugins
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true

configuration:
  features: EnableCSI,EnableAPIGroupVersions
  backupStorageLocation:
  - name: s3.REDACTEDDOMAIN
    provider: aws
    bucket: p0-backup
    default: true
    config:
      region: minio
      s3ForcePathStyle: "true"
      s3Url: https://s3.REDACTEDDOMAIN
  volumeSnapshotLocation:
  - name: s3.REDACTEDDOMAIN
    provider: aws
    config:
      region: minio

deployNodeAgent: true

nodeAgent:
  priorityClassName: system-node-critical
  podSecurityContext:
   runAsUser: 1000
   runAsGroup: 1000
   fsGroup: 0
   runAsNonRoot: true
   seccompProfile:
     type: RuntimeDefault
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  extraArgs: 
    - --node-agent-configmap=concurrency-config

concurrency-config.json

{
    "loadConcurrency": {
        "globalConfig": 2
    }
}

install procedure:

k create cm concurrency-config -n velero --from-file=concurrency-config.json
helm upgrade --install -n velero  -f values.yaml velero vmware-tanzu/velero \                                     
  --set-file s3-home.secret \
  --create-namespace
# wait 30s for velero to come up then
k label ns velero pod-security.kubernetes.io/enforce=privileged 

I haven't found a way to run the node-agent-..., worker-... and maintenance-... pods as non-root/unprivileged. Please let me know if you find one.

Right now they bypass the Kubernetes API, and therefore security (RBAC), entirely by mounting a hostPath of /var/lib/kubelet/pods in every pod for volume data copying, instead of dynamically mounting the volumes in the worker pods as needed. The problem line is here: https://github.com/vmware-tanzu/helm-charts/blob/f0f07defa8273b0493806032349539f948da40b4/charts/velero/templates/node-agent-daemonset.yaml#L73
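The mount in question looks roughly like this in the rendered daemonset (paraphrased from the linked template; exact names from memory):

# paraphrased sketch of the node-agent daemonset's host mount
volumes:
- name: host-pods
  hostPath:
    path: /var/lib/kubelet/pods
containers:
- name: node-agent
  volumeMounts:
  - name: host-pods
    mountPath: /host_pods
    mountPropagation: HostToContainer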

If I have time I might create an issue, as I think this is a huge security risk. But right now I am too busy.

Cheers.
