failed to take snapshot of the volume -- failed with error: error parsing volume id -- should at least contain two # #8122

Closed
sivarama-p-raju opened this issue Aug 17, 2024 · 13 comments
Labels
Area/Cloud/Azure Area/CSI Related to Container Storage Interface support Needs info Waiting for information

Comments

@sivarama-p-raju

What steps did you take and what happened:
When a normal scheduled backup is run, the backup completes with state "PartiallyFailed". On reviewing the description of the backup, the below errors were found repeating many times:

  Velero:    message: /VolumeSnapshotContent snapcontent-b1cf790e-dfd6-4689-9881-ab9d329cea16 has error: Failed to check and update snapshot content: failed to take snapshot of the volume <AZ-RG>-main-dev: "rpc error: code = Internal desc = GetFileShareInfo(<AZ-RG>-main-dev) failed with error: error parsing volume id: \"<AZ-RG>-main-dev\", should at least contain two #"

The volume in question ("<AZ-RG>-main-dev") uses the storageclass "azurefile-csi", but is not a dynamically provisioned volume.

There are other volumes using the same storageclass that are dynamically provisioned; the volume handle of those volumes contains at least two "#", so the requirement appears to be met for them.
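For illustration, the handles of the dynamically provisioned volumes look roughly like this (all values are placeholders; the format is just what I observe):

  <azure-resource-group>#<storage-account-name>#<share-name>#<suffix>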

Is this really a hard requirement? And does this requirement apply only to volumes using the "azurefile-csi" storageclass?

What did you expect to happen:

Expect the backup to complete successfully without the errors.

Anything else you would like to add:
This is on an AKS cluster running Kubernetes v1.29.4. We have a similar issue on multiple AKS clusters.

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.29.4
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): NA

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

blackpiglet commented Aug 19, 2024

I'm a little confused about your scenario.
If a PVC is not dynamically provisioned, how is it related to StorageClass?

For the error, I think there is a limitation on the format of the file share ID in the Azure File CSI code:
https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/ed0c596cf08226abce9091b18e82e9261ed99131/pkg/azurefile/azurefile.go#L474-L481
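A sketch of that check, paraphrased from the linked Go source (not the exact driver code):

package main

import (
	"fmt"
	"strings"
)

// The driver splits the volume handle on "#" and rejects IDs with fewer
// than three segments, i.e. fewer than two "#" separators.
func parseAzureFileVolumeID(volumeID string) (rg, account, share string, err error) {
	segments := strings.Split(volumeID, "#")
	if len(segments) < 3 {
		return "", "", "", fmt.Errorf("error parsing volume id: %q, should at least contain two #", volumeID)
	}
	// Segments beyond the first three (disk name, uuid, ...) are optional.
	return segments[0], segments[1], segments[2], nil
}

So a handle like <AZ-RG>-main-dev, with no "#" at all, fails immediately.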

@blackpiglet blackpiglet added Area/Cloud/Azure Area/CSI Related to Container Storage Interface support labels Aug 19, 2024
@reasonerjt reasonerjt added the Needs info Waiting for information label Aug 19, 2024
@reasonerjt
Contributor

It seems Azure assumes that if a PV is provisioned by azurefile-csi, the volume ID MUST contain two #s.

@sivarama-p-raju
Author

sivarama-p-raju commented Aug 19, 2024

@blackpiglet

Thank you for your response.

The PV is provisioned statically using a manifest with a specific name, and the PVC then references the PV.
This is also a valid way of provisioning a volume, as described in the official documentation here.
Here are the manifests used in this use-case:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <pv-name>
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: file.csi.azure.com
    nodeStageSecretRef:
      name: file-share-secret
      namespace: <secret-ns>
    readOnly: false
    volumeAttributes:
      resourceGroup: <azure-resource-group>
      shareName: <share-name>
    volumeHandle: <azure-resource-group>-main-dev
  mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - nosharesock
  - nobrl
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <pvc-ns>
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: azurefile-csi
  volumeName: <pv-name>
---

Thank you for the link to the Azure file CSI code which shows the limitation on the format of the volume handle.
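If I read that code correctly, a statically provisioned PV would need a volumeHandle of the form below to pass the parsing (this is my assumption from the driver code, not something I have verified):

  volumeHandle: <azure-resource-group>#<storage-account-name>#<share-name>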

Could you please let me know your thoughts on what I could do in this case?

@blackpiglet
Contributor

Due to the Azure File CSI snapshotter limitation, and because the volume was not provisioned through CSI, I think we cannot back up this volume via CSI snapshots.
How about using the file system backup?
https://velero.io/docs/v1.14/file-system-backup/

@sivarama-p-raju
Author

@blackpiglet

Thank you for the update. Yes, I plan to test the filesystem backup method for the problematic volumes.

I plan to annotate the deployments that use the said volumes with the below, so that filesystem backup is done only for those volumes:

backup.velero.io/backup-volumes: <volume name>
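For example, something like the below kubectl patch should add the annotation to the pod template (all names are placeholders):

kubectl -n <app-ns> patch deployment <deployment-name> --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"<volume name>"}}}}}'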

@sivarama-p-raju
Author

@blackpiglet

Please note that I annotated the pods with the below annotation:

backup.velero.io/backup-volumes: <volume name>

On running a fresh backup, I see that it completed with the status "PartiallyFailed", with the below errors in velero backup describe:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>

Not sure I understand this error.
The pods are not part of a daemonset but are part of a deployment.

Could you please let me know your thoughts on the same?

@sseago
Collaborator

sseago commented Sep 6, 2024

@sivarama-p-raju Are you running the node agent? You told velero to use fs-backup for those pods, but if you're using fs-backup with kopia (or restic), then you need to run the node agent daemonset. From the error message, either the node agent isn't running at all, or it's not running on the nodes with your pods for some reason.
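A quick way to check, assuming the default names from a standard install:

kubectl -n velero get daemonset node-agent
kubectl -n velero get pods -l name=node-agent -o wide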

@sivarama-p-raju
Author

@sseago Thank you for your reply. We did not have the node agent running on this particular cluster. I enabled deployment of the node agent and triggered a new backup after this.

There are new errors this time:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /failed to wait BackupRepository, errored early: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system

Searching further, I found this issue and tried the workaround described there.

The backups now complete successfully.

However, I notice that the <ns>-default-kopia-<xxxxx>-maintain-job-... jobs have now started failing with the below errors:

time="2024-09-07T15:15:07Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:07Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:08Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:09Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:09Z" level=info msg="Succeeded to open repo for maintenance" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:172"
time="2024-09-07T15:15:09Z" level=error msg="An error occurred when running repo prune" error="failed to prune repo: error to prune backup repo: error to maintain repo: error to run maintenance under mode auto: maintenance must be run by designated user: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib/lib_repo.go:219" error.function="github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib.(*kopiaMaintenance).runMaintenance" logSource="pkg/cmd/cli/repomantenance/maintenance.go:72"

I need your help with the below queries:

  • Is configuring the extraVolumes and extraVolumeMounts the correct solution to the backup error unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system? (A sketch of what I applied is included below.)
    • If yes, how should the problems with the maintain job be fixed?
    • If no, what is the solution?
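
For reference, the change I tried is roughly the below in the Helm values (the emptyDir approach comes from the linked issue; the mount path is taken from the error message, so treat this as a sketch):

extraVolumes:
- name: udmrepo
  emptyDir: {}
extraVolumeMounts:
- name: udmrepo
  mountPath: /home/cnb/udmrepo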

Thanks a lot in advance.

@sivarama-p-raju
Author

@sseago @blackpiglet
Could you please let me know your thoughts about the failing kopia maintain jobs that I mentioned ?

@sseago
Collaborator

sseago commented Oct 14, 2024

@sivarama-p-raju Kopia must be able to write to $HOME -- in the usual default configurations this should be possible. Is the root filesystem mounted read-only?
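You can check with something like the below, assuming the default deployment name:

kubectl -n velero get deploy velero \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'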

@bernardgut

@sivarama-p-raju
I had the same issue. It is either a bug with some CRDs that do not get deleted when you helm uninstall, OR some race condition when node-agent starts before velero-*. I am not sure exactly what is going on and I don't have time to debug.

In any case, here is the fix:

  1. delete the velero ns
  2. helm install... (because you did not label the ns, the node-agent-* daemonset pods should NOT start)
  3. wait for velero-... to start
  4. once velero is started, run k label ns velero pod-security.kubernetes.io/enforce=privileged; the node-agent-* pods start
  5. you should not see any maintenance jobs at all. You can now run velero backup create test --snapshot-move-data --include-namespaces <some-namespace-with-pvc> ... and it should work.
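
As plain commands, the above is roughly (release and chart names are from my setup; adjust as needed):

kubectl delete ns velero
helm install velero vmware-tanzu/velero -n velero --create-namespace -f values.yaml
kubectl -n velero wait --for=condition=Available deploy/velero --timeout=120s
kubectl label ns velero pod-security.kubernetes.io/enforce=privileged
velero backup create test --snapshot-move-data --include-namespaces <some-namespace-with-pvc>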

et voilà.

@sivarama-p-raju I am not sure you understand, but there is a reason why everyone is trying to run these pods with containerSecurityContext and podSecurityContext set to minimum privilege. The velero pod basically has access to all your data. The node-agent-* pods can read/write any application data on your cluster. The attack surface is insane.

@sivarama-p-raju
Author

@bernardgut Thank you for the provided details.

As @sseago mentioned, it should work in the usual default configurations. However, I did not get a chance to work on fixing this, and I have removed the use of Kopia completely in our AKS clusters.

We still have a couple of volumes that are provisioned the way I described earlier (not dynamically provisioned), and we are still trying to figure out a way to back those up.

I will close this issue for now and try out the fix you mentioned.

@bernardgut

bernardgut commented Dec 8, 2024

@sivarama-p-raju np.

Also, here is my current config, in case it helps you or anyone else who needs to run velero successfully without any privileges:

values.yaml

podSecurityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true

# add extra volumes and volume mounts for running with
# read-only root filesystem:
extraVolumes:
- emptyDir: {}
  name: udmrepo
- emptyDir: {}
  name: cache
extraVolumeMounts:
- mountPath: /udmrepo
  name: udmrepo
- mountPath: /.cache
  name: cache

kubectl:
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true


initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:v1.11.0
  volumeMounts:
    - mountPath: /target
      name: plugins
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true

configuration:
  features: EnableCSI,EnableAPIGroupVersions
  backupStorageLocation:
  - name: s3.REDACTEDDOMAIN
    provider: aws
    bucket: p0-backup
    default: true
    config:
      region: minio
      s3ForcePathStyle: "true"
      s3Url: https://s3.REDACTEDDOMAIN
  volumeSnapshotLocation:
  - name: s3.REDACTEDDOMAIN
    provider: aws
    config:
      region: minio

deployNodeAgent: true

nodeAgent:
  priorityClassName: system-node-critical
  podSecurityContext:
   runAsUser: 1000
   runAsGroup: 1000
   fsGroup: 0
   runAsNonRoot: true
   seccompProfile:
     type: RuntimeDefault
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  extraArgs: 
    - --node-agent-configmap=concurrency-config

concurrency-config.json

{
    "loadConcurrency": {
        "globalConfig": 2
    }
}

install procedure:

k create cm concurrency-config -n velero --from-file=concurrency-config.json
helm upgrade --install -n velero  -f values.yaml velero vmware-tanzu/velero \                                     
  --set-file s3-home.secret \
  --create-namespace
# wait 30s for velero to come up then
k label ns velero pod-security.kubernetes.io/enforce=privileged 

I haven't found a way to run the node-agent-..., worker-... and maintenance-... pods as non-root/unprivileged. Please let me know if you find one.

Right now they bypass the Kubernetes API, and therefore security (RBAC), entirely by mounting a hostPath of /var/lib/kubelet/pods in every pod for volume data copying, instead of dynamically mounting the volumes in the worker pods as needed. The problem line is here: https://github.com/vmware-tanzu/helm-charts/blob/f0f07defa8273b0493806032349539f948da40b4/charts/velero/templates/node-agent-daemonset.yaml#L73
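The mount in question looks roughly like this in the rendered daemonset (paraphrased from the linked template; exact names from memory):

# paraphrased sketch of the node-agent daemonset's host mount
volumes:
- name: host-pods
  hostPath:
    path: /var/lib/kubelet/pods
containers:
- name: node-agent
  volumeMounts:
  - name: host-pods
    mountPath: /host_pods
    mountPropagation: HostToContainer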

If I have time I might create an issue, as I think this is a huge security risk. But right now I am too busy.

Cheers.
