Impossible to restore PVC using CSI data mover on OKD cluster #8178
Comments
Could you describe the restored PVC and PV? It looks like they are not in the expected state.
Hi! Thanks for your reply.

1. Restore

$ velero restore get

2. List of PVCs

$ oc get pvc -A

3. Kubernetes events

$ oc -n nginx-dev get ev

4. PVC details

$ oc -n nginx-dev describe pvc nginx-pvc
Warning ProvisioningFailed 56m persistentvolume-controller Error saving claim: Operation cannot be fulfilled on persistentvolumeclaims "nginx-pvc": the object has been modified; please apply your changes to the latest version and try again

5. PV details

$ oc describe pv pvc-8775feb4-61fe-4496-b14d-10f79da07fd4

6. Restore PartiallyFailed

After 4 hours I get:

$ velero restore describe nginx-dev-2024-09-02-datamove-restore --details
Phase: PartiallyFailed (run 'velero restore logs nginx-dev-2024-09-02-datamove-restore' for more information)
Started: 2024-09-03 09:35:24 +0200 CEST
Warnings:
Errors:
Backup: nginx-dev-2024-09-02-datamove
Namespaces:
Resources:
Namespace mappings:
Label selector:
Or label selector:
Restore PVs: auto
CSI Snapshot Restores:
Existing Resource Policy: Preserve
Service NodePorts: auto
Uploader config:
Restore Item Operations:
HooksAttempted: 0
Resource List:

The previous PVC in the openshift-adp namespace has disappeared:

$ oc get pvc -A

PV pvc-8775feb4-61fe-4496-b14d-10f79da07fd4 has also disappeared, and the new PVC stays in Pending state:

$ oc -n nginx-dev describe pvc nginx-pvc
Normal Provisioning 3m37s (x89 over 5h7m) pxd.portworx.com_px-csi-ext-5bf5fb4cdb-wb5cj_172f4298-15f8-4dff-9e07-1dbd6fd9e692 External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"
How much data is to be restored?
Please share the velero log bundle by running velero debug.
Hi! This is my backup:

$ velero backup describe nginx-dev-2024-09-02-datamove --details
Phase: Completed
Namespaces:
Resources:
Label selector:
Or label selector:
Storage Location: default
Velero-Native Snapshot PVs: true
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
Hooks:
Backup Format Version: 1.1.0
Started: 2024-09-03 09:14:13 +0200 CEST
Expiration: 2024-10-03 09:14:13 +0200 CEST
Total items to be backed up: 111
Backup Item Operations:
Backup Volumes:
CSI Snapshots:
Pod Volume Backups:
HooksAttempted: 0

MinIO check of the object store contents:

$ mc ls --summarize --recursive okdminio/okd-oadp-velero/kopia
Total Size: 21 KiB
$ mc ls --summarize --recursive okdminio/okd-oadp-velero/backups/nginx-dev-2024-09-02-datamove/
Total Size: 129 KiB

$ velero debug --backup nginx-dev-2024-09-02-datamove --restore nginx-dev-2024-09-02-datamove-restore
From the log, I see a DD created at …
Looks like this DD was never handled by any controller and finally timed out.
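(For anyone following along, a minimal sketch of how to check this yourself, assuming OADP's openshift-adp namespace as used in this thread; <dd-name> is a placeholder.)

$ oc -n openshift-adp get datadownloads.velero.io        # list the DataDownload (DD) CRs created by data mover restores and their status
$ oc -n openshift-adp describe datadownload <dd-name>    # Status and Events show why a DD stalled or timed out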
From the node-agent log, the restorePod never got to the running state.
@vincmarz Could you check the status of the restorePod created during the data movement? If it is not running, just describe it and see what problem is blocking it from running.
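(A hedged sketch of that check, assuming the restorePod is created in the openshift-adp namespace; the pod name is a placeholder.)

$ oc -n openshift-adp get pods -o wide                   # the temporary restorePod appears here during data movement
$ oc -n openshift-adp describe pod <restore-pod-name>    # Events usually reveal what blocks it (scheduling, volume attach/mount, image pull)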
Hi! It's the first time we're using CSI storage, so we're exploring the new possibilities of CSI data movement. I retried and got the same results after the 4-hour timeout:

1. Pod is pending

$ oc get all -n nginx-dev
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
NAME READY UP-TO-DATE AVAILABLE AGE
NAME DESIRED CURRENT READY AGE
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD

$ oc describe po nginx-deployment-7754db9f48-zw2h5 -n nginx-dev
Warning FailedScheduling 4h17m stork 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling..

$ oc get ev -n nginx-dev

2. PVC is pending

$ oc get pvc -n nginx-dev
$ oc get pvc -n nginx-dev
Normal ExternalProvisioning 4m50s (x1047 over 4h19m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

3. Velero status

$ oc -n openshift-adp get all
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
NAME READY UP-TO-DATE AVAILABLE AGE
NAME DESIRED CURRENT READY AGE
NAME COMPLETIONS DURATION AGE

$ oc -n openshift-adp get ev

4. Velero log
@vincmarz
Hi! This is what happened.

Pod:

$ oc -n openshift-adp get po -o wide

PVC:

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE

But the pod still remains in ContainerCreating state, because the chosen node is not appropriate for the restore: it is a storageless node (in my scenario, only the infra nodes have storage). So you can close or merge this issue, because I see there is another issue, #8186, about a node selector that would be useful during restore with data movement. Thanks for your support. Best regards!
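(For reference, roughly how the mis-scheduling can be confirmed; this is a sketch with placeholder names, not commands taken from the thread.)

$ oc -n openshift-adp get po <restore-pod-name> -o wide  # the NODE column shows which node the restorePod landed on
$ oc -n openshift-adp describe po <restore-pod-name>     # Events typically point at the volume attach/mount problem on that node
$ oc describe node <node-name>                           # confirm whether that node actually provides Portworx storage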
@vincmarz I think this adds a new use case to #8186; we want to have more inputs to prioritize our tasks.
Hi @Lyndon-Li, this is my cluster:

$ oc get nodes

We have 3 nodes for the Portworx cluster:

And 4 storageless nodes:

For these nodes we use an NFS server for any storage needs. For the Portworx nodes, we use the following CSI storage class:

$ oc get sc px-csi-db -o yaml
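(A sketch of one way to see which CSI drivers each node registers; pxd.portworx.com matches the provisioner in the events above, and <node-name> is a placeholder.)

$ oc get csinodes                                                    # one CSINode object per node
$ oc get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}'  # lists the CSI drivers registered on that node (look for pxd.portworx.com)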
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity. |
What steps did you take and what happened:
We are using Velero 1.14.1 on OKD with the data mover feature, but our restores end up PartiallyFailed (see the restore details in the comments above).
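For context, a sketch of what the backup/restore pair looks like from the CLI (the resource names match those shown in the comments; the exact flags are an assumption based on the data mover feature being enabled, not the literal commands used):

$ velero backup create nginx-dev-2024-09-02-datamove --include-namespaces nginx-dev --snapshot-move-data
$ velero restore create nginx-dev-2024-09-02-datamove-restore --from-backup nginx-dev-2024-09-02-datamove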
What did you expect to happen:
Restore to complete successfully.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Environment:
Velero version (use velero version): 1.14.1
Velero features (use velero client config get features): features: EnableCSI
Kubernetes version (use kubectl version): v1.28.7+6e2789b
OS (e.g. from /etc/os-release):
NAME="Fedora Linux"
VERSION="39.20240210.3.0 (CoreOS)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora CoreOS 39.20240210.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='39.20240210.3.0'
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.