Impossible to restore PVC using CSI data mover on OKD cluster #8178

Closed
vincmarz opened this issue Sep 2, 2024 · 14 comments

vincmarz commented Sep 2, 2024

What steps did you take and what happened:

We are using Velero 1.14.1 on OKD with the data mover feature, but our restores partially fail:

$ velero restore describe nginx-dev-2024-09-02-restore --details
Name:         nginx-dev-2024-09-02-restore
Namespace:    openshift-adp
Labels:       <none>
Annotations:  <none>

Phase:                       PartiallyFailed (run 'velero restore logs nginx-dev-2024-09-02-restore' for more information)
Total items to be restored:  43
Items restored:              43

Started:    2024-09-02 10:49:44 +0200 CEST
Completed:  2024-09-02 14:59:49 +0200 CEST

Warnings:
  Velero:     <none>
  Cluster:  could not restore, CustomResourceDefinition "volumesnapshots.snapshot.storage.k8s.io" already exists. Warning: the in-cluster version is different than the backed-up version
  Namespaces:
    nginx-dev:  could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, ConfigMap "openshift-service-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "admin" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
                could not restore, RoleBinding "system:openshift:scc:anyuid" already exists. Warning: the in-cluster version is different than the backed-up version

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    nginx-dev:  fail to patch dynamic PV, err: context deadline exceeded, PVC: nginx-pvc, PV: pvc-a0074f98-1d0b-47bd-a794-267b3bc510b9

Backup:  book-dev-test-2024-09-02-datamove

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

CSI Snapshot Restores:
  nginx-dev/nginx-pvc:
    Data Movement:
      Operation ID: dd-a27e631f-76f0-4761-9834-61d13ea30280.a0074f98-1d0b-47b4fc259
      Data Mover: velero
      Uploader Type: kopia

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto

Uploader config:

Restore Item Operations:
  Operation for persistentvolumeclaims nginx-dev/nginx-pvc:
    Restore Item Action Plugin:  velero.io/csi-pvc-restorer
    Operation ID:                dd-a27e631f-76f0-4761-9834-61d13ea30280.a0074f98-1d0b-47b4fc259
    Phase:                       Failed
    Operation Error:             Asynchronous action timed out
    Progress description:        Accepted
    Created:                     2024-09-02 10:49:48 +0200 CEST

HooksAttempted:   0
HooksFailed:      0

What did you expect to happen:

Restore to complete successfully.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.14.1
  • Velero features (use velero client config get features): features: EnableCSI
  • Kubernetes version (use kubectl version): v1.28.7+6e2789b
  • Kubernetes installer & version: OpenShift OKD 4.15.0-0.okd-2024-03-10-010116
  • Cloud provider or hardware configuration: Microsoft Hyper-V
  • OS (e.g. from /etc/os-release):
    NAME="Fedora Linux"
    VERSION="39.20240210.3.0 (CoreOS)"
    ID=fedora
    VERSION_ID=39
    VERSION_CODENAME=""
    PLATFORM_ID="platform:f39"
    PRETTY_NAME="Fedora CoreOS 39.20240210.3.0"
    ANSI_COLOR="0;38;2;60;110;180"
    LOGO=fedora-logo-icon
    CPE_NAME="cpe:/o:fedoraproject:fedora:39"
    HOME_URL="https://getfedora.org/coreos/"
    DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
    SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
    BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
    REDHAT_BUGZILLA_PRODUCT="Fedora"
    REDHAT_BUGZILLA_PRODUCT_VERSION=39
    REDHAT_SUPPORT_PRODUCT="Fedora"
    REDHAT_SUPPORT_PRODUCT_VERSION=39
    SUPPORT_END=2024-11-12
    VARIANT="CoreOS"
    VARIANT_ID=coreos
    OSTREE_VERSION='39.20240210.3.0'

Lyndon-Li (Contributor) commented:

Could you describe the restored PVC and PV? It looks like they are not in a Bound state.
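That bind-state check can be sketched as follows; the live commands are shown as comments, and the filter below runs against sample output shaped like `oc get pvc -A` (names taken from this issue):

```shell
# Minimal sketch: flag PVCs that are not Bound. On a live cluster you would run:
#   oc get pvc -A        # then: oc -n <ns> describe pvc <name>
#   oc get pv            # and:  oc describe pv <volume>
# Here the same filter runs on sample output shaped like `oc get pvc -A`.
check_unbound() {
  awk 'NR > 1 && $3 != "Bound" {print $1 "/" $2 " is " $3}'
}

check_unbound <<'EOF'
NAMESPACE       NAME                                          STATUS    VOLUME
nginx-dev       nginx-pvc                                     Pending
openshift-adp   nginx-dev-2024-09-02-datamove-restore-fhjgf   Bound     pvc-8775feb4
EOF
```

On this cluster it would flag nginx-dev/nginx-pvc, matching the Pending state shown later in the thread.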


vincmarz commented Sep 3, 2024

Hi! Thanks for your reply.
I retried, and I'll describe my procedure step by step.
While the restore is in progress, I see:

1. Restore

$ velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
nginx-dev-2024-09-02-datamove-restore nginx-dev-2024-09-02-datamove WaitingForPluginOperations 2024-09-03 09:35:24 +0200 CEST 0 10 2024-09-03 09:35:24 +0200 CEST

2. List of PVC

$ oc get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nginx-dev nginx-pvc Pending px-csi-db 87m
openshift-adp nginx-dev-2024-09-02-datamove-restore-fhjgf Bound pvc-8775feb4-61fe-4496-b14d-10f79da07fd4 1Gi RWO px-csi-db 87m

3. Kubernetes events

$ oc -n nginx-dev get ev
LAST SEEN TYPE REASON OBJECT MESSAGE
55m Warning FailedScheduling pod/nginx-deployment-76484dcb9d-2g2cw 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling..
4m51s Warning FailedScheduling pod/nginx-deployment-76484dcb9d-2g2cw 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling..
55m Warning ProvisioningFailed persistentvolumeclaim/nginx-pvc Error saving claim: Operation cannot be fulfilled on persistentvolumeclaims "nginx-pvc": the object has been modified; please apply your changes to the latest version and try again
2m47s Normal Provisioning persistentvolumeclaim/nginx-pvc External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"
32m Warning ProvisioningFailed persistentvolumeclaim/nginx-pvc failed to provision volume with StorageClass "px-csi-db": claim Selector is not supported
18s Normal ExternalProvisioning persistentvolumeclaim/nginx-pvc Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

4. PVC details

$ oc -n nginx-dev describe pvc nginx-pvc
Name: nginx-pvc
Namespace: nginx-dev
StorageClass: px-csi-db
Status: Pending
Volume:
Labels: velero.io/backup-name=nginx-dev-2024-09-02-datamove
velero.io/restore-name=nginx-dev-2024-09-02-datamove-restore
velero.io/volume-snapshot-name=velero-nginx-pvc-4xlv8
Annotations: backup.velero.io/must-include-additional-items: true
velero.io/csi-volumesnapshot-class: vsnapclasspxd
volume.beta.kubernetes.io/storage-provisioner: pxd.portworx.com
volume.kubernetes.io/storage-provisioner: pxd.portworx.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: nginx-deployment-76484dcb9d-2g2cw
Events:
Type Reason Age From Message

Warning ProvisioningFailed 56m persistentvolume-controller Error saving claim: Operation cannot be fulfilled on persistentvolumeclaims "nginx-pvc": the object has been modified; please apply your changes to the latest version and try again
Warning ProvisioningFailed 33m (x14 over 56m) pxd.portworx.com_px-csi-ext-5bf5fb4cdb-wb5cj_172f4298-15f8-4dff-9e07-1dbd6fd9e692 failed to provision volume with StorageClass "px-csi-db": claim Selector is not supported
Normal Provisioning 3m30s (x22 over 56m) pxd.portworx.com_px-csi-ext-5bf5fb4cdb-wb5cj_172f4298-15f8-4dff-9e07-1dbd6fd9e692 External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"
Normal ExternalProvisioning 61s (x227 over 56m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
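The ProvisioningFailed message above ("claim Selector is not supported") suggests the restored PVC carries a spec.selector, which the dynamic provisioner refuses. A hedged way to verify: the live jsonpath command is in the comment, and the manifest below is a hypothetical sample built only to illustrate the check.

```shell
# On a live cluster (assumption: the restored claim keeps a selector):
#   oc -n nginx-dev get pvc nginx-pvc -o jsonpath='{.spec.selector}'
# Offline illustration against a hypothetical manifest:
has_selector() {
  grep -c 'matchLabels' "$1"   # prints the number of matchLabels lines
}

cat > /tmp/nginx-pvc-sample.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
spec:
  selector:
    matchLabels:
      velero.io/backup-name: nginx-dev-2024-09-02-datamove
  storageClassName: px-csi-db
EOF

has_selector /tmp/nginx-pvc-sample.yaml   # prints 1 when a selector is present
```

A non-empty jsonpath result on the live claim would explain why the external provisioner keeps rejecting it.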

5. PV details

$ oc describe pv pvc-8775feb4-61fe-4496-b14d-10f79da07fd4
Name: pvc-8775feb4-61fe-4496-b14d-10f79da07fd4
Labels:
Annotations: pv.kubernetes.io/provisioned-by: pxd.portworx.com
volume.kubernetes.io/provisioner-deletion-secret-name:
volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers: [kubernetes.io/pv-protection]
StorageClass: px-csi-db
Status: Bound
Claim: openshift-adp/nginx-dev-2024-09-02-datamove-restore-fhjgf
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 1Gi
Node Affinity:
Message:
Source:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: pxd.portworx.com
FSType: ext4
VolumeHandle: 1075204168189915209
ReadOnly: false
VolumeAttributes: attached=ATTACH_STATE_INTERNAL_SWITCH
error=
parent=
readonly=false
secure=false
shared=false
sharedv4=false
state=VOLUME_STATE_DETACHED
storage.kubernetes.io/csiProvisionerIdentity=1724280818749-2702-pxd.portworx.com
Events:

6. Restore PartiallyFailed

After 4 hours I get:

$ velero restore describe nginx-dev-2024-09-02-datamove-restore --details
Name: nginx-dev-2024-09-02-datamove-restore
Namespace: openshift-adp
Labels:
Annotations:

Phase: PartiallyFailed (run 'velero restore logs nginx-dev-2024-09-02-datamove-restore' for more information)
Total items to be restored: 42
Items restored: 42

Started: 2024-09-03 09:35:24 +0200 CEST
Completed: 2024-09-03 13:45:35 +0200 CEST

Warnings:
Velero:
Cluster:
Namespaces:
nginx-dev: could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, ConfigMap "openshift-service-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "admin" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
could not restore, RoleBinding "system:openshift:scc:anyuid" already exists. Warning: the in-cluster version is different than the backed-up version

Errors:
Velero:
Cluster:
Namespaces:
nginx-dev: fail to patch dynamic PV, err: context deadline exceeded, PVC: nginx-pvc, PV: pvc-ab8333bf-3f92-4685-bf4d-24234abca090

Backup: nginx-dev-2024-09-02-datamove

Namespaces:
Included: all namespaces found in the backup
Excluded:

Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
Cluster-scoped: auto

Namespace mappings:

Label selector:

Or label selector:

Restore PVs: auto

CSI Snapshot Restores:
nginx-dev/nginx-pvc:
Data Movement:
Operation ID: dd-7d42f4bd-971e-406d-bfd6-a1159da0a98e.ab8333bf-3f92-4689ca2a1
Data Mover: velero
Uploader Type: kopia

Existing Resource Policy:
ItemOperationTimeout: 4h0m0s

Preserve Service NodePorts: auto

Uploader config:

Restore Item Operations:
Operation for persistentvolumeclaims nginx-dev/nginx-pvc:
Restore Item Action Plugin: velero.io/csi-pvc-restorer
Operation ID: dd-7d42f4bd-971e-406d-bfd6-a1159da0a98e.ab8333bf-3f92-4689ca2a1
Phase: Failed
Operation Error: Asynchronous action timed out
Progress description: Accepted
Created: 2024-09-03 09:35:28 +0200 CEST

HooksAttempted: 0
HooksFailed: 0

Resource List:
apps/v1/Deployment:
- nginx-dev/nginx-deployment(created)
apps/v1/ReplicaSet:
- nginx-dev/nginx-deployment-76484dcb9d(created)
authorization.openshift.io/v1/RoleBinding:
- nginx-dev/admin(created)
- nginx-dev/system:deployers(failed)
- nginx-dev/system:image-builders(failed)
- nginx-dev/system:image-pullers(failed)
- nginx-dev/system:openshift:scc:anyuid(created)
discovery.k8s.io/v1/EndpointSlice:
- nginx-dev/nginx-4pxlb(created)
rbac.authorization.k8s.io/v1/RoleBinding:
- nginx-dev/admin(failed)
- nginx-dev/system:deployers(failed)
- nginx-dev/system:image-builders(failed)
- nginx-dev/system:image-pullers(failed)
- nginx-dev/system:openshift:scc:anyuid(failed)
route.openshift.io/v1/Route:
- nginx-dev/nginx(created)
v1/ConfigMap:
- nginx-dev/kube-root-ca.crt(failed)
- nginx-dev/openshift-service-ca.crt(failed)
v1/Endpoints:
- nginx-dev/nginx(created)
v1/Namespace:
- nginx-dev(created)
v1/PersistentVolume:
- pvc-ab8333bf-3f92-4685-bf4d-24234abca090(skipped)
v1/PersistentVolumeClaim:
- nginx-dev/nginx-pvc(created)
v1/Pod:
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw(created)
v1/Secret:
- nginx-dev/builder-dockercfg-55s2p(created)
- nginx-dev/builder-dockercfg-cqcc5(created)
- nginx-dev/builder-dockercfg-r8gv5(created)
- nginx-dev/builder-dockercfg-rx2qd(created)
- nginx-dev/builder-token-5qq56(skipped)
- nginx-dev/default-dockercfg-2clrp(created)
- nginx-dev/default-dockercfg-44cts(created)
- nginx-dev/default-dockercfg-q9kzq(created)
- nginx-dev/default-token-6h7z5(skipped)
- nginx-dev/deployer-dockercfg-hbrfm(created)
- nginx-dev/deployer-dockercfg-hfg7t(created)
- nginx-dev/deployer-dockercfg-snh9f(created)
- nginx-dev/deployer-dockercfg-zmsbw(created)
- nginx-dev/deployer-token-bd7jl(skipped)
- nginx-dev/nginx-dockercfg-9t92h(created)
v1/Service:
- nginx-dev/nginx(created)
v1/ServiceAccount:
- nginx-dev/builder(updated)
- nginx-dev/default(updated)
- nginx-dev/deployer(updated)
- nginx-dev/nginx(created)
velero.io/v2alpha1/DataUpload:
- openshift-adp/nginx-dev-2024-09-02-datamove-v9rdd(skipped)

The previous PVC in the openshift-adp namespace has disappeared:

$ oc get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nginx-dev nginx-pvc Pending px-csi-db 5h4m

The PV pvc-8775feb4-61fe-4496-b14d-10f79da07fd4 has also disappeared.

The PVC remains in the Pending state:

oc -n nginx-dev describe pvc nginx-pvc
Name: nginx-pvc
Namespace: nginx-dev
StorageClass: px-csi-db
Status: Pending
Volume:
Labels: velero.io/backup-name=nginx-dev-2024-09-02-datamove
velero.io/restore-name=nginx-dev-2024-09-02-datamove-restore
velero.io/volume-snapshot-name=velero-nginx-pvc-4xlv8
Annotations: backup.velero.io/must-include-additional-items: true
velero.io/csi-volumesnapshot-class: vsnapclasspxd
volume.beta.kubernetes.io/storage-provisioner: pxd.portworx.com
volume.kubernetes.io/storage-provisioner: pxd.portworx.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: nginx-deployment-76484dcb9d-2g2cw
Events:
Type Reason Age From Message

Normal Provisioning 3m37s (x89 over 5h7m) pxd.portworx.com_px-csi-ext-5bf5fb4cdb-wb5cj_172f4298-15f8-4dff-9e07-1dbd6fd9e692 External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"
Normal ExternalProvisioning 2m7s (x1252 over 5h7m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

Lyndon-Li (Contributor) commented:

How much data is to be restored?

Lyndon-Li (Contributor) commented:

Please share the Velero log bundle by running velero debug.


vincmarz commented Sep 3, 2024

Hi! This is my backup:
$ velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
nginx-dev-2024-09-02-datamove Completed 0 0 2024-09-03 09:14:13 +0200 CEST 29d default

velero backup describe nginx-dev-2024-09-02-datamove --details
Name: nginx-dev-2024-09-02-datamove
Namespace: openshift-adp
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.28.2-3598+6e2789bbd58938-dirty
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=28+

Phase: Completed

Namespaces:
Included: nginx-dev
Excluded:

Resources:
Included: *
Excluded:
Cluster-scoped: auto

Label selector:

Or label selector:

Storage Location: default

Velero-Native Snapshot PVs: true
Snapshot Move Data: true
Data Mover: velero

TTL: 720h0m0s

CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s

Hooks:

Backup Format Version: 1.1.0

Started: 2024-09-03 09:14:13 +0200 CEST
Completed: 2024-09-03 09:14:34 +0200 CEST

Expiration: 2024-10-03 09:14:13 +0200 CEST

Total items to be backed up: 111
Items backed up: 111

Backup Item Operations:
Operation for persistentvolumeclaims nginx-dev/nginx-pvc:
Backup Item Action Plugin: velero.io/csi-pvc-backupper
Operation ID: du-15c8f52f-bf10-4b40-b26b-e01bf1bcdd3a.ab8333bf-3f92-468fe9997
Items to Update:
datauploads.velero.io openshift-adp/nginx-dev-2024-09-02-datamove-v9rdd
Phase: Completed
Progress: 617 of 617 complete (Bytes)
Progress description: Completed
Created: 2024-09-03 09:14:19 +0200 CEST
Started: 2024-09-03 09:14:29 +0200 CEST
Updated: 2024-09-03 09:14:32 +0200 CEST
Resource List:
apps/v1/Deployment:
- nginx-dev/nginx-deployment
apps/v1/ReplicaSet:
- nginx-dev/nginx-deployment-76484dcb9d
authorization.openshift.io/v1/RoleBinding:
- nginx-dev/admin
- nginx-dev/system:deployers
- nginx-dev/system:image-builders
- nginx-dev/system:image-pullers
- nginx-dev/system:openshift:scc:anyuid
discovery.k8s.io/v1/EndpointSlice:
- nginx-dev/nginx-4pxlb
rbac.authorization.k8s.io/v1/RoleBinding:
- nginx-dev/admin
- nginx-dev/system:deployers
- nginx-dev/system:image-builders
- nginx-dev/system:image-pullers
- nginx-dev/system:openshift:scc:anyuid
route.openshift.io/v1/Route:
- nginx-dev/nginx
v1/ConfigMap:
- nginx-dev/kube-root-ca.crt
- nginx-dev/openshift-service-ca.crt
v1/Endpoints:
- nginx-dev/nginx
v1/Event:
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1612ffdea76a6
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a75ea05c047b
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a7610d889411
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a7610e19d318
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a7614abef33e
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a7614da5277e
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a761d6dba899
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a761dd6deb0a
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw.17f1a761df0bdeda
- nginx-dev/nginx-pvc.17f16132268c095a
- nginx-dev/nginx-pvc.17f1613226922609
- nginx-dev/nginx-pvc.17f1a75ea20267e4
- nginx-dev/nginx-pvc.17f1a75ea22c0555
- nginx-dev/nginx-pvc.17f1a75ec2f3f365
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f0f5824742
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f0f77902ea
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f0f7cd23ae
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f13953f0c0
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f1395410c8
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f139b12188
- nginx-dev/velero-nginx-pvc-2qmpw.17f1a7f139b16710
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7a0c0fc56
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7a1ac69f0
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7a5827096
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7e3d4a20f
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7e3d4eb1b
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7e41529a9
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7e4155ec9
- nginx-dev/velero-nginx-pvc-6jwg6.17f1a7c7e6df82c1
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23a2afe645
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23a3fc29d3
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23a47800fa
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23e7148c4b
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23e714b3bf
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23e735086a
- nginx-dev/velero-nginx-pvc-7bfhm.17f1aa23e73548de
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70b6e25bd2
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70b9e483fc
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70bb0ae21f
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70fb129f78
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70fb12d948
- nginx-dev/velero-nginx-pvc-7z4jn.17f1aa70fe99daa4
- nginx-dev/velero-nginx-pvc-d8znc.17f1a78077541df8
- nginx-dev/velero-nginx-pvc-d8znc.17f1a78079e476ac
- nginx-dev/velero-nginx-pvc-d8znc.17f1a780bc37d807
- nginx-dev/velero-nginx-pvc-d8znc.17f1a780bc380877
- nginx-dev/velero-nginx-pvc-f49rd.17f1a7b2114a6f39
- nginx-dev/velero-nginx-pvc-f49rd.17f1a7b212a2d1da
- nginx-dev/velero-nginx-pvc-f49rd.17f1a7b2acafec24
- nginx-dev/velero-nginx-pvc-f49rd.17f1a7b2acb02b6c
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cf74068711
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cf75d318c3
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cf785abf05
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cfba9f1533
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cfba9f359f
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cfbb2a8961
- nginx-dev/velero-nginx-pvc-jmrqw.17f1a7cfbb2ad849
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82edb737c69
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82edcb4db07
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82f1efdb2bb
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82f1efdea97
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82f1f636905
- nginx-dev/velero-nginx-pvc-txpxl.17f1a82f1f63a2d5
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fdbe580d31
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fdc22c03aa
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fdc3225095
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fe074acad3
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fe074b0cd7
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fe07ac581b
- nginx-dev/velero-nginx-pvc-z6f9t.17f1a7fe07ac8cd7
v1/Namespace:
- nginx-dev
v1/PersistentVolume:
- pvc-ab8333bf-3f92-4685-bf4d-24234abca090
v1/PersistentVolumeClaim:
- nginx-dev/nginx-pvc
v1/Pod:
- nginx-dev/nginx-deployment-76484dcb9d-2g2cw
v1/Secret:
- nginx-dev/builder-dockercfg-55s2p
- nginx-dev/builder-dockercfg-cqcc5
- nginx-dev/builder-dockercfg-r8gv5
- nginx-dev/builder-dockercfg-rx2qd
- nginx-dev/builder-token-5qq56
- nginx-dev/default-dockercfg-2clrp
- nginx-dev/default-dockercfg-44cts
- nginx-dev/default-dockercfg-q9kzq
- nginx-dev/default-token-6h7z5
- nginx-dev/deployer-dockercfg-hbrfm
- nginx-dev/deployer-dockercfg-hfg7t
- nginx-dev/deployer-dockercfg-snh9f
- nginx-dev/deployer-dockercfg-zmsbw
- nginx-dev/deployer-token-bd7jl
- nginx-dev/nginx-dockercfg-9t92h
v1/Service:
- nginx-dev/nginx
v1/ServiceAccount:
- nginx-dev/builder
- nginx-dev/default
- nginx-dev/deployer
- nginx-dev/nginx

Backup Volumes:
Velero-Native Snapshots:

CSI Snapshots:
nginx-dev/nginx-pvc:
Data Movement:
Operation ID: du-15c8f52f-bf10-4b40-b26b-e01bf1bcdd3a.ab8333bf-3f92-468fe9997
Data Mover: velero
Uploader Type: kopia
Moved data Size (bytes): 617

Pod Volume Backups:

HooksAttempted: 0
HooksFailed: 0

Checking the object store contents in MinIO:

$ mc ls --summarize --recursive okdminio/okd-oadp-velero/kopia
[2024-09-03 09:14:30 CEST] 771B STANDARD nginx-dev/_log_20240903071430_6d08_1725347670_1725347670_1_e31b312ab7c3137b04e16d84c40629a2
[2024-09-03 09:14:32 CEST] 1.3KiB STANDARD nginx-dev/_log_20240903071431_9af2_1725347671_1725347672_1_f71e7c8d2880e324544acb29c822592f
[2024-09-03 09:14:30 CEST] 30B STANDARD nginx-dev/kopia.blobcfg
[2024-09-03 09:14:30 CEST] 1.0KiB STANDARD nginx-dev/kopia.repository
[2024-09-03 09:14:32 CEST] 4.2KiB STANDARD nginx-dev/pa594dbabea29edeff9ec798780c48033-s63e0fd2f73f3729d12c
[2024-09-03 09:14:32 CEST] 4.2KiB STANDARD nginx-dev/q31cacb7fe126bbb989b60d4bc01f1c1f-s63e0fd2f73f3729d12c
[2024-09-03 09:14:31 CEST] 4.2KiB STANDARD nginx-dev/q83600f71258357d3583a50ed10b0a053-s9fccfdd5cd56691312c
[2024-09-03 09:14:30 CEST] 4.2KiB STANDARD nginx-dev/qb967aba306e0236894cff8e5c9508e78-s02a22606d6d2c56412c
[2024-09-03 09:14:31 CEST] 143B STANDARD nginx-dev/xn0_2fab18c0367eee8c99e5efbfc10578e2-s9fccfdd5cd56691312c-c1
[2024-09-03 09:14:30 CEST] 143B STANDARD nginx-dev/xn0_324dcc1bfea5de9569bc56e6757cce53-s02a22606d6d2c56412c-c1
[2024-09-03 09:14:32 CEST] 311B STANDARD nginx-dev/xn0_d25cf318d416145514270921f9bce4f4-s63e0fd2f73f3729d12c-c1

Total Size: 21 KiB
Total Objects: 11

$ mc ls --summarize --recursive okdminio/okd-oadp-velero/backups/nginx-dev-2024-09-02-datamove/
[2024-09-03 09:14:19 CEST] 29B STANDARD nginx-dev-2024-09-02-datamove-csi-volumesnapshotclasses.json.gz
[2024-09-03 09:14:19 CEST] 29B STANDARD nginx-dev-2024-09-02-datamove-csi-volumesnapshotcontents.json.gz
[2024-09-03 09:14:19 CEST] 29B STANDARD nginx-dev-2024-09-02-datamove-csi-volumesnapshots.json.gz
[2024-09-03 09:14:33 CEST] 386B STANDARD nginx-dev-2024-09-02-datamove-itemoperations.json.gz
[2024-09-03 09:14:19 CEST] 13KiB STANDARD nginx-dev-2024-09-02-datamove-logs.gz
[2024-09-03 09:14:19 CEST] 29B STANDARD nginx-dev-2024-09-02-datamove-podvolumebackups.json.gz
[2024-09-03 09:14:19 CEST] 1.1KiB STANDARD nginx-dev-2024-09-02-datamove-resource-list.json.gz
[2024-09-03 09:14:19 CEST] 49B STANDARD nginx-dev-2024-09-02-datamove-results.gz
[2024-09-03 09:14:34 CEST] 425B STANDARD nginx-dev-2024-09-02-datamove-volumeinfo.json.gz
[2024-09-03 09:14:19 CEST] 29B STANDARD nginx-dev-2024-09-02-datamove-volumesnapshots.json.gz
[2024-09-03 09:14:34 CEST] 111KiB STANDARD nginx-dev-2024-09-02-datamove.tar.gz
[2024-09-03 09:14:34 CEST] 3.4KiB STANDARD velero-backup.json

Total Size: 129 KiB
Total Objects: 12

$ velero debug --backup nginx-dev-2024-09-02-datamove --restore nginx-dev-2024-09-02-datamove-restore
2024/09/03 16:37:04 Collecting velero resources in namespace: openshift-adp
2024/09/03 16:37:05 Collecting velero deployment logs in namespace: openshift-adp
2024/09/03 16:37:06 Collecting log and information for backup: nginx-dev-2024-09-02-datamove
2024/09/03 16:37:07 Collecting log and information for restore: nginx-dev-2024-09-02-datamove-restore
2024/09/03 16:37:07 Generated debug information bundle: /home/okdadmin/bundle-2024-09-03-16-37-04.tar.gz
bundle-2024-09-03-16-37-04.tar.gz

Lyndon-Li (Contributor) commented:

From the log, I see a DataDownload (DD) created at 2024-09-03T07:35:28Z that was not handled by any node-agent within 4 hours, so it was cancelled:

            "metadata": {
                "creationTimestamp": "2024-09-03T07:35:28Z",
                "generateName": "nginx-dev-2024-09-02-datamove-restore-",
            "status": {
                "completionTimestamp": "2024-09-03T11:35:35Z",
                "phase": "Canceled",
                "progress": {},
                "startTimestamp": "2024-09-03T11:35:35Z"
            }

It looks like this DD was never handled by any controller and finally timed out.
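A quick arithmetic check on the timestamps in the status above confirms the gap matches the configured 4h0m0s ItemOperationTimeout (a sketch; GNU date is assumed):

```shell
# Creation vs. completion time of the cancelled DataDownload (from the status above)
created="2024-09-03T07:35:28Z"
completed="2024-09-03T11:35:35Z"
gap=$(( $(date -u -d "$completed" +%s) - $(date -u -d "$created" +%s) ))
echo "$((gap / 3600))h $(( (gap % 3600) / 60 ))m $((gap % 60))s elapsed"
```

The elapsed time lands just past the 4h0m0s timeout, consistent with the Canceled phase.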

Lyndon-Li (Contributor) commented:

From the node-agent log, the restorePod never reached Running status, so the data movement never started:

time="2024-09-03T07:35:28Z" level=info msg="Accepting data download nginx-dev-2024-09-02-datamove-restore-fhjgf" controller=DataDownload logSource="pkg/controller/data_download_controller.go:667"
time="2024-09-03T07:35:28Z" level=info msg="This datadownload has been accepted by infra-03.ocp4.policlinico.org" DataDownload=nginx-dev-2024-09-02-datamove-restore-fhjgf controller=DataDownload logSource="pkg/controller/data_download_controller.go:692"
time="2024-09-03T07:35:28Z" level=info msg="Data download is accepted" controller=datadownload datadownload=openshift-adp/nginx-dev-2024-09-02-datamove-restore-fhjgf logSource="pkg/controller/data_download_controller.go:167"
time="2024-09-03T07:35:28Z" level=info msg="Target PVC is consumed" logSource="pkg/exposer/generic_restore.go:84" owner=nginx-dev-2024-09-02-datamove-restore-fhjgf selected node= source namespace=nginx-dev target PVC=nginx-pvc
time="2024-09-03T07:35:28Z" level=info msg="Restore pod is created" logSource="pkg/exposer/generic_restore.go:95" owner=nginx-dev-2024-09-02-datamove-restore-fhjgf pod name=nginx-dev-2024-09-02-datamove-restore-fhjgf source namespace=nginx-dev target PVC=nginx-pvc
time="2024-09-03T07:35:28Z" level=info msg="Restore PVC is created" logSource="pkg/exposer/generic_restore.go:108" owner=nginx-dev-2024-09-02-datamove-restore-fhjgf pvc name=nginx-dev-2024-09-02-datamove-restore-fhjgf source namespace=nginx-dev target PVC=nginx-pvc
time="2024-09-03T07:35:28Z" level=info msg="Restore is exposed" controller=datadownload datadownload=openshift-adp/nginx-dev-2024-09-02-datamove-restore-fhjgf logSource="pkg/controller/data_download_controller.go:195"

@vincmarz Could you check the status of the restorePod created during the data movement? If it is not running, describe it and see what is blocking it.
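A sketch of that check: the pod name and the openshift-adp namespace are taken from the node-agent log above, and the offline filter illustrates spotting a stuck pod in `oc get pods` style output (the second pod name is a made-up sample).

```shell
# On the live cluster (names from the node-agent log above):
#   oc -n openshift-adp get pod nginx-dev-2024-09-02-datamove-restore-fhjgf
#   oc -n openshift-adp describe pod nginx-dev-2024-09-02-datamove-restore-fhjgf
# Offline: print any pod stuck in Pending from `oc get pods` style output.
pending_pods() {
  awk '$3 == "Pending" {print $1}'
}

pending_pods <<'EOF'
nginx-dev-2024-09-02-datamove-restore-fhjgf   0/1   Pending   0   4h
node-agent-xyz12                              1/1   Running   0   9d
EOF
```

Describing a Pending restorePod typically surfaces the scheduling or PVC-binding event that blocks it.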

Lyndon-Li self-assigned this Sep 4, 2024

vincmarz commented Sep 4, 2024

Hi! This is the first time we're using CSI storage, so we're exploring the possibilities of CSI data movement. I retried and got the same result after the 4-hour timeout:

1. Pod is pending

$ oc get all -n nginx-dev
W0904 15:41:34.167338 575746 warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME READY STATUS RESTARTS AGE
pod/nginx-deployment-7754db9f48-zw2h5 0/1 Pending 0 4h13m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/nginx ClusterIP 172.30.252.244 80/TCP 4h13m

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx-deployment 0/1 1 0 4h13m

NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-deployment-76484dcb9d 0 0 0 4h13m
replicaset.apps/nginx-deployment-7754db9f48 1 1 0 4h13m

NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/nginx nginx-nginx-dev.apps.ocp4.policlinico.org nginx 80 None

$ oc describe po nginx-deployment-7754db9f48-zw2h5 -n nginx-dev
Name: nginx-deployment-7754db9f48-zw2h5
Namespace: nginx-dev
Priority: 0
Node:
Labels: app=nginx
pod-template-hash=7754db9f48
velero.io/backup-name=nginx-dev-2024-09-02-datamove
velero.io/restore-name=nginx-dev-2024-09-02-datamove-restore
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.129.2.30/23"],"mac_address":"0a:58:0a:81:02:1e","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0.0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.2.30"
],
"mac": "0a:58:0a:81:02:1e",
"default": true,
"dns": {}
}]
openshift.io/scc: anyuid
Status: Pending
IP:
IPs:
Controlled By: ReplicaSet/nginx-deployment-7754db9f48
Containers:
container:
Image: nginx
Port: 80/TCP
Host Port: 0/TCP
Environment:
Mounts:
/var/log/nginx from nginx-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pr2p2 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
nginx-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nginx-pvc
ReadOnly: false
kube-api-access-pr2p2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional:
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message

Warning FailedScheduling 4h17m stork 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling..

$ oc get ev -n nginx-dev
LAST SEEN TYPE REASON OBJECT MESSAGE
22m Warning FailedScheduling pod/nginx-deployment-7754db9f48-zw2h5 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling..
4m23s Normal Provisioning persistentvolumeclaim/nginx-pvc External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"
2m53s Normal ExternalProvisioning persistentvolumeclaim/nginx-pvc Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

2. PVC is pending

$ oc get pvc -n nginx-dev
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nginx-pvc Pending px-csi-db 4h18m

[okdadmin@puidc1vokdbast01 demo-nginx]$ oc describe pvc nginx-pvc -n nginx-dev
Name: nginx-pvc
Namespace: nginx-dev
StorageClass: px-csi-db
Status: Pending
Volume:
Labels: velero.io/backup-name=nginx-dev-2024-09-02-datamove
velero.io/restore-name=nginx-dev-2024-09-02-datamove-restore
velero.io/volume-snapshot-name=velero-nginx-pvc-bq79n
Annotations: backup.velero.io/must-include-additional-items: true
velero.io/csi-volumesnapshot-class: vsnapclasspxd
volume.beta.kubernetes.io/storage-provisioner: pxd.portworx.com
volume.kubernetes.io/storage-provisioner: pxd.portworx.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: nginx-deployment-7754db9f48-zw2h5
Events:
Type Reason Age From Message

Normal ExternalProvisioning 4m50s (x1047 over 4h19m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'pxd.portworx.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal Provisioning 80s (x77 over 4h19m) pxd.portworx.com_px-csi-ext-5bf5fb4cdb-wb5cj_172f4298-15f8-4dff-9e07-1dbd6fd9e692 External provisioner is provisioning volume for claim "nginx-dev/nginx-pvc"

3. Velero status

$ oc -n openshift-adp get all
W0904 15:48:38.709766 575927 warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME READY STATUS RESTARTS AGE
pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 0/1 Completed 0 158m
pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt 0/1 Completed 0 98m
pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv 0/1 Completed 0 38m
pod/node-agent-bcvsf 1/1 Running 0 6h18m
pod/node-agent-c2xp2 1/1 Running 0 6h18m
pod/node-agent-cffjh 1/1 Running 0 6h18m
pod/node-agent-dfph5 1/1 Running 0 6h18m
pod/node-agent-kpm9w 1/1 Running 0 6h18m
pod/node-agent-pdjst 1/1 Running 0 6h18m
pod/node-agent-v42gx 1/1 Running 0 6h18m
pod/velero-86c6c965fd-g8rvw 1/1 Running 0 6h18m

NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/node-agent 7 7 7 7 7 6h18m

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/velero 1/1 1 1 6h18m

NAME DESIRED CURRENT READY AGE
replicaset.apps/velero-86c6c965fd 1 1 1 6h18m

NAME COMPLETIONS DURATION AGE
job.batch/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165 1/1 4s 158m
job.batch/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174 1/1 4s 98m
job.batch/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182 1/1 5s 38m

$ oc -n openshift-adp get ev
LAST SEEN TYPE REASON OBJECT MESSAGE
26m Warning FailedMount pod/nginx-dev-2024-09-02-datamove-restore-blwm8 MountVolume.MountDevice failed for volume "pvc-18f94adb-76f6-40b3-a114-e107f85bace7" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name pxd.portworx.com not found in the list of registered CSI drivers
160m Normal Scheduled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 Successfully assigned openshift-adp/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 to worker-02.ocp4.policlinico.org
160m Normal AddedInterface pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 Add eth0 [10.131.0.127/23] from ovn-kubernetes
160m Normal Pulled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 Container image "velero/velero:v1.14.1" already present on machine
160m Normal Created pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 Created container velero-repo-maintenance-container
160m Normal Started pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4 Started container velero-repo-maintenance-container
160m Normal SuccessfulCreate job/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165 Created pod: nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165-h9ws4
160m Normal Completed job/nginx-dev-default-kopia-9mdpf-maintain-job-1725448191165 Job completed
100m Normal Scheduled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt Successfully assigned openshift-adp/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt to worker-02.ocp4.policlinico.org
100m Normal AddedInterface pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt Add eth0 [10.131.0.128/23] from ovn-kubernetes
100m Normal Pulled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt Container image "velero/velero:v1.14.1" already present on machine
100m Normal Created pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt Created container velero-repo-maintenance-container
100m Normal Started pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt Started container velero-repo-maintenance-container
100m Normal SuccessfulCreate job/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174 Created pod: nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174-5dlxt
100m Normal Completed job/nginx-dev-default-kopia-9mdpf-maintain-job-1725451791174 Job completed
40m Normal Scheduled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv Successfully assigned openshift-adp/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv to worker-04.ocp4.policlinico.org
40m Normal AddedInterface pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv Add eth0 [10.128.4.236/23] from ovn-kubernetes
40m Normal Pulled pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv Container image "velero/velero:v1.14.1" already present on machine
40m Normal Created pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv Created container velero-repo-maintenance-container
40m Normal Started pod/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv Started container velero-repo-maintenance-container
40m Normal SuccessfulCreate job/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182 Created pod: nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182-wn2kv
40m Normal Completed job/nginx-dev-default-kopia-9mdpf-maintain-job-1725455391182 Job completed
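The FailedMount event above ("driver name pxd.portworx.com not found in the list of registered CSI drivers") usually means the intermediate restore pod landed on a node where the Portworx kubelet plugin is not registered. Which CSI drivers each node has registered can be read from its CSINode object; below is a hedged sketch of what a storage node's CSINode looks like when the driver is present (field values are illustrative, only the presence of the entry under spec.drivers matters):

```yaml
# Illustrative output of `oc get csinode infra-03.ocp4.policlinico.org -o yaml`.
# A node whose spec.drivers list lacks pxd.portworx.com cannot mount Portworx volumes.
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: infra-03.ocp4.policlinico.org
spec:
  drivers:
  - name: pxd.portworx.com                  # present only where the Portworx plugin runs
    nodeID: infra-03.ocp4.policlinico.org   # illustrative value
    topologyKeys: null
```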

4. Velero log

bundle-2024-09-04-15-52-57.tar.gz

@Lyndon-Li
Contributor

@vincmarz
By restorePod, I do not mean the pod to be restored, but the intermediate pod created in the Velero namespace during the data mover restore. I suspect that pod never reaches the running state before the timeout. So please describe that pod.

@vincmarz
Author

vincmarz commented Sep 4, 2024

Hi! This is what happened.
When I launched the restore, a new intermediate pod was created in the openshift-adp namespace:

POD

oc -n openshift-adp get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-dev-2024-09-02-datamove-restore-rf4zg 0/1 ContainerCreating 0 33s worker-04.ocp4.policlinico.org

PVC

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/nginx-dev-2024-09-02-datamove-restore-g6cmn Bound pvc-055de3fa-b711-4b0d-aff7-e93b9b4c80aa 1Gi RWO px-csi-db 43s Filesystem

But the pod remains stuck in ContainerCreating state because the chosen node is not appropriate for the restore: it is a storageless node (in my scenario, only the infra nodes have storage).

So you can close or merge this issue, because I see there is already another issue, #8186, about a node selector that would be useful during restores with data movement.

Thanks for your support.

Best regards!
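For the backup half of data movement, Velero (since v1.13) does let you pin the data mover pods to specific nodes through the node-agent's configuration ConfigMap, typically wired up via the node-agent's `--node-agent-config` flag. A hedged sketch for a cluster like this one is below; the ConfigMap name, the data key, and the `px/enabled` label are illustrative, and at the time of this thread the setting applied to backups only — restore-side node selection is exactly what #8186 tracks:

```yaml
# Hedged sketch, assuming a custom node label marking the Portworx storage nodes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-agent-config          # illustrative name, referenced by the node-agent
  namespace: openshift-adp
data:
  node-agent-config.json: |
    {
      "loadAffinity": [
        {
          "nodeSelector": {
            "matchLabels": {
              "px/enabled": "true"
            }
          }
        }
      ]
    }
```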

@Lyndon-Li
Contributor

@vincmarz
Could you describe more of your cluster architecture? How are the infra nodes and storageless nodes organized, and how do they differ in terms of usage? And what is the volume binding mode (i.e., WaitForFirstConsumer or Immediate) of the storage class used for your PVC/PV?

I think this adds a new use case to #8186; we want to gather more input to prioritize our tasks.

@vincmarz
Author

vincmarz commented Sep 5, 2024

Hi @Lyndon-Li, this is my cluster:

$ oc get nodes
NAME STATUS ROLES AGE VERSION
infra-01.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
infra-02.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
infra-03.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
master-01.ocp4.policlinico.org Ready control-plane,master 68d v1.28.7+6e2789b
master-02.ocp4.policlinico.org Ready control-plane,master 68d v1.28.7+6e2789b
master-03.ocp4.policlinico.org Ready control-plane,master 68d v1.28.7+6e2789b
worker-01.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-02.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-03.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-04.ocp4.policlinico.org Ready worker 36d v1.28.7+6e2789b

We have 3 nodes in the Portworx cluster:
infra-01.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
infra-02.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
infra-03.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b

And 4 storageless nodes:
worker-01.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-02.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-03.ocp4.policlinico.org Ready worker 68d v1.28.7+6e2789b
worker-04.ocp4.policlinico.org Ready worker 36d v1.28.7+6e2789b

For these nodes we use an NFS server for any storage needs.

For Portworx nodes, we use the following CSI storage class:

$ oc get sc px-csi-db -o yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    params/aggregation_level: Specifies the number of replication sets the volume
      can be aggregated from
    params/block_size: Block size
    params/docs: https://docs.portworx.com/scheduler/kubernetes/dynamic-provisioning.html
    params/fs: 'Filesystem to be laid out: none|xfs|ext4'
    params/io_profile: 'IO Profile can be used to override the I/O algorithm Portworx
      uses for the volumes: db|sequential|random|cms'
    params/journal: Flag to indicate if you want to use journal device for the volume's
      metadata. This will use the journal device that you used when installing Portworx.
      It is recommended to use a journal device to absorb PX metadata writes
    params/priority_io: 'IO Priority: low|medium|high'
    params/repl: 'Replication factor for the volume: 1|2|3'
    params/secure: 'Flag to create an encrypted volume: true|false'
    params/shared: 'Flag to create a globally shared namespace volume which can be
      used by multiple pods: true|false'
    params/sticky: Flag to create sticky volumes that cannot be deleted until the
      flag is disabled
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2024-07-11T10:25:41Z"
  name: px-csi-db
  resourceVersion: "29055571"
  uid: 2977239a-793d-4765-b3c9-abe305f500a3
parameters:
  io_profile: db_remote
  repl: "3"
provisioner: pxd.portworx.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
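For reference, `volumeBindingMode: Immediate` above means the PV is provisioned and bound as soon as the PVC is created, before the scheduler has picked a node for the consuming pod. A hedged alternative (class name illustrative) defers binding so scheduling and provisioning happen together; whether this actually steers pods onto the storage nodes depends on whether the CSI driver reports node topology:

```yaml
# Hedged sketch of the same class with deferred binding. With
# WaitForFirstConsumer the volume is not provisioned until a pod using the
# PVC is scheduled, letting the scheduler consider the driver's topology.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-csi-db-wffc             # illustrative name
provisioner: pxd.portworx.com
parameters:
  io_profile: db_remote
  repl: "3"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```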


github-actions bot commented Nov 9, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Nov 9, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 23, 2024