
Skip PVB creation when PVC excluded #7045

Conversation

sbahar619
Contributor

@sbahar619 sbahar619 commented Nov 1, 2023

Please add a summary of your change

FS backup creates PodVolumeBackups even when the backup excludes PVCs, so I added logic to skip PVC-backed volumes when PVCs are not included in the backup's resources.
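
For reviewers, the change roughly follows the pattern sketched below. This is a simplified, self-contained illustration only; the real code lives in pkg/backup/item_backupper.go and uses ib.backupRequest.ResourceIncludesExcludes with kuberesource.PersistentVolumeClaims, and the names here are made up for the sketch.

```go
package main

import "fmt"

// volume is a minimal stand-in for corev1.Volume.
type volume struct {
	name    string
	fromPVC bool // true when the volume is backed by a PersistentVolumeClaim
}

// volumesForPVB returns the volumes that should get a PodVolumeBackup.
// PVC-backed volumes are skipped when the PVC resource type is excluded
// from the backup, so the restore never creates a PVR for a missing PVC.
func volumesForPVB(vols []volume, pvcIncluded bool) []volume {
	var out []volume
	for _, v := range vols {
		if v.fromPVC && !pvcIncluded {
			continue // PVC excluded from the backup: skip PVB creation
		}
		out = append(out, v)
	}
	return out
}

func main() {
	vols := []volume{{name: "config", fromPVC: false}, {name: "data", fromPVC: true}}
	fmt.Println(volumesForPVB(vols, false)) // only "config" would get a PVB
}
```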

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@sbahar619 sbahar619 force-pushed the add-logic-to-skip-pvb-creation-when-pvc-excluded branch from afd8a84 to 2cb4334 Compare November 1, 2023 13:37
pkg/backup/item_backupper.go (outdated review thread, resolved)
@sseago
Collaborator

sseago commented Nov 1, 2023

@sbahar619 Also, see the CI/linter check failures -- looks like whitespace issues to fix via "make update", etc. And you need a changelog entry: https://velero.io/docs/main/code-standards/#adding-a-changelog

@sbahar619 sbahar619 force-pushed the add-logic-to-skip-pvb-creation-when-pvc-excluded branch from 2cb4334 to 2483648 Compare November 2, 2023 08:13
@sbahar619
Contributor Author

@sbahar619 Also, see the CI/linter check failures -- looks like whitespace issues to fix via "make update", etc. And you need a changelog entry: https://velero.io/docs/main/code-standards/#adding-a-changelog

Thanks, I added the changelog file; I will fix the lint issue.

@sbahar619 sbahar619 changed the title Skip pvb creation when pvc excluded Skip PVB creation when PVC excluded Nov 2, 2023
@sbahar619 sbahar619 force-pushed the add-logic-to-skip-pvb-creation-when-pvc-excluded branch from 2483648 to e6b3438 Compare November 2, 2023 15:12
@sseago
Collaborator

sseago commented Nov 2, 2023

@sbahar619 I found a problem with a couple of the unit tests -- they weren't including PVCs in the backup but were expecting PVBs, so that's no longer going to work with this change. I submitted a PR against your PR branch -- if you merge that, it will be added to this PR: sbahar619#1

@sbahar619
Contributor Author

@sbahar619 I found a problem with a couple of the unit tests -- they weren't including PVCs in the backup but were expecting PVBs, so that's no longer going to work with this change. I submitted a PR against your PR branch -- if you merge that, it will be added to this PR: sbahar619#1

Thanks!
I merged your PR🙏🏼😀

@shubham-pampattiwar
Collaborator

@sbahar619 please sign your commits and squash them.

@sbahar619 sbahar619 force-pushed the add-logic-to-skip-pvb-creation-when-pvc-excluded branch from e5f6693 to 23fbc19 Compare November 6, 2023 13:57
@sbahar619
Contributor Author

@sbahar619 please sign your commits and squash them.

Done @shubham-pampattiwar, thanks.


codecov bot commented Nov 6, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (03e582c) 61.03% compared to head (23fbc19) 61.00%.
Report is 257 commits behind head on main.

Files Patch % Lines
pkg/backup/item_backupper.go 36.36% 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7045      +/-   ##
==========================================
- Coverage   61.03%   61.00%   -0.04%     
==========================================
  Files         252      255       +3     
  Lines       26900    27048     +148     
==========================================
+ Hits        16419    16501      +82     
- Misses       9319     9367      +48     
- Partials     1162     1180      +18     


@sbahar619
Contributor Author

@qiuming-best
@blackpiglet
Can you please review?🙂

@blackpiglet
Contributor

blackpiglet commented Nov 11, 2023

There is still some discussion about whether this logic should be added.

The concern is: when the pod's volumes are specified to be backed up by file-system upload but the PVC is excluded, which rule should have higher priority?

The current behavior is not to miss the PVC volume backup.
We may need to discuss which way is preferred.

@sbahar619
Contributor Author

@blackpiglet
Thanks for letting me know 🙏🏼
How can I track this discussion and stay updated on it?

@sseago
Collaborator

sseago commented Nov 13, 2023

"The concern is when the pod's volumes are specified to back up by file-system upload, but the PVC is excluded, which rule should have a higher priority."

Hmm, I don't think that's really an issue. The "fs vs. snapshot" decision determines how a volume is backed up. In the case handled by this PR, all PVCs are excluded (i.e. they're excluded resources), which means a PVB for this PVC can never be restored because the volume won't be found when restoring the backup.

This results in a problematic user workflow (illustrative commands follow the list):

  1. create a backup explicitly excluding PVCs
  2. restore the backup; it fails with PVB/PVR errors
  3. the user is confused: "I excluded PVCs from my backup, why am I getting a PVC-related not-found error on restore?"
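
For illustration, the workflow above corresponds roughly to the following commands (these are standard Velero CLI flags as far as I know; names and versions may differ in your setup):

```bash
# 1. FS backup that explicitly excludes PVC objects
velero backup create b1 \
  --include-namespaces a1 \
  --exclude-resources persistentvolumeclaims \
  --default-volumes-to-fs-backup

# 2. Restore from that backup; without this PR the restore partially fails
#    with "persistentvolumeclaims ... not found" warnings from the PVR path
velero restore create r1 --from-backup b1
```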

@sbahar619 What was the error you got in this case without the PR fix?

@sbahar619
Contributor Author

@sseago
I agree with you.
Here are the error logs from a restore of a successful FS backup that excluded PVCs:

time="2023-11-14T08:35:45Z" level=warning msg="unable to restore additional item" additionalResource=persistentvolumeclaims additionalResourceName=mysql additionalResourceNamespace=a1 error="stat /tmp/2458660822/resources/persistentvolumeclaims/namespaces/a1/mysql.json: no such file or directory" logSource="/remote-source/velero/app/pkg/restore/restore.go:1396" restore=openshift-adp/r2
time="2023-11-14T08:35:45Z" level=warning msg="unable to restore additional item" additionalResource=persistentvolumeclaims additionalResourceName=mysql-1 additionalResourceNamespace=a1 error="stat /tmp/2458660822/resources/persistentvolumeclaims/namespaces/a1/mysql-1.json: no such file or directory" logSource="/remote-source/velero/app/pkg/restore/restore.go:1396" restore=openshift-adp/r2

The restore ended PartiallyFailed.

@sseago
Collaborator

sseago commented Nov 14, 2023

@sbahar619 @blackpiglet hmm. Another possible resolution would be to only warn instead of fail if an additional item is not found in the backup at restore time. Unclear whether that's any better, as there may be cases where we want missing additional items to error out.

@blackpiglet
Contributor

blackpiglet commented Dec 7, 2023

@sbahar619 @sseago
I'm sorry for not getting back to you sooner.
The scenario is still not clear to me.
The pod is backed up, but its mounted PVC is not. Even if we skip creating the PVB, will the pod run after restoration? Or did you make more modifications to the pod during the restoration?

@sbahar619
Contributor Author

@sbahar619 @sseago I'm sorry for not getting back to you sooner. The scenario is still not clear to me. The pod is backed up, but its mounted PVC is not. Even if we skip creating the PVB, will the pod run after restoration? Or did you make more modifications to the pod during the restoration?

The issue is that currently, if you run an FS backup with PVCs excluded, PVBs are still created for all the PVCs (which this PR disables, because that is not expected), while the PVC resources themselves are correctly not backed up. The backup completes successfully, but when you run a restore from that same backup, it fails complaining that it cannot find the PVCs, because the restore triggers PVRs, which is again not expected.
So the user runs an FS backup with PVCs excluded, it finishes successfully, then they create a restore from that backup and get a "PVC not found" error, and they are confused.

@blackpiglet
Contributor

@shobha2626
To clarify the situation, is this attempting to back up a pod that mounts a PVC without including the mounted PVC?
I don't think this should happen. Velero can't resolve k8s resource dependencies yet. Assuming it could, IMO it should fail the backup and warn the user that a needed resource is not included.

Out of curiosity, will the restore complete successfully with this PR?

@sbahar619
Contributor Author

@shobha2626

To clarify the situation, is this attempting to back up a pod that mounts a PVC without including the mounted PVC?

I don't think this should happen. Velero can't resolve k8s resource dependencies yet. Assuming it could, IMO it should fail the backup and warn the user that a needed resource is not included.

Out of curiosity, will the restore complete successfully with this PR?

Currently, if you run an FS backup with PVCs excluded, the PVBs are still created and the data is backed up, which is not expected.
The PVC resource YAML itself is skipped.
So trying to restore this backup triggers PVRs again, and the restore fails complaining the PVCs are not found.

And yes, I verified this PR: if you run a backup with this fix, both backup and restore complete successfully and the PVCs are fully skipped.

@blackpiglet
Contributor

blackpiglet commented Dec 8, 2023

Could you share your backed-up workload's YAML?
Does the PVC still exist in the cluster, before applying the restore?

@sbahar619
Contributor Author

Could you share your backed-up workload's YAML? Does the PVC still exist in the cluster, before applying the restore?

Sure, here is the backup YAML:

kind: Backup
metadata:
  annotations:
    helm.sh/resource-policy: keep
    meta.helm.sh/release-name: aws
    meta.helm.sh/release-namespace: openshift-adp
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.27.6+f67aeb3
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "27"
  creationTimestamp: "2023-10-31T08:49:13Z"
  generation: 6
  labels:
    app.kubernetes.io/managed-by: Helm
    velero.io/storage-location: default
  name: b1
  namespace: openshift-adp
  resourceVersion: "62058"
  uid: 04df0caa-fe3e-4742-bff7-19e11fa10fc6
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: true
  includedNamespaces:
  - a1
  includedResources:
  - pod
  - deployment
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: false
  storageLocation: default
  ttl: 720h0m0s
status:
  completionTimestamp: "2023-10-31T08:50:21Z"
  expiration: "2023-11-30T08:49:13Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 2
    totalItems: 2
  startTimestamp: "2023-10-31T08:49:13Z"
  version: 1 

After the backup, I deleted the application namespace and the PVCs, then tried the restore.

@reasonerjt
Contributor

As we have reached FC for v1.13 and this makes a breaking change to the current behavior, I'm blocking this PR from merging to main until we cut the release branch for v1.13.

Contributor

@reasonerjt reasonerjt left a comment

Let's revisit this PR after the release branch is cut

@blackpiglet
Contributor

@sbahar619
I did some tests with this PR. The restore can be completed, but the pod cannot start.

Events:
  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   118s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling   117s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling   115s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   NotTriggerScaleUp  116s  cluster-autoscaler  pod didn't trigger scale-up: 1 persistentvolumeclaim "my-pvc" not found

@sbahar619
Contributor Author

@sbahar619

I did some tests with this PR. The restore can be completed, but the pod cannot start.

Events:
  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   118s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling   117s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling   115s  default-scheduler   0/1 nodes are available: persistentvolumeclaim "my-pvc" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   NotTriggerScaleUp  116s  cluster-autoscaler  pod didn't trigger scale-up: 1 persistentvolumeclaim "my-pvc" not found

This is expected: we asked to exclude PVCs in the backup, so the restore has no PVCs; the pod manifest still requests the PVC volume, so the pod stays Pending and does not start.

@blackpiglet
Contributor

@sbahar619
OK, but what scenario needs this restoration result?
IMO, most users expect the restored workload to end with the Running state.

@sbahar619
Contributor Author

@sbahar619

OK, but what scenario needs this restoration result?

IMO, most users expect the restored workload to end with the Running state.

I don't think it's a common scenario, but Velero's behavior is more accurate with this PR.
Without the PR you end up with a failed restore and an error message that doesn't make sense given the backup specification.

@blackpiglet
Contributor

blackpiglet commented Dec 11, 2023

I see.
If this PR is trying to have better information for the user, we could resolve it in other ways.

If we can have a mechanism to resolve the backed-up resource dependency, that would be an ideal way, but there is probably no available solution in the short term.
Because the backup is already in progress when the Velero server can detect the error, failing the backup may not be a good idea. At least, we should print some warning information for this case.
And we should move this check into a more proper function, e.g. GetVolumesByPod.

@weshayutin
Contributor

@shubham-pampattiwar please review as you have time.

@sseago
Collaborator

sseago commented Dec 13, 2023

@blackpiglet Note that the "pod is pending" result is probably what you'd see without this PR if you're not using fs-backup, since that's a consequence of the user choosing to exclude PVCs while including pods. I think the restore error obscures what's going on a bit. I suspect this PR brings the fs backup use case in line with what we already see for datamover or snapshot backups.

@blackpiglet
Contributor

blackpiglet commented Dec 14, 2023

@sseago
Thanks for the clarification. I think, after the v1.13 branch is cut, we can merge this PR once the review is completed.

@reasonerjt
Contributor

reasonerjt commented Jan 11, 2024

There are two issues involved:

  1. When the PVC is excluded, the PVB should not be created. This is not consistent with how we handle CSI snapshots.
  2. When a backup excludes all PVC, any pods that have PVC mounted will not be restored successfully.

I think the opinion on number 2 will impact how we handle number 1.

I'm more inclined to let velero do what the user tells it to, i.e. it should exclude the PVC even if it will certainly fail the pod after restore.
@sseago
WDYT?

@sseago
Collaborator

sseago commented Jan 11, 2024

Regarding 1) "When the PVC is excluded, the PVB should not be created. This is not consistent with how we handle CSI snapshots." -- Actually, this is consistent with CSI. If the PVC is excluded, then we don't back it up. If we don't back it up, then the BIA for the PVC doesn't run. If the BIA doesn't run, we don't create the VolumeSnapshot.

Regarding 2) "When a backup excludes all PVC, any pods that have PVC mounted will not be restored successfully." This is also true for snapshot restores. Velero reports a successful restore -- the Pod object was successfully restored, but the pod will be pending until the PVC is created. One reason a user may want to omit PVCs from the backup is because they're backing them up via some other means -- in that case, that other means will need to restore them as well.

If I'm understanding points 1) and 2) above correctly, skipping PVB creation makes FS backup/restore consistent with CSI and Volume Snapshotter backup/restore behavior. The end result (current behavior for snapshots, proposed in this PR for FS backup) is as follows:

  1. on backup, if PVC is excluded, we don't back up the contents of the PVC
  2. on restore, if PVC is excluded, we restore Pod properly without reporting errors. Pod is pending until user creates PVC outside of velero.

At the most basic level, if a user is excluding PVCs from backup, then velero shouldn't waste time copying PVC contents into the BSL, whether via datamover or via FS backup.

@blackpiglet
Contributor

blackpiglet commented Jan 29, 2024

I see. If this PR is trying to have better information for the user, we could resolve it in other ways.

If we can have a mechanism to resolve the backed-up resource dependency, that would be an ideal way, but there is probably no available solution in the short term. Because the backup is already in progress when the Velero server can detect the error, failing the backup may not be a good idea. At least, we should print some warning information for this case. And we should move this check into a more proper function, e.g. GetVolumesByPod.

@sbahar619
After the v1.13 branch is cut, we can continue on this PR.
I agree with this PR but still have some comments to address.
Will you continue to work on this?

@blackpiglet blackpiglet self-assigned this Jan 29, 2024
@sbahar619
Contributor Author

I see. If this PR is trying to have better information for the user, we could resolve it in other ways.
If we can have a mechanism to resolve the backed-up resource dependency, that would be an ideal way, but there is probably no available solution in the short term. Because the backup is already in progress when the Velero server can detect the error, failing the backup may not be a good idea. At least, we should print some warning information for this case. And we should move this check into a more proper function, e.g. GetVolumesByPod.

@sbahar619 After the v1.13 branch is cut, we can continue on this PR. I agree with this PR but still have some comments to address. Will you continue to work on this?

Hi @blackpiglet
Sure, can you please highlight the technical comments again?
I got lost in this thread; most of the discussion has been about whether this PR solves a problem in general. Thanks!

for _, volume := range includedVolumes {
	// Check if the PVC resource is not included in the backup
	if !ib.backupRequest.ResourceIncludesExcludes.ShouldInclude(kuberesource.PersistentVolumeClaims.String()) {
Contributor

Should move this check into a more proper function, e.g. GetVolumesByPod.
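
For context, the suggestion amounts to pushing the check down into the volume-collection helper, roughly like the sketch below. The real helper is GetVolumesByPod in pkg/util/podvolume; the backupExcludePVC parameter and the simplified body are illustrative, not the actual signature.

```go
package podvolume

import corev1 "k8s.io/api/core/v1"

// getVolumesToBackUp is a hypothetical helper showing where the check could live.
func getVolumesToBackUp(pod *corev1.Pod, backupExcludePVC bool) []string {
	var volumes []string
	for _, v := range pod.Spec.Volumes {
		// Skip PVC-backed volumes when the backup excludes the PVC resource type.
		if backupExcludePVC && v.PersistentVolumeClaim != nil {
			continue
		}
		volumes = append(volumes, v.Name)
	}
	return volumes
}
```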

@blackpiglet
Contributor

@sbahar619
Sure. I put the suggestion as an inline comment with the code.

@blackpiglet
Contributor

Hi @sbahar619, are you still working on this?
If you have been unavailable recently, I can make the change and let this PR merge.

@sbahar619
Contributor Author

Hi @blackpiglet, sorry for the delay!
Yes, that would be great!

@blackpiglet
Contributor

Because #7472 addressed the need behind this PR, closing this PR.

@blackpiglet blackpiglet closed this Mar 1, 2024
6 participants