Support detect k8s resource dependency during backup #7199

Open
blackpiglet opened this issue Dec 11, 2023 · 10 comments
Labels
2024 Q1 reviewed · Icebox (We see the value, but it is not slated for the next couple releases.) · Kubernetes Resources (Pertains to backup/restoration of Kubernetes resources)

Comments

@blackpiglet
Contributor

blackpiglet commented Dec 11, 2023

Describe the problem/challenge you have

It would be better to know the dependencies among the backed-up k8s resources.
If the Velero server knows them, it can detect invalid backups before running the backup process.
This feature can help resolve the scenario described in PR #7045.

This feature also has benefits for:

  • Running multiple backups in parallel (checking whether the backups have overlapping resources).
  • Supporting advanced resource restore sequences.
  • Backup/restore pause and resume.

Describe the solution you'd like

The Velero server can use a DAG (Directed Acyclic Graph) as the data structure to store the backup resources.

The DAG's structure should be:

  • The DAG could have a root node, which is the backup itself.
  • The children of the DAG's root node should be the resources that do not rely on other k8s resources, e.g. CRDs, namespaces, StorageClasses, and VolumeSnapshotClasses.
  • A node of the DAG could have multiple parents and multiple children.

Say this string represents a DAG; the resource backup sequence should be ordered from left to right.

e > f, g > h;
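
A minimal sketch of what such a DAG could look like in Go, building the `e > f, g > h` example above. The `ResourceNode`/`BackupDAG` types and the `AddDependency` helper are hypothetical illustrations for this issue, not existing Velero types:

```go
package main

import "fmt"

// ResourceNode identifies one backed-up resource and its dependency links.
type ResourceNode struct {
	GroupResource string // e.g. "persistentvolumeclaims"
	Namespace     string
	Name          string
	Parents       []*ResourceNode // resources this one depends on
	Children      []*ResourceNode // resources that depend on this one
}

// BackupDAG has a single root node that represents the backup itself.
type BackupDAG struct {
	Root *ResourceNode
}

// AddDependency links child under parent; a node may be added under several parents.
func (d *BackupDAG) AddDependency(parent, child *ResourceNode) {
	parent.Children = append(parent.Children, child)
	child.Parents = append(child.Parents, parent)
}

func main() {
	dag := &BackupDAG{Root: &ResourceNode{Name: "backup"}}
	// "e > f, g > h": e precedes f and g, which both precede h.
	e := &ResourceNode{Name: "e"}
	f := &ResourceNode{Name: "f"}
	g := &ResourceNode{Name: "g"}
	h := &ResourceNode{Name: "h"}
	dag.AddDependency(dag.Root, e)
	dag.AddDependency(e, f)
	dag.AddDependency(e, g)
	dag.AddDependency(f, h)
	dag.AddDependency(g, h)
	fmt.Println("h has", len(h.Parents), "parents") // 2: h has multiple parents (f and g)
}
```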

The DAG should be generated from the existing rules:

  • The Velero server's high-priority and low-priority resource settings.
  • The Owner Reference rule.
  • Potential user-provided rules (may need a new CRD here).

While generating the DAG, if a later rule violates the existing resource hierarchy, fail the backup and warn the user that the rule is invalid.
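
A rough sketch of how rule application could detect such a violation, using a simple reachability check over the edges built so far; the `ruleGraph` type, `addRule` helper, and the example rule names are made up for illustration and are not part of Velero:

```go
package main

import "fmt"

type ruleGraph struct {
	edges map[string][]string // edge a -> b means "a must be backed up before b"
}

// reachable reports whether "to" can be reached from "from" via existing edges
// (assumes the graph built so far is acyclic, so the recursion terminates).
func (g *ruleGraph) reachable(from, to string) bool {
	if from == to {
		return true
	}
	for _, next := range g.edges[from] {
		if g.reachable(next, to) {
			return true
		}
	}
	return false
}

// addRule adds "before -> after"; it fails if the new rule contradicts the DAG,
// i.e. "after" already precedes "before" somewhere in the hierarchy.
func (g *ruleGraph) addRule(before, after string) error {
	if g.reachable(after, before) {
		return fmt.Errorf("rule %q before %q violates the existing hierarchy", before, after)
	}
	g.edges[before] = append(g.edges[before], after)
	return nil
}

func main() {
	g := &ruleGraph{edges: map[string][]string{}}
	apply := func(before, after string) {
		if err := g.addRule(before, after); err != nil {
			// In the proposal, this is where the backup would fail with a warning.
			fmt.Println("invalid rule:", err)
		}
	}
	apply("CustomResourceDefinition", "MyCustomResource") // priority rule
	apply("Namespace", "Pod")                             // priority rule
	apply("MyCustomResource", "CustomResourceDefinition") // rejected: contradicts the DAG
}
```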

When taking the backup, start from the root node and go through the root node's children, then traverse the children's children, layer by layer. If the backup reaches a resource whose parents are not all backed up yet, the Velero server should put it on hold and continue, then retry the on-hold resources before traversing the next layer of resources.
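
A sketch of that layer-by-layer traversal with the on-hold retry, again using hypothetical types (`node`, `backupLayerByLayer`) rather than Velero code:

```go
package main

import "fmt"

type node struct {
	name     string
	parents  []*node
	children []*node
}

// backupLayerByLayer walks the DAG from the root, backing up one layer of
// children at a time. A node whose parents are not all backed up yet is put
// on hold and retried before the next layer is traversed.
func backupLayerByLayer(root *node) {
	backedUp := map[*node]bool{root: true}
	layer := root.children
	for len(layer) > 0 {
		progressed := false
		var next []*node
		process := func(items []*node) (held []*node) {
			for _, n := range items {
				if backedUp[n] {
					continue
				}
				ready := true
				for _, p := range n.parents {
					if !backedUp[p] {
						ready = false
						break
					}
				}
				if !ready {
					held = append(held, n) // some parents missing: put on hold
					continue
				}
				fmt.Println("backing up", n.name)
				backedUp[n] = true
				progressed = true
				next = append(next, n.children...)
			}
			return held
		}
		onHold := process(layer)
		onHold = process(onHold) // retry on-hold resources before the next layer
		if !progressed {
			fmt.Println("cannot make progress; remaining resources have unmet dependencies")
			break
		}
		layer = append(next, onHold...) // still-blocked resources carry over
	}
}

func main() {
	// Build the "e > f, g > h" example: e precedes f and g, which precede h.
	root := &node{name: "backup"}
	e := &node{name: "e", parents: []*node{root}}
	f := &node{name: "f", parents: []*node{e}}
	g := &node{name: "g", parents: []*node{e}}
	h := &node{name: "h", parents: []*node{f, g}}
	root.children = []*node{e}
	e.children = []*node{f, g}
	f.children = []*node{h}
	g.children = []*node{h}
	backupLayerByLayer(root)
}
```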

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@blackpiglet blackpiglet added pv-backup-info 1.14-candidate Kubernetes Resources Pertains to backup/restoration of Kubernetes resources and removed pv-backup-info labels Dec 11, 2023
@blackpiglet
Contributor Author

I gave some consideration to the BIA (BackupItemAction) scenario here.
The BIA case is different because Velero collects all the resources it cares about at the start of the backup, yet the BIAs are executed during the backup process and return additional resources that are also included in the backup.

That means the Velero server cannot determine the whole scope of the backup until the backup finishes. That makes parallel backups impossible to support, because the Velero server cannot detect the backups' overlaps and potential conflicts.

There has also been some discussion about adding a new BIA method that returns the additional resources during the backup resource collection stage. I don't think it can resolve the issue.

  • First, many of the additional resources the BIAs care about are created while the BIA is running, so it's not possible to know the additional resources before that. I also think this is safe for the parallel backup scenario because they will not cause any resource overlap.
  • Second, the Velero server should do nothing other than archive the additional resources' YAML into the metadata file. Even if an additional resource already existed in the metadata before the BIA returned it, that should not do any harm to the other parallel backups.

I think the real problem BIAs cause is that the Velero server cannot know what the BIAs do. If a BIA freezes the filesystem of a pod that is not included in the backup, although IMO that shouldn't happen, it will impact parallel filesystem backups.

Unfortunately, since the plugins are external binaries, it's not possible to regulate their behavior.
IMO, we can only give a guideline on how the plugins should work to make parallel backups work.

@reasonerjt reasonerjt changed the title Support detect k8s resource dependency Support detect k8s resource dependency during backup Feb 6, 2024
@reasonerjt
Contributor

I think how to define "dependency" is a topic that may cause a lot of debate, and it is very complicated once custom resources are considered.
As for the data structure to track the dependency, there's a design that has been merged:
https://github.com/vmware-tanzu/velero/blob/main/design/graph-manifest.md
We may consider using this data structure to solve specific problems, instead of trying to introduce a generic approach that handles all resources.

Marking this as "ice-box" as we may need more concrete use cases and handle them separately.


github-actions bot commented Apr 8, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Apr 8, 2024
@blackpiglet
Contributor Author

Not stale.

@blackpiglet blackpiglet removed the staled label Apr 8, 2024

@github-actions github-actions bot added the staled label Jun 9, 2024
@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label Jun 11, 2024
@blackpiglet
Contributor Author

unstale

@github-actions github-actions bot removed the staled label Aug 13, 2024
@kaovilai
Contributor

unstale

@github-actions github-actions bot removed the staled label Oct 15, 2024