Design to enable concurrency of operations in velero pod for backup and restore #5510

kaovilai · 2022-10-28T06:28:47Z

⚠️ This PR is being updated to remove references to worker pods.

The intent is to move towards the ability to run backup/restore operations concurrently inside a single Velero pod.

Details will follow.

original PR description below pending removal:

Thank you for contributing to Velero!

Please add a summary of your change

This PR extends the original PR #1653, with the major difference being the use of Kubernetes Jobs, instead of Pods directly, for worker pod lifecycle management.

graph LR;

subgraph Velero Managed Resources
    A[Backup CR-1] & C[Backup CR-2] & M[Backup CR-3] -->|Watched by|B;
    B((Velero Controller))-->|Create|D & E & N;
    end
    D[Worker Job-1]-->|Create|F & FF;
    E[Worker Job-2]-->|Create|G;
    N[Worker Job-3]-->|Create|O & OO & OOO;
    F((Worker Pod#1))-->|PodStatus|J;
    FF((Worker Pod#2))-->|PodStatus|L;
    G((Worker Pod#1))-->|PodStatus|K;
    O((Worker Pod#1))-->|PodStatus|Q;
    OO((Worker Pod#2))-->|PodStatus|R;
    OOO((Worker Pod#3))-->|PodStatus|S;
    J[Failed];
    L[Succeeded]-->|Update job status|D;
    K[Succeeded]-->|Update job status|E;
    Q[Failed];
    R[Failed];
    S[Failed]-->|Update job status|N;
    N-->|Failed|B;
    E-->|Succeeded|B;
    D-->|Succeeded|B;

Does your change fix a particular issue?

Fixes #(issue)
#2601

Please indicate you've done the following:

[] Accepted the DCO. Commits without the DCO will delay acceptance.
Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
Updated the corresponding documentation in site/content/docs/main.

/kind changelog-not-required

kaovilai · 2022-10-31T15:22:17Z

Received comments to avoid using Jobs because of the unpredictability of "non-parallel Jobs".

From the Kubernetes documentation:

Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.

Reverting from Jobs back to Pods. The initial motivation for using Jobs was to make it easier to restart failed backup CRs in place (with Jobs, this happens automatically). This could be a separate enhancement (I will file an issue if there isn't one).

jiangfoxi · 2023-07-14T08:19:30Z

Hello, is there any progress on this concurrency topic?
Velero is good for small data, but it is not capable of handling massive amounts of data, such as above 1T, in a real production environment.

shawn-hurley · 2023-07-21T16:14:15Z

Unless there is some way to share a cache of the listed resources, I would worry about this DDoSing the API server.

We could make sure that only X operations get kicked off at a time, but sharing a cache would generally make this process more performant and would allow the process to look and feel like most other controllers, IMO.

kaovilai · 2023-08-09T20:42:36Z

⚠️ This PR is being updated to remove references to worker pods.

The intent is to move towards the ability to run backup/restore operations concurrently inside a single Velero pod.

Details will follow.

sseago · 2023-08-10T15:36:14Z

Closing this in favor of an approach that does not create new pods.

jiangfoxi · 2023-09-08T08:01:04Z

So, what is the progress on backup/restore concurrency? I really love Velero, but when there is a large amount of data, such as 100T in our project, there is no concurrent backup and the speed is very, very slow. That drives me crazy!

jiangfoxi · 2023-09-08T08:01:45Z

When will this feature be available?

worker pods from https://github.com/vmware-tanzu/velero/pull/1653/files

8b805d5

github-actions bot added the Area/Design Design Documents label Oct 28, 2022

kaovilai changed the title ~~Enable concurrency of operations using worker pod for Backup and Restore~~ Design to enable concurrency of operations using worker pod for backup and restore Oct 28, 2022

kaovilai force-pushed the design-concurrent-backup branch 7 times, most recently from 2e51083 to 5aa815a Compare October 31, 2022 07:23

add details on k8s jobs

2aeac0c

kaovilai force-pushed the design-concurrent-backup branch from 5aa815a to 2aeac0c Compare November 17, 2022 11:00

benedikt-bartscher mentioned this pull request Apr 12, 2023

[RFE] support running restic backups concurrently #1531

Closed

kaovilai changed the title ~~Design to enable concurrency of operations using worker pod for backup and restore~~ Design to enable concurrency of operations in velero pod for backup and restore Aug 9, 2023

sseago closed this Aug 10, 2023
