Commit 5aa815a: update
Signed-off-by: Tiger Kaovilai <[email protected]>
kaovilai committed Oct 31, 2022 (1 parent: 8b805d5)
Showing 1 changed file: design/worker-pods.md (103 additions, 65 deletions)
# Worker Jobs for Backup, Restore

This document proposes a new approach to executing backups and restores where each operation is run in its own "worker job" rather than in the main Velero server pod. This approach has significant benefits for concurrency, scalability, and observability.

## Goals

- Enable multiple backups/restores to be run concurrently by running each operation in its own worker job.
- Improve Velero's scalability by distributing work across multiple pods.
- Allow logs for in-progress backups/restores to be streamed by the user.

## Non Goals

- Adding concurrency *within* a single backup or restore.
- Creating CronJobs for scheduled backups.

## Background
Because each Velero controller is configured to run with a single worker, only one backup or restore can be processed at a time.

## High-Level Design

Velero controllers will no longer directly execute backup/restore logic themselves (note: the rest of this document will refer only to backups, for brevity, but it applies equally to restores). Instead, when the controller is informed of a new backup custom resource, it will immediately create a new worker job which is responsible for end-to-end processing of the backup, including validating the spec, scraping the Kubernetes API server for resources, triggering persistent volume snapshots, writing data to object storage, and updating the custom resource's status as appropriate.

A worker job will be given a deterministic name based on the name of the backup it's executing. This will prevent Velero from inadvertently creating multiple worker jobs for the same backup, since any subsequent attempt to create a job with that name will fail. Additionally, Velero can check the Backup Storage Location to see whether the backup already exists in object storage; if it does, worker job creation is not attempted in the first place.
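As a rough illustration, the deterministic name could be derived directly from the backup name, truncating and appending a short hash when the result would exceed Kubernetes name length limits. The helper below is a minimal sketch under those assumptions; the function name and hashing scheme are illustrative, not part of this design.

```go
// Sketch: derive a stable worker Job name from a Backup name so that repeated
// reconciles of the same Backup always attempt to create the same Job.
package worker

import (
	"crypto/sha256"
	"fmt"
)

// maxNameLen is kept at 63 so the name can also be reused as a label value.
const maxNameLen = 63

func JobNameForBackup(backupName string) string {
	name := fmt.Sprintf("backup-%s", backupName)
	if len(name) <= maxNameLen {
		return name
	}
	// Truncate long names and append a short hash of the original backup name
	// so two long backup names never collide on the same Job name.
	sum := sha256.Sum256([]byte(backupName))
	return fmt.Sprintf("%s-%x", name[:maxNameLen-9], sum[:4])
}
```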

[Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/) create a worker pod and retry on failure until the pod runs to completion. Using Jobs means we don't have to re-implement rescheduling of failed worker pods; we can leave that to Kubernetes, as long as the worker pod accurately reports failure (i.e., exits non-zero when the backup fails).

This design trivially enables running multiple backups concurrently, as each one runs in its own isolated pod and the Velero server's backup controller does not need to wait for the backup to complete before spawning a worker job for the next one. Additionally, Velero becomes much more scalable, as the resource requirements for the Velero server itself are largely unaffected by the number of backups. Instead, Velero's scalability becomes limited only by the total amount of resources available in the cluster to process worker jobs.

## Detailed Design

A new hidden command will be added to the velero binary, `velero backup run BACKUP_NAME`, which will accept the following flags:
```bash
--client-burst
--client-qps
--backup-storage-location
--backup-ttl
--volume-snapshot-locations
--log-level
--unified-repo-timeout
```
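For illustration, the worker command could be registered as a hidden cobra command whose flags mirror the list above. This is a sketch only; the option struct, default values, and elided `RunE` body (the end-to-end backup processing) are assumptions, not the actual implementation.

```go
// Sketch: wiring up the hidden `velero backup run` worker command with cobra.
package cli

import (
	"time"

	"github.com/spf13/cobra"
)

func NewBackupRunCommand() *cobra.Command {
	opts := struct {
		clientBurst           int
		clientQPS             float32
		backupStorageLocation string
		backupTTL             time.Duration
		volumeSnapshotLocs    []string
		logLevel              string
		unifiedRepoTimeout    time.Duration
	}{}

	cmd := &cobra.Command{
		Use:    "run BACKUP_NAME",
		Hidden: true, // not intended to be invoked directly by users
		Args:   cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// End-to-end processing of the backup named args[0] would live here.
			return nil
		},
	}
	cmd.Flags().IntVar(&opts.clientBurst, "client-burst", 30, "maximum burst for API client throttle")
	cmd.Flags().Float32Var(&opts.clientQPS, "client-qps", 20, "QPS for API client throttle")
	cmd.Flags().StringVar(&opts.backupStorageLocation, "backup-storage-location", "", "backup storage location to use")
	cmd.Flags().DurationVar(&opts.backupTTL, "backup-ttl", 30*24*time.Hour, "backup time-to-live")
	cmd.Flags().StringSliceVar(&opts.volumeSnapshotLocs, "volume-snapshot-locations", nil, "volume snapshot locations to use")
	cmd.Flags().StringVar(&opts.logLevel, "log-level", "info", "log level")
	cmd.Flags().DurationVar(&opts.unifiedRepoTimeout, "unified-repo-timeout", time.Hour, "timeout for unified repository operations")
	return cmd
}
```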

`velero restore run` will accept the following flags:
--client-burst
--client-qps
--log-level
--unified-repo-timeout
--restore-resource-priorities
--terminating-resource-timeout
```

### Controller changes

Most of the logic currently in the backup controller will be moved into the `velero backup run` command. The controller will be updated so that when it's informed of a new backup custom resource, it attempts to create a new worker job for the backup. If the create is successful or a worker job already exists with the specified name, no further action is needed and the backup will not be re-added to the controller's work queue.

Additionally, the backup controller will be given a resync function that periodically re-enqueues all backups, to ensure that no new backups are missed.
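A minimal sketch of what that reconcile logic might look like using controller-runtime is shown below; the type and helper names are illustrative, and the Job spec is elided (the full manifest appears later in this document).

```go
// Sketch: the backup controller only creates the worker Job; AlreadyExists is
// treated as "a worker Job already owns this backup", so nothing is re-queued.
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

type backupReconciler struct {
	client.Client
}

func (r *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var backup velerov1.Backup
	if err := r.Get(ctx, req.NamespacedName, &backup); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only new backups need a worker Job; in-progress and completed backups
	// are handled end-to-end by the Job that already exists for them.
	if backup.Status.Phase != "" && backup.Status.Phase != velerov1.BackupPhaseNew {
		return ctrl.Result{}, nil
	}

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Deterministic name derived from the backup name (see above).
			Name:      "backup-" + backup.Name,
			Namespace: backup.Namespace,
		},
		// Spec elided; the full worker Job manifest is shown below.
	}
	if err := r.Create(ctx, job); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```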

#### Diagram of the backup controller's new behavior using a Job with `backoffLimit: 3`

<!-- this diagram is viewable on https://github.com/kaovilai/velero/blob/design-concurrent-backup/design/worker-pods.md -->

```mermaid
graph LR;
subgraph Velero Managed Resources
A[Backup CR-1] & C[Backup CR-2] & M[Backup CR-3] -->|Watched by|B;
B((Velero Controller))-->|Create|D & E & N;
end
D[Worker Job-1]-->|Create|F & FF;
E[Worker Job-2]-->|Create|G;
N[Worker Job-3]-->|Create|O & OO & OOO;
F((Worker Pod#1))-->|PodStatus|J;
FF((Worker Pod#2))-->|PodStatus|L;
G((Worker Pod#1))-->|PodStatus|K;
O((Worker Pod#1))-->|PodStatus|Q;
OO((Worker Pod#2))-->|PodStatus|R;
OOO((Worker Pod#3))-->|PodStatus|S;
J[Failed];
L[Succeeded]-->|Update job status|D;
K[Succeeded]-->|Update job status|E;
Q[Failed];
R[Failed];
S[Failed]-->|Update job status|N;
N-->|Failed|B;
E-->|Succeeded|B;
D-->|Succeeded|B;
```
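When a worker Job reports a terminal condition, the controller can fold that result back into the backup's status. The helper below is a sketch of that mapping under the `backoffLimit: 3` setup above; the function name and the exact phase transitions are assumptions.

```go
// Sketch: translate a worker Job's terminal condition into a Backup phase.
// Returns false while the Job is still running or retrying.
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func backupPhaseForJob(job *batchv1.Job) (velerov1.BackupPhase, bool) {
	for _, cond := range job.Status.Conditions {
		if cond.Status != corev1.ConditionTrue {
			continue
		}
		switch cond.Type {
		case batchv1.JobComplete:
			return velerov1.BackupPhaseCompleted, true
		case batchv1.JobFailed: // all retries (backoffLimit) exhausted
			return velerov1.BackupPhaseFailed, true
		}
	}
	return "", false
}
```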

The spec for the worker job will look like the following:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    component: velero
  name: <BACKUP_NAME>
  uid: <BACKUP_UID>
spec:
  ttlSecondsAfterFinished: 100 # on k8s 1.23+, removes the Job and its dependent pods once it completes
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never # required for Job pod templates; Never matches the diagram above (each retry is a new worker pod)
      containers:
      - args:
        - <BACKUP_NAME>
        - --client-burst=<VAL>
        - --client-qps=<VAL>
        - --backup-storage-location=<VAL>
        - --backup-ttl=<VAL>
        - --volume-snapshot-locations=<VAL>
        - --log-level=<VAL>
        - --unified-repo-timeout=<VAL>
        command:
        - /velero
        - backup
        - run
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        image: <VELERO_SERVER_IMAGE>
        imagePullPolicy: IfNotPresent
        name: velero
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: velero-token-plwsw
          readOnly: true
      serviceAccount: velero
      volumes:
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
      - name: velero-token-plwsw
        secret:
          defaultMode: 420
          secretName: velero-token-plwsw
```
### Updates to `velero backup logs`

The `velero backup logs` command will be modified so that if a backup is in progress, the logs are retrieved from the stdout of the worker job's current pod (essentially proxying `kubectl -n velero logs POD_NAME`). A `-f/--follow` flag will be added to allow streaming of logs for an in-progress backup.
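A minimal sketch of how that retrieval could work with client-go is shown below, assuming worker pods carry a `velero.io/backup-name` label; the function name, label selector, and hard-coded `velero` namespace are illustrative assumptions.

```go
// Sketch: locate the worker Job's current pod for a backup and stream its logs.
package cli

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func streamBackupLogs(ctx context.Context, kc kubernetes.Interface, backupName string, follow bool, out io.Writer) error {
	pods, err := kc.CoreV1().Pods("velero").List(ctx, metav1.ListOptions{
		LabelSelector: "velero.io/backup-name=" + backupName,
	})
	if err != nil {
		return err
	}
	if len(pods.Items) == 0 {
		return fmt.Errorf("no worker pod found for backup %q", backupName)
	}

	// Follow=true keeps the stream open while the backup is still running.
	req := kc.CoreV1().Pods("velero").GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{Follow: follow})
	logs, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer logs.Close()

	_, err = io.Copy(out, logs)
	return err
}
```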

Once a backup is complete, the logs will continue to be uploaded to object storage, and `velero backup logs` will still fetch them from there.


- Given that multiple backups/restores could be running concurrently, we need to consider possible areas of contention/conflict between jobs, including (but not limited to):
  - exec hooks (i.e., we don't want to run `fsfreeze` twice on the same pod)
  - unified-repo backups and restores on the same volume
- Currently, unified-repo repository lock management is handled by an in-process lock manager in the Velero server. In order for backups/restores to safely run concurrently, the design for unified-repo lock management needs to change. There is an [open issue](https://github.com/heptio/velero/issues/1540) for this which is currently not prioritized.
- There are several prometheus metrics that are emitted as part of the backup process. Since backups will no longer be running in the Velero server, we need to find a way to expose those values. One option is to store any value that would feed a metric as a field on the backup's `status`, and to have the Velero server scrape values from there for completed backups. Another option is to use the [Prometheus push gateway](https://prometheus.io/docs/practices/pushing/).
- Over time, many completed worker pods will exist in the `velero` namespace. We need to consider whether this poses any issue and whether we should garbage-collect them more aggressively.
