Commit 5aa815a: update
Signed-off-by: Tiger Kaovilai <[email protected]>
kaovilai committed Oct 31, 2022 (1 parent: 8b805d5)
Showing 1 changed file: design/worker-pods.md (103 additions, 65 deletions)
# Worker Jobs for Backup, Restore

This document proposes a new approach to executing backups and restores where each operation is run in its own "worker job" rather than in the main Velero server pod. This approach has significant benefits for concurrency, scalability, and observability.

## Goals

- Enable multiple backups/restores to be run concurrently by running each operation in its own worker job.
- Improve Velero's scalability by distributing work across multiple pods.
- Allow logs for in-progress backups/restores to be streamed by the user.

## Non Goals

- Adding concurrency *within* a single backup or restore.
- Creating CronJobs for scheduled backups.

## Background
Because each Velero controller is configured to run with a single worker, only one backup or restore can be processed at a time.

## High-Level Design

Velero controllers will no longer directly execute backup/restore logic themselves (note: the rest of this document will refer only to backups, for brevity, but it applies equally to restores). Instead, when the controller is informed of a new backup custom resource, it will immediately create a new worker job which is responsible for end-to-end processing of the backup, including validating the spec, scraping the Kubernetes API server for resources, triggering persistent volume snapshots, writing data to object storage, and updating the custom resource's status as appropriate.

A worker job will be given a deterministic name based on the name of the backup it's executing. This will prevent Velero from inadvertently creating multiple worker jobs for the same backup, since any subsequent attempt to create a job with that name will fail. Additionally, Velero can check the Backup Storage Location to see whether the backup already exists in object storage; if it does, worker job creation is not attempted in the first place.
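As a rough illustration, the deterministic name could be derived directly from the backup name, truncating and appending a short hash when the result would exceed Kubernetes name length limits. The helper below is a minimal sketch under those assumptions; the function name and hashing scheme are illustrative, not part of this design.

```go
// Sketch: derive a stable worker Job name from a Backup name so that repeated
// reconciles of the same Backup always attempt to create the same Job.
package worker

import (
	"crypto/sha256"
	"fmt"
)

// maxNameLen is kept at 63 so the name can also be reused as a label value.
const maxNameLen = 63

func JobNameForBackup(backupName string) string {
	name := fmt.Sprintf("backup-%s", backupName)
	if len(name) <= maxNameLen {
		return name
	}
	// Truncate long names and append a short hash of the original backup name
	// so two long backup names never collide on the same Job name.
	sum := sha256.Sum256([]byte(backupName))
	return fmt.Sprintf("%s-%x", name[:maxNameLen-9], sum[:4])
}
```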

[Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/) create a worker pod and retry on failure until the pod runs to completion. Using Jobs means we don't have to re-implement rescheduling of failed worker pods; we can leave that to Kubernetes, as long as the worker pod accurately reports failure (i.e., exits non-zero when the backup fails).

This design trivially enables running multiple backups concurrently, as each one runs in its own isolated pod and the Velero server's backup controller does not need to wait for the backup to complete before spawning a worker job for the next one. Additionally, Velero becomes much more scalable, as the resource requirements for the Velero server itself are largely unaffected by the number of backups. Instead, Velero's scalability becomes limited only by the total amount of resources available in the cluster to process worker jobs.

## Detailed Design

A new hidden command will be added to the velero binary, `velero backup run BACKUP_NAME`, which will accept the following flags:
```bash
--client-burst
--client-qps
--backup-storage-location
--backup-ttl
--volume-snapshot-locations
--log-level
--unified-repo-timeout
```
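For illustration, the worker command could be registered as a hidden cobra command whose flags mirror the list above. This is a sketch only; the option struct, default values, and elided `RunE` body (the end-to-end backup processing) are assumptions, not the actual implementation.

```go
// Sketch: wiring up the hidden `velero backup run` worker command with cobra.
package cli

import (
	"time"

	"github.com/spf13/cobra"
)

func NewBackupRunCommand() *cobra.Command {
	opts := struct {
		clientBurst           int
		clientQPS             float32
		backupStorageLocation string
		backupTTL             time.Duration
		volumeSnapshotLocs    []string
		logLevel              string
		unifiedRepoTimeout    time.Duration
	}{}

	cmd := &cobra.Command{
		Use:    "run BACKUP_NAME",
		Hidden: true, // not intended to be invoked directly by users
		Args:   cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// End-to-end processing of the backup named args[0] would live here.
			return nil
		},
	}
	cmd.Flags().IntVar(&opts.clientBurst, "client-burst", 30, "maximum burst for API client throttle")
	cmd.Flags().Float32Var(&opts.clientQPS, "client-qps", 20, "QPS for API client throttle")
	cmd.Flags().StringVar(&opts.backupStorageLocation, "backup-storage-location", "", "backup storage location to use")
	cmd.Flags().DurationVar(&opts.backupTTL, "backup-ttl", 30*24*time.Hour, "backup time-to-live")
	cmd.Flags().StringSliceVar(&opts.volumeSnapshotLocs, "volume-snapshot-locations", nil, "volume snapshot locations to use")
	cmd.Flags().StringVar(&opts.logLevel, "log-level", "info", "log level")
	cmd.Flags().DurationVar(&opts.unifiedRepoTimeout, "unified-repo-timeout", time.Hour, "timeout for unified repository operations")
	return cmd
}
```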

`velero restore run` will accept the following flags:
--client-burst
--client-qps
--log-level
--unified-repo-timeout
--restore-resource-priorities
--terminating-resource-timeout
```

### Controller changes

Most of the logic currently in the backup controller will be moved into the `velero backup run` command. The controller will be updated so that when it's informed of a new backup custom resource, it attempts to create a new worker job for the backup. If the create is successful or a worker job already exists with the specified name, no further action is needed and the backup will not be re-added to the controller's work queue.

Additionally, the backup controller will be given a resync function that periodically re-enqueues all backups, to ensure that no new backups are missed.
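A minimal sketch of what that reconcile logic might look like using controller-runtime is shown below; the type and helper names are illustrative, and the Job spec is elided (the full manifest appears later in this document).

```go
// Sketch: the backup controller only creates the worker Job; AlreadyExists is
// treated as "a worker Job already owns this backup", so nothing is re-queued.
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

type backupReconciler struct {
	client.Client
}

func (r *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var backup velerov1.Backup
	if err := r.Get(ctx, req.NamespacedName, &backup); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only new backups need a worker Job; in-progress and completed backups
	// are handled end-to-end by the Job that already exists for them.
	if backup.Status.Phase != "" && backup.Status.Phase != velerov1.BackupPhaseNew {
		return ctrl.Result{}, nil
	}

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Deterministic name derived from the backup name (see above).
			Name:      "backup-" + backup.Name,
			Namespace: backup.Namespace,
		},
		// Spec elided; the full worker Job manifest is shown below.
	}
	if err := r.Create(ctx, job); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```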

#### Diagram of the backup controller's new behavior using a Job with `backoffLimit: 3`

<!-- this diagram is viewable on https://github.com/kaovilai/velero/blob/design-concurrent-backup/design/worker-pods.md -->

```mermaid
graph LR;
subgraph Velero Managed Resources
A[Backup CR-1] & C[Backup CR-2] & M[Backup CR-3] -->|Watched by|B;
B((Velero Controller))-->|Create|D & E & N;
end
D[Worker Job-1]-->|Create|F & FF;
E[Worker Job-2]-->|Create|G;
N[Worker Job-3]-->|Create|O & OO & OOO;
F((Worker Pod#1))-->|PodStatus|J;
FF((Worker Pod#2))-->|PodStatus|L;
G((Worker Pod#1))-->|PodStatus|K;
O((Worker Pod#1))-->|PodStatus|Q;
OO((Worker Pod#2))-->|PodStatus|R;
OOO((Worker Pod#3))-->|PodStatus|S;
J[Failed];
L[Succeeded]-->|Update job status|D;
K[Succeeded]-->|Update job status|E;
Q[Failed];
R[Failed];
S[Failed]-->|Update job status|N;
N-->|Failed|B;
E-->|Succeeded|B;
D-->|Succeeded|B;
```
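When a worker Job reports a terminal condition, the controller can fold that result back into the backup's status. The helper below is a sketch of that mapping under the `backoffLimit: 3` setup above; the function name and the exact phase transitions are assumptions.

```go
// Sketch: translate a worker Job's terminal condition into a Backup phase.
// Returns false while the Job is still running or retrying.
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func backupPhaseForJob(job *batchv1.Job) (velerov1.BackupPhase, bool) {
	for _, cond := range job.Status.Conditions {
		if cond.Status != corev1.ConditionTrue {
			continue
		}
		switch cond.Type {
		case batchv1.JobComplete:
			return velerov1.BackupPhaseCompleted, true
		case batchv1.JobFailed: // all retries (backoffLimit) exhausted
			return velerov1.BackupPhaseFailed, true
		}
	}
	return "", false
}
```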

The spec for the worker job will look like the following:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    component: velero
  name: <BACKUP_NAME>
  uid: <BACKUP_UID>
spec:
  ttlSecondsAfterFinished: 100 # on k8s 1.23+, removes the Job and its dependent pods once it completes
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never # required for Job pod templates; Never matches the diagram above (each retry is a new worker pod)
      containers:
      - args:
        - <BACKUP_NAME>
        - --client-burst=<VAL>
        - --client-qps=<VAL>
        - --backup-storage-location=<VAL>
        - --backup-ttl=<VAL>
        - --volume-snapshot-locations=<VAL>
        - --log-level=<VAL>
        - --unified-repo-timeout=<VAL>
        command:
        - /velero
        - backup
        - run
        env:
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        image: <VELERO_SERVER_IMAGE>
        imagePullPolicy: IfNotPresent
        name: velero
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: velero-token-plwsw
          readOnly: true
      serviceAccount: velero
      volumes:
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
      - name: velero-token-plwsw
        secret:
          defaultMode: 420
          secretName: velero-token-plwsw
```
### Updates to `velero backup logs`

The `velero backup logs` command will be modified so that if a backup is in progress, the logs are retrieved from the stdout of the worker job's current pod (essentially proxying `kubectl -n velero logs POD_NAME`). A `-f/--follow` flag will be added to allow streaming of logs for an in-progress backup.
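A minimal sketch of how that retrieval could work with client-go is shown below, assuming worker pods carry a `velero.io/backup-name` label; the function name, label selector, and hard-coded `velero` namespace are illustrative assumptions.

```go
// Sketch: locate the worker Job's current pod for a backup and stream its logs.
package cli

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func streamBackupLogs(ctx context.Context, kc kubernetes.Interface, backupName string, follow bool, out io.Writer) error {
	pods, err := kc.CoreV1().Pods("velero").List(ctx, metav1.ListOptions{
		LabelSelector: "velero.io/backup-name=" + backupName,
	})
	if err != nil {
		return err
	}
	if len(pods.Items) == 0 {
		return fmt.Errorf("no worker pod found for backup %q", backupName)
	}

	// Follow=true keeps the stream open while the backup is still running.
	req := kc.CoreV1().Pods("velero").GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{Follow: follow})
	logs, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer logs.Close()

	_, err = io.Copy(out, logs)
	return err
}
```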

Once a backup is complete, the logs will continue to be uploaded to object storage, and `velero backup logs` will still fetch them from there.


- Given that multiple backups/restores could be running concurrently, we need to consider possible areas of contention/conflict between jobs, including (but not limited to):
  - exec hooks (i.e., we don't want to run `fsfreeze` twice on the same pod)
  - unified-repo backups and restores on the same volume
- Currently, unified-repo repository lock management is handled by an in-process lock manager in the Velero server. In order for backups/restores to safely run concurrently, the design for unified-repo lock management needs to change. There is an [open issue](https://github.com/heptio/velero/issues/1540) for this which is currently not prioritized.
- There are several prometheus metrics that are emitted as part of the backup process. Since backups will no longer be running in the Velero server, we need to find a way to expose those values. One option is to store any value that would feed a metric as a field on the backup's `status`, and to have the Velero server scrape values from there for completed backups. Another option is to use the [Prometheus push gateway](https://prometheus.io/docs/practices/pushing/).
- Over time, many completed worker pods will exist in the `velero` namespace. We need to consider whether this poses any issue and whether we should garbage-collect them more aggressively.
