add design for data mover preserve local snapshot
Signed-off-by: Lyndon-Li <[email protected]>
Lyndon-Li committed Oct 23, 2023
1 parent 5fe53da commit 3d72d25
Showing 4 changed files with 54 additions and 4 deletions.
@@ -269,6 +269,10 @@ spec:
description: OperationTimeout specifies the time used to wait internal
operations, e.g., wait the CSI snapshot to become readyToUse.
type: string
retainSnapshot:
description: RetainSnapshot specifies whether to retain the snapshot
after backup completes.
type: boolean
snapshotType:
description: SnapshotType is the type of the snapshot to be backed
up.
@@ -335,6 +339,10 @@ spec:
format: int64
type: integer
type: object
retainedSnapshot:
description: RetainedSnapshot is name of the snapshot that has been
retained.
type: string
snapshotID:
description: SnapshotID is the identifier for the snapshot in the
backup repository.
@@ -637,9 +645,39 @@ In DUCR/DDCR’s status, we have fields like ```totalBytes``` and ```doneBytes``
- Call ```kubectl get dataupload -n velero xxx``` or ```kubectl get datadownload -n velero xxx```.
- Call ```velero backup describe --details```. This is implemented as part of BIA/RIA V2: the above values are transferred to the async operation, and this command retrieves them from the async operation instead of from the DUCR/DDCR. See [general progress monitoring design][2] for details

## Retain Native Snapshots
Users are allowed to specify a global number as the number of native snapshots to be retained for each volume:
- If the number is not specified, or if it is not a positive value, no native snapshot is retained
- Otherwise, the value defines the maximum number of snapshots to be retained for each volume, i.e., the limit. If the limit is exceeded, the oldest retained snapshot is removed

The limit is set as an annotation on the backup CR; the annotation is called ```backup.velero.io/data-mover-snapshot-to-retain```. In this way, CMPs are able to query the number from the backup CR.
CMP is responsible for maintaining the native snapshots. Specifically, it is able to list/count the retained snapshots for each volume, and remove the old ones once the limit is exceeded.
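
As a minimal sketch of how a plugin might read the limit from the backup CR's annotations, the Go snippet below applies the rules stated above (missing or non-positive means retain nothing). The function name `retainLimit` is hypothetical, and treating an unparsable value as "retain nothing" is an assumption of this sketch, not something the design specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

// retainAnnotation is the annotation key named in this design.
const retainAnnotation = "backup.velero.io/data-mover-snapshot-to-retain"

// retainLimit (hypothetical helper) returns the number of native snapshots
// to retain per volume. A missing or non-positive value yields 0, meaning
// no native snapshot is retained. An unparsable value is also mapped to 0;
// that choice is an assumption of this sketch.
func retainLimit(annotations map[string]string) int {
	raw, ok := annotations[retainAnnotation]
	if !ok {
		return 0
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(retainLimit(map[string]string{retainAnnotation: "3"}))
	fmt.Println(retainLimit(map[string]string{retainAnnotation: "-1"}))
	fmt.Println(retainLimit(nil))
}
```
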

Practically, CMP creates a snapshot associated object (SASOO) for each retained snapshot, so it can list/count the retained snapshots by listing/counting the SASOOs through the Kubernetes API at any time after the backup. For some CMPs, some Kubernetes objects are created as part of the snapshot creation, and they can choose any of those objects as SASOOs. For others, if no Kubernetes objects are necessarily created during the snapshot creation, they can purposefully create some Kubernetes objects (e.g., ConfigMaps) as SASOOs. For the Velero CSI plugin, the VolumeSnapshotContent objects act as SASOOs.
To assist with the listing/counting, several labels are applied to SASOOs:
- ```velero.io/snapshot-alias-name```: The CMP gives an alias to each SASOO. As mentioned above, Velero/CMP should not expect DMs to keep the snapshots and their associated objects unchanged, e.g., a DM may delete the VS/VSC created by the CSI plugin and create new ones from them (so the new ones also represent the same snapshots). By giving an alias, CMPs are guaranteed to find the SASOOs by the alias label even though they cannot know where the DMs have cloned the SASOOs. This also means that DMs should inherit all the labels from the original SASOOs
- ```velero.io/snapshot-volume-name```: This label stores the identity of the volume for which a snapshot is created. For example, the value could be a PVC's namespaced name or UID. In this way, the CMP is able to tell all the retained snapshots for a specific volume
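
The labelling scheme above can be sketched in Go as follows. The two label keys come from this design; the helper names `sasooLabels` and `volumeSelector`, and the sample alias and UID values, are hypothetical. The selector string uses the standard Kubernetes equality-based form `key=value`, and the sketch uses a PVC UID as the volume identity.

```go
package main

import "fmt"

// Label keys defined in this design.
const (
	aliasLabel  = "velero.io/snapshot-alias-name"
	volumeLabel = "velero.io/snapshot-volume-name"
)

// sasooLabels (hypothetical helper) builds the labels to apply to a SASOO.
// A DM that clones the SASOO is expected to copy these labels unchanged.
func sasooLabels(alias, volumeID string) map[string]string {
	return map[string]string{
		aliasLabel:  alias,
		volumeLabel: volumeID,
	}
}

// volumeSelector (hypothetical helper) returns an equality-based label
// selector that matches every retained snapshot of one volume.
func volumeSelector(volumeID string) string {
	return volumeLabel + "=" + volumeID
}

func main() {
	// Sample values below are made up for illustration.
	labels := sasooLabels("backup-1-vol-a", "0f6b8c1e-pvc-uid")
	fmt.Println(labels[aliasLabel], labels[volumeLabel])
	fmt.Println(volumeSelector("0f6b8c1e-pvc-uid"))
}
```

A selector built this way could be passed as the label selector of a list call on whatever object kind the plugin uses as SASOOs.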

DMPs then return the SASOOs as ```itemToUpdate``` with their aliases to the Velero backup. In this way, when the DM execution finishes, Velero gets the latest SASOOs by their aliases and persists them to the backup storage.
Velero, specifically the backup sync controller, is responsible for syncing the SASOOs to the target cluster for restore if they are not there. This guarantees that all the SASOOs are available as long as their associated backups are there, so for every restore, DMPs always see the full list of SASOOs for all the volumes to be restored.


After the retained snapshots are counted against the limit, if the limit is exceeded, the DMP needs to make sure the old snapshots are removed before creating the new snapshot; otherwise, the snapshot creation may fail, as some storage has a hard limit on the number of snapshots.

DMPs also need to tell DMs to retain a snapshot. This is done through the ```retainSnapshot``` field in the DUCR's spec. This is a boolean value; if it is true, DMs will not delete the snapshot after their execution finishes.

During restore, DMPs check the existence of the SASOO for a volume of a backup: if it exists, they restore the data from the native snapshot; otherwise, they submit DDCRs to do a data movement.
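
The restore-time decision can be sketched in Go as below. The `restorePath` type, its two values, and `chooseRestorePath` are hypothetical names, and the set of synced SASOO aliases is modelled as a plain map rather than a Kubernetes lookup.

```go
package main

import "fmt"

// restorePath is a hypothetical enum for the two restore routes.
type restorePath string

const (
	fromNativeSnapshot restorePath = "native-snapshot"
	fromDataMovement   restorePath = "data-movement"
)

// chooseRestorePath picks the native snapshot when a SASOO with the given
// alias is present in the cluster; otherwise it falls back to a data
// movement (i.e., submitting a DDCR).
func chooseRestorePath(syncedAliases map[string]bool, alias string) restorePath {
	if syncedAliases[alias] {
		return fromNativeSnapshot
	}
	return fromDataMovement
}

func main() {
	// Sample aliases below are made up for illustration.
	synced := map[string]bool{"backup-1-vol-a": true}
	fmt.Println(chooseRestorePath(synced, "backup-1-vol-a"))
	fmt.Println(chooseRestorePath(synced, "backup-1-vol-b"))
}
```
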

For the Velero CSI plugin, the existing logic is reused for restoring retained native snapshots, with some adjustments in the workflow.
The diagram below shows how a snapshot-retained backup happens for Velero and the Velero CSI plugin:
![backup-sequence-retained-snapshot.png](backup-sequence-retained-snapshot.png)
The diagram below shows how a snapshot-retained restore happens for Velero and the Velero CSI plugin when the native snapshot is available:
![restore-sequence-retained-snapshot-not-avai.png](restore-sequence-retained-snapshot-not-avai.png)
The diagram below shows how a snapshot-retained restore happens for Velero and the Velero CSI plugin when the native snapshot is not available:
![restore-sequence-retained-snapshot-avai.png](restore-sequence-retained-snapshot-avai.png)

## Backup Sync
DUCR contains the information that is required during restore but as mentioned above, it will not be synced because during restore its information is retrieved dynamically. Therefore, we have no change to Backup Sync.
For retained snapshots, as mentioned above, the backup sync controller finds the SASOOs for a backup in the backup storage by the ```velero.io/snapshot-alias-name``` label and syncs them to the target cluster during backup sync. For CSI snapshots, Velero has existing logic to sync VolumeSnapshotContents; this logic will be reused.


## Backup Deletion
Once a backup is deleted, the data in the backup repository should be deleted as well. On the other hand, the data is created by the specific DM, Velero doesn't know how to delete the data. Therefore, Velero relies on the DM to delete the backup data.
@@ -655,6 +693,9 @@ As the current workflow, when ```velero backup delete``` CLI is called, a ```del
- Otherwise, if any error happens during the processing, the ```deletebackuprequests``` CR will be left there with the ```velero.io/dm-delete-backup``` finalizer, as well as the failed DUCRs
- DMs may use a periodical manner to retry the failed delete requests

Once the backup is deleted, the retained native snapshots should also be deleted. Velero is able to delete the SASOOs as they are part of the backup. If any particular operations are required for deleting the snapshots, the DMP that creates the snapshots needs to implement a DIA (DeleteItemAction) on the SASOOs.

For CSI snapshots, the CSI plugin implements a DIA on VolumeSnapshotContent objects, so that the snapshots can be removed appropriately.

## Restarts
If Velero restarts during a data movement activity, the backup/restore will be marked as failed when Velero server comes back, by this time, Velero will request a cancellation to the ongoing data movement.
If DM restarts, Velero has no way to detect this, DM is expected to:
@@ -890,11 +931,12 @@ Conclusively, below are the steps plugin DMs need to do in order to integrate to
- Set PV's ```claimRef``` to the provided PVC and set ```velero.io/dynamic-pv-restore``` label
## Working Mode
It doesn’t mean that once the data movement feature is enabled users must move every snapshot. We will support below two working modes:
It doesn’t mean that once the data movement feature is enabled users must move every snapshot. We will support below three working modes:
- Don’t move snapshots. This is same with the existing CSI snapshot feature, that is, native snapshots are taken and kept
- Move snapshot data and delete native snapshots. This means that once the data movement completes, the native snapshots will be deleted.
- Move snapshot data and keep X native snapshots. This means snapshot data is moved first and several native snapshots are also kept according to the user's configuration.
For this purpose, we need to add a new option in the backup command as well as the Backup CRD.
For this purpose, we need to add new options in the backup command, as well as in the Backup CRD and the Velero server parameters.
The same option for restore will be retrieved from the specified backup, so that the working mode is consistent.
## Backup and Restore CRD Changes
@@ -909,11 +951,19 @@ We add below new fields in the Backup CRD:
// If DataMover is "" or "velero", the built-in data mover will be used.
// +optional
DataMover string `json:"datamover,omitempty"`

// RetainSnapshot specifies whether to retain the snapshot after backup completes.
// +optional
RetainSnapshot bool `json:"retainSnapshot,omitempty"`
```
SnapshotMoveData will be used to decide the Working Mode.
DataMover will be used to decide the data mover to handle the DUCR. DUCR's DataMover value is derived from this value.
DataMover will be used to decide the data mover to handle the DUCR. DUCR's DataMover value is derived from this value.
RetainSnapshot is used to tell the DM whether the native snapshots should be retained.
As mentioned in the Plugin Data Movers section, the data movement information for a restore should be the same as for the backup. Therefore, the working mode for restore should be decided by checking the corresponding Backup CR; when creating a DDCR, the DataMover value should be retrieved from the corresponding Backup Result; and the retained snapshots should be retrieved from the SASOOs synced along with the backup. As a result, no changes are required for the Restore CRD.
As mentioned in the Plugin Data Movers section, the data movement information for a restore should be the same with the backup. Therefore, the working mode for restore should be decided by checking the corresponding Backup CR; when creating a DDCR, the DataMover value should be retrieved from the corresponding Backup Result.
## Velero Server parameter Changes
We add a new Velero server parameter, ```data-mover-snapshot-to-retain```, as a global configuration for users to specify how many native snapshots should be retained for each volume. For more information on how native snapshot retention works, see the [Retain Native Snapshots] section.
## Logging
The logs during the data movement are categorized as below: