
Add design for data mover preserve local snapshot #7002

Conversation

Lyndon-Li
Contributor

@Lyndon-Li Lyndon-Li commented Oct 23, 2023

Design for #6240. Add design for data mover preserve local snapshot

@Lyndon-Li Lyndon-Li changed the title add design for data mover preserve local snapshot Add design for data mover preserve local snapshot Oct 23, 2023
@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from 3d72d25 to fc7aea4 Compare October 23, 2023 10:06
@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch 4 times, most recently from 558b87d to a568e57 Compare October 23, 2023 13:14
@codecov

codecov bot commented Oct 23, 2023

Codecov Report

Merging #7002 (c0878bf) into main (5fe53da) will decrease coverage by 0.01%.
Report is 18 commits behind head on main.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #7002      +/-   ##
==========================================
- Coverage   61.05%   61.05%   -0.01%     
==========================================
  Files         251      252       +1     
  Lines       26846    26855       +9     
==========================================
+ Hits        16390    16395       +5     
- Misses       9303     9305       +2     
- Partials     1153     1155       +2     

see 11 files with indirect coverage changes

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch 3 times, most recently from b57c779 to 59ab6fb Compare October 23, 2023 13:36
@Lyndon-Li Lyndon-Li marked this pull request as ready for review October 23, 2023 13:37
Contributor
@kaovilai kaovilai left a comment

Can you add the source diagram file export (non-PNG/JPG) to the PR as well? It would make any future edits easier.

- If the number is not specified, or if it is not a positive value, no native snapshot is to be retained
- Otherwise, the value defines the max snapshots to be retained for each volume, a.k.a. the limit. If the limit is exceeded, the oldest retained snapshot will be removed

The limit number is set as an annotation on the backup CR, called ```backup.velero.io/data-mover-snapshot-to-retain```. In this way, CMPs are able to query the number from the backup CR.
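As a sketch of the querying side, reading the limit from the backup CR's annotations could look roughly like this. Only the annotation key comes from the design; the backup object is modeled as a plain dict here rather than the real CRD, so this is illustrative, not the plugin's actual code:

```python
# Hypothetical sketch: read the per-volume retention limit from the backup
# CR's annotations. The annotation key is from the design; everything else
# (the dict-shaped Backup object, the helper name) is assumed for illustration.
SNAPSHOT_TO_RETAIN_ANNOTATION = "backup.velero.io/data-mover-snapshot-to-retain"

def snapshots_to_retain(backup):
    """Return the per-volume retention limit; 0 means retain none."""
    annotations = backup.get("metadata", {}).get("annotations", {})
    raw = annotations.get(SNAPSHOT_TO_RETAIN_ANNOTATION)
    try:
        limit = int(raw)
    except (TypeError, ValueError):
        return 0  # missing or malformed: no native snapshot is retained
    return limit if limit > 0 else 0  # non-positive also means retain none

backup = {"metadata": {"annotations": {SNAPSHOT_TO_RETAIN_ANNOTATION: "3"}}}
print(snapshots_to_retain(backup))            # 3
print(snapshots_to_retain({"metadata": {}}))  # 0
```

This mirrors the rule above: unset or non-positive values mean no retention.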
Contributor

Can you clarify what CMP stands for?

Contributor

Is ```backup.velero.io/data-mover-snapshot-to-retain``` for querying purposes only and populated by Velero? Or can users set this value at backup CR creation to modify the behavior per backup?

Should we allow per pvc annotation settings?

Contributor Author

Corrected it; it should be DMP. Thanks for the reminder.

Contributor Author

The annotation is for querying purposes only; we don't expect users to configure it. They should configure the server parameter instead.
Actually, users don't have the opportunity to configure it, as it is added when creating the backup CR and consumed immediately.

Contributor Author

We don't see a use case for configuring the number of retained snapshots per PVC, so it is not supported.

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from 59ab6fb to a28be47 Compare October 24, 2023 00:46
Practically, the DMP creates a snapshot associated object (SAO) for each retained snapshot; it can then list/count the retained snapshots by listing/counting the SAOs through the Kubernetes API at any time after the backup. For some DMPs, Kubernetes objects are created as part of the snapshot creation, so they can choose any of those objects as SAOs. For others, if no Kubernetes objects are necessarily created during the snapshot creation, they can create some Kubernetes objects (e.g., ConfigMaps) purposefully as SAOs. For the Velero CSI plugin, the VolumeSnapshotContent objects will act as SAOs.
To assist with the listing/counting, several labels are applied to SAOs:
- ```velero.io/snapshot-alias-name```: The DMP gives an alias to each SAO. As mentioned above, Velero/DMPs should not expect DMs to keep the snapshots and their associated objects unchanged; e.g., a DM may delete the VS/VSC created by the CSI plugin and create new ones from them (so the new ones also represent the same snapshots). By giving an alias, DMPs are guaranteed to find the SAOs by the alias label even though they have no way to know where the DMs have cloned the SAOs. This also means that DMs should inherit all the labels from the original SAOs
- ```velero.io/snapshot-volume-name```: This label stores the identity of the volume for which a snapshot is created. For example, the value could be a PVC's namespaced name or UID. In this way, the DMP is able to tell all the retained snapshots for a specific volume
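The label-based listing/counting described by the two labels could be sketched like this (illustrative Python only; SAOs are modeled as dicts, and the sample volume IDs and timestamps are made up — only the label keys come from the design):

```python
# Label keys are from the design; the dict-shaped SAOs are an assumption.
ALIAS_LABEL = "velero.io/snapshot-alias-name"
VOLUME_LABEL = "velero.io/snapshot-volume-name"

def retained_snapshots(saos, volume_id):
    """Return the SAOs for one volume, oldest first (by creation timestamp)."""
    matches = [s for s in saos if s.get("labels", {}).get(VOLUME_LABEL) == volume_id]
    return sorted(matches, key=lambda s: s["creationTimestamp"])

saos = [
    {"labels": {ALIAS_LABEL: "a1", VOLUME_LABEL: "ns1/pvc-1"}, "creationTimestamp": "2023-10-02T00:00:00Z"},
    {"labels": {ALIAS_LABEL: "a2", VOLUME_LABEL: "ns1/pvc-1"}, "creationTimestamp": "2023-10-01T00:00:00Z"},
    {"labels": {ALIAS_LABEL: "a3", VOLUME_LABEL: "ns1/pvc-2"}, "creationTimestamp": "2023-10-01T00:00:00Z"},
]
print(len(retained_snapshots(saos, "ns1/pvc-1")))  # 2
```

Because lookup goes through the labels rather than object names, the count stays correct even after a DM clones the SAOs, as long as the labels are inherited.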
Contributor

Do we want a separate label for UID?

Contributor Author

Here we don't want to limit the content of this label's value; we only define what it means and how it is used --- the value could be anything, as long as it uniquely identifies a volume.

I have updated the doc accordingly.

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from a28be47 to 96feb7e Compare October 24, 2023 02:55
- Don’t move snapshots. This is the same as the existing CSI snapshot feature; that is, native snapshots are taken and kept
- Move snapshot data and delete native snapshots. This means that once the data movement completes, the native snapshots will be deleted.
- Move snapshot data and keep X native snapshtos. This means snapshot data is moved first and also several native snapshots will be kept according to users' configuration.
Contributor

Suggested change
- Move snapshot data and keep X native snapshtos. This means snapshot data is moved first and also several native snapshots will be kept according to users' configuration.
- Move snapshot data and keep X native snapshots for each volume. This means snapshot data is moved first and also several native snapshots will be kept according to users' velero server parameter.

Contributor Author

Hmmm, let's keep the "users' configuration" part; the way for users to do the configuration may change. For example, we may add an installation parameter in the future. So let's keep this general statement.

Contributor Author

Done on the "snapshots for each volume" part. Thanks.

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from 96feb7e to 97a1755 Compare October 25, 2023 03:35
DMPs then return the SAOs as ```itemToUpdate``` with their aliases to the Velero backup. In this way, when the DM execution finishes, Velero gets the latest SAOs by their aliases and persists them to the backup storage.
Velero, specifically the backup sync controller, is responsible for syncing the SAOs to the target cluster for restore if they are not there. In this way, it is guaranteed that all the SAOs are available as long as their associated backups are there; in other words, for every restore, DMPs always see the full list of SAOs for all the volumes to be restored.

After the retained snapshots are counted against the limit, and if the limit is exceeded, the DMP needs to make sure the old snapshots are removed before creating the new snapshot; otherwise, the snapshot creation may fail, as some storages have a hard limit on the number of snapshots.
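A minimal sketch of that delete-before-create ordering (hypothetical helper; the `create`/`delete` callbacks stand in for the real storage operations, and the SAO dicts are an assumption):

```python
def snapshot_with_limit(retained, limit, create, delete):
    """Free room under the limit first, then create the new snapshot.

    `retained` is a list of SAO-like dicts carrying a creationTimestamp;
    `create`/`delete` abstract the storage calls. Assumes limit >= 1:
    per the design, a non-positive limit means retain-none, which is a
    separate path.
    """
    if limit < 1:
        raise ValueError("limit must be >= 1; retain-none is handled separately")
    retained = sorted(retained, key=lambda s: s["creationTimestamp"])
    # Delete the oldest snapshots first so creating the new one cannot
    # trip a storage-side hard limit on snapshot count.
    while len(retained) >= limit:
        delete(retained.pop(0))
    retained.append(create())
    return retained

deleted = []
retained = snapshot_with_limit(
    [{"creationTimestamp": "2023-10-01"}, {"creationTimestamp": "2023-10-02"}],
    2,
    lambda: {"creationTimestamp": "2023-10-03"},
    deleted.append,
)
print(len(deleted), len(retained))  # 1 2
```

With a limit of 2 and two snapshots already retained, the oldest is deleted before the new one is created, so the storage never holds more than the limit.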

I understand that the storage snapshot limit will cause snapshot creation to fail. However, if the backup finally fails, deleting the old snapshot first may cause the user to lose a recovery point.

What will be the behavior if the configured global limit number is greater than the number of snapshots the storage supports? Does it make sense to limit the global number to a relatively reasonable value (a number that most storages won't exceed)?

Contributor Author
@Lyndon-Li Lyndon-Li Oct 25, 2023

if the backup finally fails, deleting the old snapshot first may cause the user to lose a recovery point

We delete the native snapshots from an old completed backup. Since the backup is completed, the data mover has also completed. This means a completed backup contains native snapshots + data mover backup data.
So here we delete the native snapshots for that backup only; it won't lose a recovery point. The recovery point will refer to the data mover data.
This approach actually avoids the situation where the snapshot creation fails due to the snapshot limit for the current backup, which is why it deletes the native snapshots of the old backup first.

Contributor Author
@Lyndon-Li Lyndon-Li Oct 25, 2023

What will be the behavior if the configured global limit number is greater than the number of snapshots the storage supports?

The snapshot creation will fail if the number of snapshots exceeds the storage limit. This means users need to be aware of their storage limit and set a number below it.

Contributor Author

does it make sense to limit the global number to a relatively reasonable value (a number that most storages won't exceed)

I am afraid we cannot deduce a number that works for most storages; the snapshot limit is storage specific. Actually, for a storage, there is no technical reason to support one fixed number as the limit; it is a compromise between resource consumption and performance, so generally this number is configurable on the storage side.

Contributor Author

Let me add a part to the doc discussing the storage limit.

Contributor Author

Done on the addition.

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from 97a1755 to a866991 Compare October 25, 2023 04:18
@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch 2 times, most recently from 5e3762a to 5941562 Compare October 25, 2023 04:34
## Retain Native Snapshots
Users are allowed to specify a global limit on the number of native snapshots to be retained for each volume. Then:
- If the number is not specified, or if it is not a positive value, no native snapshot is to be retained
- Otherwise, the value defines the max snapshots to be retained for each volume, a.k.a. the limit. If the limit is exceeded, the oldest retained snapshot will be removed
Collaborator
@anshulahuja98 anshulahuja98 Oct 25, 2023

Can we add a nuance here of setting limits per driver type?
For example, disk.csi.cloud.com might support retaining up to 100 snapshots, while file.csi.cloud.com might only support 10, and a backup might contain both.

Contributor Author

We can add this as a CSI-plugin-specific configuration overwriting the global number, as the CSI driver is not generic across all DMPs.

Contributor Author
@Lyndon-Li Lyndon-Li Oct 26, 2023

I added the below statement to the design:
For the same reason, if a DMP supports snapshots from more than one kind of storage, it may need to set different snapshot limits for volumes from different storages, based on different criteria (e.g., the CSI plugin may need to set different numbers by CSI driver type). In this case, the DMP could overwrite the global limit number based on its own configurations. The configurations are private to the DMP itself since the storage information is only known by the specific DMP, and how the configurations are structured and interpreted is also private to the specific DMP.

This indicates that a DMP-specific configuration is supported to overwrite the global number and that the configuration is managed by the DMP itself.
So we don't need to give a generic configuration structure for all DMPs.
And for the CSI plugin, I don't want to mention the concrete configuration structure for now; the requirement is still obscure, e.g., sometimes we want to differentiate by CSI driver type, sometimes by volume type, etc. Let's wait and see the future requirements after this feature is released, and then come back to design the structure for CSI.

@anshulahuja98 Let me know if you are good with this.
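The overwrite behavior could be sketched as below. This is purely illustrative: the design deliberately leaves the DMP-private configuration structure undefined, so the per-driver map and the driver names here are made-up examples:

```python
def effective_limit(global_limit, driver, per_driver_overrides):
    """Resolve the snapshot limit for a volume: a DMP-specific override
    (here keyed by a hypothetical CSI driver name) wins over the global
    number when present; otherwise the global limit applies."""
    return per_driver_overrides.get(driver, global_limit)

overrides = {"file.csi.cloud.com": 10}  # hypothetical driver name and limit
print(effective_limit(100, "file.csi.cloud.com", overrides))  # 10
print(effective_limit(100, "disk.csi.cloud.com", overrides))  # 100
```

Keeping the override map private to the DMP means each plugin can key it however its storage model requires (driver type, volume type, etc.) without a shared schema.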

Collaborator

This should be good enough for now.
Will keep this comment open for others to have a look. Otherwise consider this resolved for now.

The DMP is responsible for maintaining the native snapshots. Specifically, it is able to list/count the retained snapshots for each volume, and remove the old ones once the limit is exceeded.
Practically, the DMP creates a snapshot associated object (SAO) for each retained snapshot; it can then list/count the retained snapshots by listing/counting the SAOs through the Kubernetes API at any time after the backup. For some DMPs, Kubernetes objects are created as part of the snapshot creation, so they can choose any of those objects as SAOs. For others, if no Kubernetes objects are necessarily created during the snapshot creation, they can create some Kubernetes objects (e.g., ConfigMaps) purposefully as SAOs. For the Velero CSI plugin, the VolumeSnapshotContent objects will act as SAOs.
To assist with the listing/counting, several labels are applied to SAOs:
- ```velero.io/snapshot-alias-name```: The DMP gives an alias to each SAO. As mentioned above, Velero/DMPs should not expect DMs to keep the snapshots and their associated objects unchanged; e.g., a DM may delete the VS/VSC created by the CSI plugin and create new ones from them (so the new ones also represent the same snapshots). By giving an alias, DMPs are guaranteed to find the SAOs by the alias label even though they have no way to know where the DMs have cloned the SAOs. This also means that DMs should inherit all the labels from the original SAOs
Contributor
@blackpiglet blackpiglet Oct 25, 2023

Should the snapshot alias name be an SAO's name or a snapshot name?
Since it's inherited from the original one, what is the original SAO's snapshot alias name?

Contributor Author

```itemToUpdate``` is returned from the DMP; the DMP doesn't exchange the SAO's name with DMs. Since DMs could create the SAOs on their own, the DMP will not be able to know the exact names.

The alias could be anything that uniquely identifies an SAO, e.g., a random UID.
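For instance, assigning a random UID as the alias and stamping it on the SAO's labels could look like this (illustrative only; the helper name and dict-shaped SAO are assumptions, the label key is from the design):

```python
import uuid

ALIAS_LABEL = "velero.io/snapshot-alias-name"

def assign_alias(sao):
    """Give the SAO a unique alias label; DMs that clone or recreate the
    SAO are expected to carry this label over so the DMP can still find it."""
    alias = str(uuid.uuid4())  # any unique value would do per the design
    sao.setdefault("labels", {})[ALIAS_LABEL] = alias
    return alias

sao = {"kind": "VolumeSnapshotContent", "labels": {}}
alias = assign_alias(sao)
print(sao["labels"][ALIAS_LABEL] == alias)  # True
```

The point of the alias is exactly this indirection: the DMP never needs to learn the name a DM chose for a cloned SAO, only the alias label it inherited.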

@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch 4 times, most recently from 004841e to 0fa36d3 Compare October 26, 2023 02:14
@Lyndon-Li Lyndon-Li force-pushed the data-mover-preserve-local-snapshot-design branch from 0fa36d3 to c0878bf Compare October 26, 2023 02:14
@Lyndon-Li Lyndon-Li closed this May 17, 2024