Add design for data mover preserve local snapshot #7002
Conversation
Force-pushed from 3d72d25 to fc7aea4.
Force-pushed from 558b87d to a568e57.
Codecov Report

@@           Coverage Diff            @@
##             main    #7002    +/-  ##
=======================================
- Coverage   61.05%   61.05%   -0.01%
=======================================
  Files         251      252       +1
  Lines       26846    26855       +9
=======================================
+ Hits        16390    16395       +5
- Misses       9303     9305       +2
- Partials     1153     1155       +2
Force-pushed from b57c779 to 59ab6fb.
Can you add the source diagram file export (non-PNG/JPG) to the PR as well? That would make any future edits easier.
- If the number is not specified, or if it is not a positive value, no native snapshot is to be retained
- Otherwise, the value defines the max snapshots to be retained for each volume, a.k.a. the limit. If the limit is exceeded, the oldest retained snapshot will be removed

The limit number is set as an annotation on the backup CR, called ```backup.velero.io/data-mover-snapshot-to-retain```. In this way, CMPs are able to query the number from the backup CR.
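As a rough sketch of how a plugin might read that annotation, assuming a hypothetical helper named `parseRetainLimit` (the function and the treatment of non-positive values follow the design text above; the names are illustrative, not taken from the Velero code base):

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation key defined by the design.
const retainAnnotation = "backup.velero.io/data-mover-snapshot-to-retain"

// parseRetainLimit reads the retain limit from a backup CR's annotations.
// Per the design, a missing, unparsable, or non-positive value means no
// native snapshot is retained, represented here as 0.
func parseRetainLimit(annotations map[string]string) int {
	v, ok := annotations[retainAnnotation]
	if !ok {
		return 0
	}
	n, err := strconv.Atoi(v)
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(parseRetainLimit(map[string]string{retainAnnotation: "3"}))  // 3
	fmt.Println(parseRetainLimit(map[string]string{retainAnnotation: "-1"})) // 0
	fmt.Println(parseRetainLimit(nil))                                       // 0
}
```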
Can you clarify what CMP stands for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ```backup.velero.io/data-mover-snapshot-to-retain``` for querying purposes only and populated by Velero, or can users set this value at backup CR creation to modify the behavior per backup?
Should we allow per-PVC annotation settings?
Corrected it; it should be DMP. Thanks for the reminder.
The annotation is for querying purposes only; we don't expect users to configure it. They should configure the server parameter instead.
Actually, users don't have an opportunity to configure it, as it is added when the backup CR is created and consumed immediately.
We don't see any usability in configuring the number of retained snapshots per PVC, so it is not supported.
design/volume-snapshot-data-movement/volume-snapshot-data-movement.md (outdated comment, resolved)
Force-pushed from 59ab6fb to a28be47.
design/volume-snapshot-data-movement/volume-snapshot-data-movement.md (outdated comment, resolved)
Practically, the DMP creates a snapshot associated object (SAO) for each retained snapshot; it can then list/count the retained snapshots by listing/counting the SAOs through the Kubernetes API any time after the backup. For some DMPs, Kubernetes objects are created as part of the snapshot creation, so they can choose any of those objects as SAOs. For others, if no Kubernetes objects are necessarily created during snapshot creation, they can create Kubernetes objects (e.g., ConfigMaps) purposefully as SAOs. For the Velero CSI plugin, the VolumeSnapshotContent objects will act as SAOs.
To assist in the listing/counting, several labels are applied to SAOs:
- ```velero.io/snapshot-alias-name```: The DMP gives an alias to each SAO. As mentioned above, Velero/DMP should not expect DMs to keep the snapshots and their associated objects unchanged; e.g., a DM may delete the VS/VSC created by the CSI plugin and create new ones from them (so the new ones still represent the same snapshots). By giving an alias, DMPs are guaranteed to find the SAOs by the alias label even though they have no way to know where the DMs have cloned the SAOs. This also means that DMs should inherit all the labels from the original SAOs
- ```velero.io/snapshot-volume-name```: This label stores the identity of the volume for which a snapshot is created. For example, the value could be a PVC's namespaced name or UID. In this way, the DMP is able to tell all the retained snapshots for a specific volume
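A minimal sketch of the per-volume counting described above, using a plain struct in place of real Kubernetes objects (the `sao` type and `countByVolume` helper are illustrative assumptions, not Velero API types):

```go
package main

import "fmt"

// Label keys defined by the design.
const (
	aliasLabel  = "velero.io/snapshot-alias-name"
	volumeLabel = "velero.io/snapshot-volume-name"
)

// sao is a minimal stand-in for a snapshot associated object
// (e.g. a VolumeSnapshotContent); only its labels matter here.
type sao struct {
	labels map[string]string
}

// countByVolume groups SAOs by the volume-name label, mimicking what
// a DMP would do after listing SAOs through the Kubernetes API.
func countByVolume(saos []sao) map[string]int {
	counts := map[string]int{}
	for _, s := range saos {
		if vol, ok := s.labels[volumeLabel]; ok {
			counts[vol]++
		}
	}
	return counts
}

func main() {
	saos := []sao{
		{labels: map[string]string{volumeLabel: "ns1/pvc-a", aliasLabel: "a1"}},
		{labels: map[string]string{volumeLabel: "ns1/pvc-a", aliasLabel: "a2"}},
		{labels: map[string]string{volumeLabel: "ns1/pvc-b", aliasLabel: "b1"}},
	}
	fmt.Println(countByVolume(saos)["ns1/pvc-a"]) // 2
}
```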
Do we want a separate label for UID?
Here we don't want to limit the content of this label's value; we only define what it means and how it is used --- the value could be anything as long as it uniquely identifies a volume.
I have updated the doc accordingly.
Force-pushed from a28be47 to 96feb7e.
- Don't move snapshots. This is the same as the existing CSI snapshot feature; that is, native snapshots are taken and kept.
- Move snapshot data and delete native snapshots. This means that once the data movement completes, the native snapshots will be deleted.
- Move snapshot data and keep X native snapshots. This means snapshot data is moved first and also several native snapshots will be kept according to users' configuration.
Suggested change:
Before: - Move snapshot data and keep X native snapshots. This means snapshot data is moved first and also several native snapshots will be kept according to users' configuration.
After: - Move snapshot data and keep X native snapshots for each volume. This means snapshot data is moved first and also several native snapshots will be kept according to users' velero server parameter.
Hmmm, let's keep the "users' configuration" part; the way for users to do the configuration may change. For example, we may add an installation parameter in the future. So let's keep this general statement.
Done on the "snapshots for each volume" part. Thanks.
Force-pushed from 96feb7e to 97a1755.
DMPs then return the SAOs as ```itemToUpdate``` with their aliases to the Velero backup; in this way, when the DM execution finishes, Velero gets the latest SAOs by their aliases and persists them to the backup storage.
Velero, specifically the backup sync controller, is responsible for syncing the SAOs to the target cluster for restore if they are not there. In this way, it is guaranteed that all the SAOs are available as long as their associated backups are there; in other words, for every restore, DMPs always see the full list of SAOs for all the volumes to be restored.

After the retained snapshots are counted against the limit, and if the limit is exceeded, the DMP needs to make sure the old snapshots are removed before creating the new snapshot; otherwise, the snapshot creation may fail, as some storages have a hard limit on the number of snapshots.
I understand that the storage snapshot limit will cause snapshot creation to fail. But if the backup finally fails, deleting the old snapshot first may cause the user to lose a recovery point.
What will be the behavior if the configured global limit number > the number of snapshots the storage supports? Does it make sense to cap the global number at a relatively reasonable value (a number that most storages won't exceed)?
> if the backup finally fails, deleting the old snapshot firstly may cause the user to lose a recovery point

We delete the native snapshots from an old completed backup. Since that backup is completed, the data mover has also completed; that is, a completed backup contains native snapshots + data mover backup data.
So here we delete the native snapshots for that backup only; it won't lose a recovery point, as the recovery point will refer to the data mover data.
This approach actually avoids the situation where snapshot creation fails due to the snapshot limit during the current backup, which is why the native snapshot of the old backup is deleted first.
> What will be the behavior if the configured global limit number > the storage supported snapshot numbers?

The snapshot creation will fail if the number of snapshots exceeds the storage limit. This means users need to be aware of their storage limit and set a number below it.
> does it make sense to limit the global number to a relative reasonable value (a number that most storages won't exceed)

I am afraid we cannot deduce a number that works for most storages; the snapshot limit is storage specific. Actually, for a given storage, there is no technical reason to support one particular fixed number as the limit: it is a compromise between resource consumption and performance, so generally this number is configurable on the storage side.
Let me add a section to the doc discussing the storage limit.
Done on the addition.
Force-pushed from 97a1755 to a866991.
Force-pushed from 5e3762a to 5941562.
## Retain Native Snapshots
Users are allowed to specify a global limit number of native snapshots to be retained for each volume; then:
- If the number is not specified, or if it is not a positive value, no native snapshot is to be retained
- Otherwise, the value defines the max snapshots to be retained for each volume, a.k.a. the limit. If the limit is exceeded, the oldest retained snapshot will be removed
Can we add a nuance here of setting limits per driver type?
For example, disk.csi.cloud.com might support retaining up to 100 snapshots, while file.csi.cloud.com might only support 10, and a backup might contain both.
We can add this as a CSI-plugin-specific configuration overwriting the global number, as the CSI driver concept is not generic to all DMPs.
I added the below statement to the design:
For the same reason, if a DMP supports snapshots from more than one kind of storage, it may need to set different snapshot limits for volumes from different storages, based on different criteria (e.g., the CSI plugin may need to set different numbers by CSI driver type). In this case, the DMP could overwrite the global limit number based on its own configurations. The configurations are private to the DMP itself since the storage information is only known by the specific DMP, and how the configurations are structured and interpreted is also private to the specific DMP.
This indicates that a DMP-specific configuration is supported to overwrite the global number and that the configuration is managed by the DMP itself, so we don't need to give a generic configuration structure for all DMPs.
And for the CSI plugin, I don't want to mention the concrete configuration structure for now; the requirement is still obscure, e.g., sometimes we want to differentiate by CSI driver type, sometimes by volume type, etc. Let's wait and see the future requirements after this feature is released, and then come back to design the structure for CSI.
@anshulahuja98 Let me know if you are good with this.
This should be good enough for now.
Will keep this comment open for others to have a look. Otherwise consider this resolved for now.
The DMP is responsible for maintaining the native snapshots. Specifically, it must be able to list/count the retained snapshots for each volume and remove the old ones once the limit is exceeded.
Practically, the DMP creates a snapshot associated object (SAO) for each retained snapshot; it can then list/count the retained snapshots by listing/counting the SAOs through the Kubernetes API any time after the backup. For some DMPs, Kubernetes objects are created as part of the snapshot creation, so they can choose any of those objects as SAOs. For others, if no Kubernetes objects are necessarily created during snapshot creation, they can create Kubernetes objects (e.g., ConfigMaps) purposefully as SAOs. For the Velero CSI plugin, the VolumeSnapshotContent objects will act as SAOs.
To assist in the listing/counting, several labels are applied to SAOs:
- ```velero.io/snapshot-alias-name```: The DMP gives an alias to each SAO. As mentioned above, Velero/DMP should not expect DMs to keep the snapshots and their associated objects unchanged; e.g., a DM may delete the VS/VSC created by the CSI plugin and create new ones from them (so the new ones still represent the same snapshots). By giving an alias, DMPs are guaranteed to find the SAOs by the alias label even though they have no way to know where the DMs have cloned the SAOs. This also means that DMs should inherit all the labels from the original SAOs
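The alias-based lookup could look like the following sketch. The `sao` struct and the `findByAlias` helper are hypothetical stand-ins for a label-selector list call against the Kubernetes API; the point is that clones inheriting the alias label remain discoverable even when the DMP never learns the clone's object name:

```go
package main

import "fmt"

// Alias label key defined by the design.
const aliasLabel = "velero.io/snapshot-alias-name"

// sao is a minimal stand-in for a snapshot associated object.
type sao struct {
	name   string
	labels map[string]string
}

// findByAlias returns every SAO carrying the given alias label value,
// mimicking a label-selector list through the Kubernetes API.
func findByAlias(saos []sao, alias string) []sao {
	var out []sao
	for _, s := range saos {
		if s.labels[aliasLabel] == alias {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	saos := []sao{
		{name: "vsc-original", labels: map[string]string{aliasLabel: "uid-1234"}},
		// A DM-created clone inherits the labels of the original SAO,
		// so the alias lookup still finds it under an unknown name.
		{name: "vsc-cloned-by-dm", labels: map[string]string{aliasLabel: "uid-1234"}},
		{name: "vsc-other", labels: map[string]string{aliasLabel: "uid-9999"}},
	}
	fmt.Println(len(findByAlias(saos, "uid-1234"))) // 2
}
```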
Should the snapshot alias name be an SAO's name or a snapshot name?
Since it's inherited from the original one, what is the original SAO's snapshot alias name?
```itemToUpdate``` is returned from the DMP; the DMP doesn't exchange the SAO's name with DMs. Since DMs could create the SAO on their own, the DMP will not be able to know the exact name.
The alias could be anything that uniquely identifies an SAO, e.g., a random UID.
Force-pushed from 004841e to 0fa36d3.
Signed-off-by: Lyndon-Li <[email protected]>
Force-pushed from 0fa36d3 to c0878bf.
Design for #6240. Add design for data mover preserve local snapshot