Enhance backup ready conditions #619
Conversation
Force-pushed from 088a2a8 to 82aa681
/invite @unmarshall
@unmarshall You have an open pull request review invite, please check.
@seshachalam-yv thanks for your review. I've addressed your comment, in a slightly different way than you suggested, but overall readability has improved. PTAL.
Apart from one small nitpick, overall, this PR looks great to me. I appreciate the time and effort you've put into addressing all the comments and suggestions. I am impressed with the changes you've made in this PR. It not only enhances the code's readability but also provides a clear and logical flow. Excellent job! 😍
@seshachalam-yv I've addressed your follow-up suggestion as well. Thanks for the detailed suggestions!
/test pull-etcd-druid-e2e-kind
```go
// Fetch snapshot leases
fullSnapshotLease, err := a.fetchLease(ctx, etcd.GetFullSnapshotLeaseName(), etcd.Namespace)
if err != nil {
	return createBackupConditionResult(
```
There is no real need for the `createBackupConditionResult` function. You do not save on the number of lines of code; in fact, you have more lines of code :) and there is no readability improvement over just creating an instance of a struct.
`conType: druidv1alpha1.ConditionTypeBackupReady` is common to all backupReady condition results, so it made sense to pull it out into a separate function, just to avoid adding the `conType` every single time when returning. Of course, the previous method was to create a default result at the beginning of the function and then simply change the values when returning, but @seshachalam-yv pointed out that it was not the most readable, since one has to check both the default result and the changed values to figure out the final result being returned.
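The trade-off under discussion can be sketched as follows. This is a minimal, hypothetical sketch (the `Result` fields and string values are assumptions for illustration, not the actual druid types), showing how the common condition type is factored into a helper rather than repeated in a struct literal at each return site:

```go
package main

import "fmt"

// Result is a hypothetical condition result with the fields under discussion.
type Result struct {
	ConType string
	Status  string
	Reason  string
	Message string
}

// createBackupConditionResult factors out the condition type common to every
// BackupReady result, so callers do not repeat it on each return.
func createBackupConditionResult(status, reason, message string) Result {
	return Result{
		ConType: "BackupReady", // common to all backupReady condition results
		Status:  status,
		Reason:  reason,
		Message: message,
	}
}

func main() {
	r := createBackupConditionResult("Unknown", "Unknown", "Unable to fetch delta snap lease")
	fmt.Println(r.ConType, r.Status)
}
```

The reviewer's alternative is simply writing `Result{ConType: ..., Status: ..., Reason: ..., Message: ...}` at each return site, which costs one extra field per return but needs no helper.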
```go
if err != nil {
	return createBackupConditionResult(
		druidv1alpha1.ConditionUnknown, Unknown,
		fmt.Sprintf("Unable to fetch delta snap lease. %s", err.Error()),
```
Do you wish to include `err.Error()` as part of the message, or just log it using `logger.Error`? How large is `err.Error()`?
A reason of `Unknown` is quite vague. The current set of conditions that I see on a typical etcd resource is as follows:
```yaml
conditions:
- lastTransitionTime: "2023-07-05T03:55:06Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: All members are ready
  reason: AllMembersReady
  status: "True"
  type: AllMembersReady
- lastTransitionTime: "2023-07-05T04:00:02Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: Snapshot backup succeeded
  reason: BackupSucceeded
  status: "True"
  type: BackupReady
- lastTransitionTime: "2023-07-05T03:55:06Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: The majority of ETCD members is ready
  reason: Quorate
  status: "True"
  type: Ready
```
As you can see, the `reason` clearly indicates what that condition is for. So having a reason of `Unknown` would be unqualified and therefore very hard to reason about, disambiguate, or even process later. In your proposal, one has to look at the message to learn more about the condition, which is a departure from the existing set of conditions.
The `Check` method returns a single `Result`. If there are problems fetching both leases, then you will always return the condition with the message for delta, thereby masking the full-snapshot lease condition message. Would it make sense to have different conditions, one for delta and another for full snapshot?
This also allows you to separately capture a message when you are unable to compute the full snapshot duration, which would then not affect the condition for the delta snapshot.
What happens when the lease itself is `NotFound`? Should you not have a different message indicating that the lease itself is missing?
> Do you wish to include `err.Error()` as part of the message or just log it using `logger.Error`? How large is `err.Error()`?
It's safer to add the error to the condition, just so that it's visible to an operator without having to sift through logs. Even gardener shoot conditions, for instance, store the error message in the condition, which is printed to the dashboard as well. It's quite helpful for operators and users alike.
> As you can see, the `reason` clearly indicates what that condition is for. So having a reason of `Unknown` would be unqualified and therefore very hard to reason about, disambiguate, or even process later. In your proposal, one has to look at the message to learn more about the condition, which is a departure from the existing set of conditions.
I'll try to add more meaningful reason strings then.
> The `Check` method returns a single `Result`. If there are problems fetching both leases, then you will always return the condition with the message for delta, thereby masking the full-snapshot lease condition message. Would it make sense to have different conditions, one for delta and another for full snapshot?
I've handled such cases specifically so that the full snapshot lease error does not get masked by delta snapshot lease error. If both leases have errors, both are captured in the condition, such as this and this.
Looks like only the case of failing to fetch the leases needs to be handled more robustly. I'll handle this then, thanks.
> What happens when the lease itself is `NotFound`? Should you not have a different message indicating that the lease itself is missing?
Right now, it's a blanket message of `Unable to fetch full/delta snap lease: <error-string>`, which is still technically correct. The error string holds the reason as to why the fetch failed, and it will specify that the lease was not found. If you want, I can separate out the lease-not-found case and use a separate `Reason` string for that, like `LeaseNotFound`. WDYT?
```go
isFullSnapshotLeaseStale := isLeaseStale(fullSnapshotLeaseRenewTime, fullSnapshotLeaseRenewalGracePeriod)
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

if isFullSnapshotLeaseStale && !isDeltaSnapshotLeaseStale {
```
If you create separate conditions for full and delta snapshots, then these checks get simplified, and it lets you capture these two conditions independently. `BackupReady` can then be a derived condition which looks at the delta and full snapshot conditions, if at all you require a single condition for all backups.
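The derived-condition idea could be sketched as follows; the derivation rule and names are assumptions for illustration (ready only when both sub-conditions are ready, unknown if either is unknown):

```go
package main

import "fmt"

// ConditionStatus mirrors the usual Kubernetes condition status values.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// deriveBackupReady computes a single BackupReady condition from separate
// full-snapshot and delta-snapshot conditions: ready only when both are
// ready, unknown if either is unknown, failed otherwise.
func deriveBackupReady(full, delta ConditionStatus) ConditionStatus {
	switch {
	case full == ConditionTrue && delta == ConditionTrue:
		return ConditionTrue
	case full == ConditionUnknown || delta == ConditionUnknown:
		return ConditionUnknown
	default:
		return ConditionFalse
	}
}

func main() {
	fmt.Println(deriveBackupReady(ConditionTrue, ConditionTrue))
	fmt.Println(deriveBackupReady(ConditionTrue, ConditionUnknown))
}
```

With this shape, the full and delta conditions carry their own reasons and messages, so neither masks the other.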
```go
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

// Delta snapshot lease is stale, while staleness of full snapshot lease cannot be determined yet
if isDeltaSnapshotLeaseStale && wasFullSnapshotLeaseCreatedRecently {
```
This has become quite complicated. It can be simplified by just having two conditions. Then we need only two functions overall: one to properly update the full snapshot lease status and another to update the delta snapshot lease status.
```go
// even though the full snapshot may have succeeded within the required time, we must still wait
// for delta snapshotting to begin to consider the backups as healthy, to maintain the given RPO.
```
I'm not too sure I agree with this. I think that if a full snapshot has been taken within the `deltaSnapshotRenewalGracePeriod`, then the backup status should be `BackupSucceeded`, as this still maintains our RPO for that instant and makes semantic sense. WDYT?
I generally have an issue with having a single condition for delta + full snapshot backup. It will become a LOT easier if we have separate conditions.
That's the reason we pass the `deltaSnapshotRenewalGracePeriod` as `5*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration`: to allow the backup sidecar that much time to start delta snapshotting. It depends on how we define RPO, and right now RPO loosely means the delta snapshot period (1x). I've removed the mention of RPO in the comment to avoid any ambiguity, since we still don't define an official RPO for etcds managed by druid, so it doesn't make sense to bake that into the code until we have more clarity.
/hold
@shreyas-s-rao: The following tests failed:
Full PR test history. Your PR dashboard. Command help for this repository. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@shreyas-s-rao You need to rebase this pull request on the latest master branch. Please check.
After an out-of-band discussion amongst myself, @unmarshall, @seshachalam-yv and @aaronfern, we concluded that it is not simple to handle all cases of successful, failed, skipped, and missed snapshots by etcd-backup-restore, as well as missed renewals of the snapshot lease. Instead, we will solve this holistically as part of #702, where the EtcdMember … This PR will be closed in favour of #729, which focuses on fixing the problem of the hardcoded value of …
How to categorize this PR?
/area backup
/area usability
/kind enhancement
What this PR does / why we need it:
- `BackupReady` condition reason to `BackupFailed` no longer depends on the previous condition (previously, setting this depended on whether the previous condition was failed or unknown, which meant that if the previous condition was succeeded, we would never set the condition to failed unless either of the leases were recreated. This behavior is now fixed and made fully deterministic)
- … `BackupReady` condition
- … `24h` for calculating staleness of the full snapshot lease, by computing the "schedule duration" (duration between two activations of the cron, assuming activations are equal durations apart) from the full snapshot cron schedule

Which issue(s) this PR fixes:
Fixes #618
Special notes for your reviewer:
Release note: