Enhance backup ready conditions #619
Conversation
Force-pushed from 088a2a8 to 82aa681
/invite @unmarshall
@unmarshall You have an open pull request review invite, please check.
@seshachalam-yv thanks for your review. I've addressed your comment, in a slightly different way than you suggested, but overall readability has improved. PTAL.
Apart from one small nitpick, overall, this PR looks great to me. I appreciate the time and effort you've put into addressing all the comments and suggestions. I am impressed with the changes you've made in this PR. It not only enhances the code's readability but also provides a clear and logical flow. Excellent job! 😍
@seshachalam-yv I've addressed your follow-up suggestion as well. Thanks for the detailed suggestions!
/test pull-etcd-druid-e2e-kind
```go
// Fetch snapshot leases
fullSnapshotLease, err := a.fetchLease(ctx, etcd.GetFullSnapshotLeaseName(), etcd.Namespace)
if err != nil {
	return createBackupConditionResult(
```
There is no real need for the `createBackupConditionResult` function. You do not save on the number of lines of code; in fact, you have more lines of code :) and there is no readability improvement over just creating an instance of a struct.
`conType: druidv1alpha1.ConditionTypeBackupReady` is common to all backupReady condition results, so it made sense to pull it out into a separate function, just to avoid adding the `conType` every single time when returning. Of course, the previous method was to create a default result at the beginning of the function and then simply change the values when returning, but @seshachalam-yv pointed out that it was not the most readable, since one has to check both the default result and the changed values to figure out the final result being returned.
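The trade-off under discussion can be sketched as follows. This is a minimal, hypothetical sketch (the `Result` fields and string values are assumptions for illustration, not the actual druid types), showing how the common condition type is factored into a helper rather than repeated in a struct literal at each return site:

```go
package main

import "fmt"

// Result is a hypothetical condition result with the fields under discussion.
type Result struct {
	ConType string
	Status  string
	Reason  string
	Message string
}

// createBackupConditionResult factors out the condition type common to every
// BackupReady result, so callers do not repeat it on each return.
func createBackupConditionResult(status, reason, message string) Result {
	return Result{
		ConType: "BackupReady", // common to all backupReady condition results
		Status:  status,
		Reason:  reason,
		Message: message,
	}
}

func main() {
	r := createBackupConditionResult("Unknown", "Unknown", "Unable to fetch delta snap lease")
	fmt.Println(r.ConType, r.Status)
}
```

The reviewer's alternative is simply writing `Result{ConType: ..., Status: ..., Reason: ..., Message: ...}` at each return site, which costs one extra field per return but needs no helper.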
```go
if err != nil {
	return createBackupConditionResult(
		druidv1alpha1.ConditionUnknown, Unknown,
		fmt.Sprintf("Unable to fetch delta snap lease. %s", err.Error()),
```
Do you wish to include `err.Error()` as part of the message, or just log it using `logger.Error`? How large is `err.Error()`?
A reason of `Unknown` is quite vague. The current set of conditions that I see on a typical etcd resource is as follows:
```yaml
conditions:
- lastTransitionTime: "2023-07-05T03:55:06Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: All members are ready
  reason: AllMembersReady
  status: "True"
  type: AllMembersReady
- lastTransitionTime: "2023-07-05T04:00:02Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: Snapshot backup succeeded
  reason: BackupSucceeded
  status: "True"
  type: BackupReady
- lastTransitionTime: "2023-07-05T03:55:06Z"
  lastUpdateTime: "2023-07-05T05:02:52Z"
  message: The majority of ETCD members is ready
  reason: Quorate
  status: "True"
  type: Ready
```
As you can see, the `reason` clearly indicates what that condition is for. So having a reason of `Unknown` would be unqualified and therefore very hard to reason about, disambiguate, or even process later. In your proposal, one has to look at the message to learn more about the condition, which is a departure from the existing set of conditions.
The `Check` method returns a single `Result`. If there are problems fetching both leases, then you will always return the condition with the message for delta, thereby masking the full-snapshot lease condition message. Would it make sense to have different conditions, one for delta and another for full snapshot?
This also allows you to separately capture a message when you are unable to compute the full snapshot duration, which would then not affect the condition for the delta snapshot.
What happens when the lease itself is `NotFound`? Should you not have a different message indicating that the lease itself is missing?
> Do you wish to include `err.Error()` as part of the message or just log it using `logger.Error`? How large is `err.Error()`?
It's safer to add the error to the condition, just so that it's visible to an operator without having to sift through logs. Even gardener shoot conditions, for instance, store the error message in the condition, which is printed to the dashboard as well. It's quite helpful for operators and users alike.
> As you can see, the `reason` clearly indicates what that condition is for. So having a reason of `Unknown` would be unqualified and therefore very hard to reason about, disambiguate, or even process later. In your proposal, one has to look at the message to learn more about the condition, which is a departure from the existing set of conditions.
I'll try to add more meaningful reason strings then.
> The `Check` method returns a single `Result`. If there are problems fetching both leases, then you will always return the condition with the message for delta, thereby masking the full-snapshot lease condition message. Would it make sense to have different conditions, one for delta and another for full snapshot?
I've handled such cases specifically so that the full snapshot lease error does not get masked by delta snapshot lease error. If both leases have errors, both are captured in the condition, such as this and this.
Looks like only the case of failing to fetch the leases needs to be handled more robustly. I'll handle this then, thanks.
> What happens when the lease itself is `NotFound`? Should you not have a different message indicating that the lease itself is missing?
Right now, it's a blanket message of `Unable to fetch full/delta snap lease: <error-string>`, which is still technically correct. The error string holds the reason as to why the fetch failed, and it will specify that the lease was not found. If you want, I can separate out the lease-not-found case and use a separate `Reason` string for that, like `LeaseNotFound`. WDYT?
```go
isFullSnapshotLeaseStale := isLeaseStale(fullSnapshotLeaseRenewTime, fullSnapshotLeaseRenewalGracePeriod)
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

if isFullSnapshotLeaseStale && !isDeltaSnapshotLeaseStale {
```
If you create separate conditions for full and delta snapshots, then these checks get simplified, and it lets you capture these two conditions independently. `BackupReady` can then be a derived condition which looks at the delta and full snapshot conditions, if at all you require a single condition for all backups.
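The derived-condition idea could be sketched as follows; the derivation rule and names are assumptions for illustration (ready only when both sub-conditions are ready, unknown if either is unknown):

```go
package main

import "fmt"

// ConditionStatus mirrors the usual Kubernetes condition status values.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// deriveBackupReady computes a single BackupReady condition from separate
// full-snapshot and delta-snapshot conditions: ready only when both are
// ready, unknown if either is unknown, failed otherwise.
func deriveBackupReady(full, delta ConditionStatus) ConditionStatus {
	switch {
	case full == ConditionTrue && delta == ConditionTrue:
		return ConditionTrue
	case full == ConditionUnknown || delta == ConditionUnknown:
		return ConditionUnknown
	default:
		return ConditionFalse
	}
}

func main() {
	fmt.Println(deriveBackupReady(ConditionTrue, ConditionTrue))
	fmt.Println(deriveBackupReady(ConditionTrue, ConditionUnknown))
}
```

With this shape, the full and delta conditions carry their own reasons and messages, so neither masks the other.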
```go
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

// Delta snapshot lease is stale, while staleness of full snapshot lease cannot be determined yet
if isDeltaSnapshotLeaseStale && wasFullSnapshotLeaseCreatedRecently {
```
This has become quite complicated. It can be simplified by just having two conditions. Then we need only two functions overall: one to properly update the full snapshot lease status and another to update the delta snapshot lease status.
```go
// even though the full snapshot may have succeeded within the required time, we must still wait
// for delta snapshotting to begin to consider the backups as healthy, to maintain the given RPO.
```
I'm not too sure I agree with this. I think that if a full snapshot has been taken within the `deltaSnapshotRenewalGracePeriod`, then the backup status should be `BackupSucceeded`, as this still maintains our RPO for that instant and makes semantic sense. WDYT?
I generally have an issue with having a single condition for delta + full snapshot backup. It will become a LOT easier if we have separate conditions.
That's the reason we pass the `deltaSnapshotRenewalGracePeriod` as `5*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration`: to allow the backup sidecar that much time to start delta snapshotting. It depends on how we define RPO, and right now RPO loosely means the delta snapshot period (1x). I've removed the mention of RPO in the comment to avoid any ambiguity, since we still don't define an official RPO for etcds managed by druid, so it doesn't make sense to bake that into the code until we have more clarity.
/hold
@shreyas-s-rao: The following tests failed:
Full PR test history. Your PR dashboard. Command help for this repository. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@shreyas-s-rao You need to rebase this pull request on the latest master branch. Please check.
After an out-of-band discussion amongst myself, @unmarshall, @seshachalam-yv and @aaronfern, we concluded that it is not simple to handle all cases of successful, failed, skipped, and missed snapshots by etcd-backup-restore, as well as missed renewals of the snapshot lease. Instead, we will solve this holistically as part of #702, where the EtcdMember … This PR will be closed in favour of #729, which focuses on fixing the problem of the hardcoded value of …
How to categorize this PR?
/area backup
/area usability
/kind enhancement
What this PR does / why we need it:
- `BackupReady` condition reason to `BackupFailed` no longer depends on the previous condition (previously, setting this depended on whether the previous condition was failed or unknown, which meant that if the previous condition was succeeded, we would never set the condition to failed unless either of the leases were recreated. This behavior is now fixed and made fully deterministic)
- … `BackupReady` condition
- … `24h` for calculating staleness of the full snapshot lease, by computing the "schedule duration" (duration between two activations of the cron, assuming activations are equal durations apart) from the full snapshot cron schedule

Which issue(s) this PR fixes:
Fixes #618
Special notes for your reviewer:
Release note: