Enhance backup ready conditions #619

Closed
2 changes: 1 addition & 1 deletion Makefile
@@ -72,7 +72,7 @@ install: manifests
kubectl apply -f config/crd/bases

# Deploy controller in the configured Kubernetes cluster in ~/.kube/config
.PHONY: deploy
.PHONY: deploy-via-kustomize
deploy-via-kustomize: manifests $(KUSTOMIZE)
kubectl apply -f config/crd/bases
kustomize build config/default | kubectl apply -f -
2 changes: 1 addition & 1 deletion go.mod
@@ -12,6 +12,7 @@ require (
github.com/onsi/ginkgo/v2 v2.6.1
github.com/onsi/gomega v1.24.2
github.com/prometheus/client_golang v1.14.0
github.com/robfig/cron/v3 v3.0.1
go.uber.org/zap v1.24.0
golang.org/x/exp v0.0.0-20230213192124-5e25df0256eb
gopkg.in/yaml.v2 v2.4.0
@@ -92,7 +93,6 @@ require (
github.com/prometheus/client_model v0.3.0 // indirect
github.com/prometheus/common v0.37.0 // indirect
github.com/prometheus/procfs v0.8.0 // indirect
github.com/robfig/cron/v3 v3.0.1 // indirect
github.com/russross/blackfriday/v2 v2.1.0 // indirect
github.com/sirupsen/logrus v1.8.1 // indirect
github.com/spf13/afero v1.8.2 // indirect
4 changes: 2 additions & 2 deletions pkg/health/condition/builder.go
@@ -23,7 +23,7 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// skipMergeConditions contain the list of conditions we dont want to add to the list if not recalculated
// skipMergeConditions contain the list of conditions we don't want to add to the list if not recalculated
var skipMergeConditions = map[druidv1alpha1.ConditionType]struct{}{
druidv1alpha1.ConditionTypeReady: {},
druidv1alpha1.ConditionTypeAllMembersReady: {},
@@ -111,7 +111,7 @@ func (b *defaultBuilder) Build(replicas int32) []druidv1alpha1.Condition {
if condition.Status == "" {
condition.Status = druidv1alpha1.ConditionUnknown
}
condition.Reason = ConditionNotChecked
condition.Reason = "ClusterScaledToZero"
condition.Message = "etcd cluster has been scaled down"
} else {
condition.Status = res.Status()
267 changes: 186 additions & 81 deletions pkg/health/condition/check_backup_ready.go
@@ -20,6 +20,7 @@ import (
"time"

druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
"github.com/gardener/etcd-druid/pkg/utils"
coordinationv1 "k8s.io/api/coordination/v1"

"k8s.io/apimachinery/pkg/types"
@@ -37,101 +38,205 @@ const (
BackupFailed string = "BackupFailed"
// Unknown is a constant that means that the etcd backup status is currently not known
Unknown string = "Unknown"
// ConditionNotChecked is a constant that means that the etcd backup status has not been updated or rechecked
ConditionNotChecked string = "ConditionNotChecked"
)

func (a *backupReadyCheck) Check(ctx context.Context, etcd druidv1alpha1.Etcd) Result {
//Default case
result := &result{
conType: druidv1alpha1.ConditionTypeBackupReady,
status: druidv1alpha1.ConditionUnknown,
reason: Unknown,
message: "Cannot determine etcd backup status",
}

// Special case of etcd not being configured to take snapshots
// Do not add the BackupReady condition if backup is not configured
if etcd.Spec.Backup.Store == nil || etcd.Spec.Backup.Store.Provider == nil || len(*etcd.Spec.Backup.Store.Provider) == 0 {
return nil
}

//Fetch snapshot leases
var (
fullSnapErr, incrSnapErr error
fullSnapLease = &coordinationv1.Lease{}
deltaSnapLease = &coordinationv1.Lease{}
// Fetch snapshot leases
fullSnapshotLease, fullSnapshotLeaseErr := a.fetchLease(ctx, etcd.GetFullSnapshotLeaseName(), etcd.Namespace)
deltaSnapSnapshotLease, deltaSnapshotLeaseErr := a.fetchLease(ctx, etcd.GetDeltaSnapshotLeaseName(), etcd.Namespace)

if fullSnapshotLeaseErr != nil && deltaSnapshotLeaseErr != nil {
return createBackupConditionResult(
druidv1alpha1.ConditionUnknown, Unknown,
fmt.Sprintf("Unable to fetch both full and delta snap leases. %s\n%s", fullSnapshotLeaseErr.Error(), deltaSnapshotLeaseErr.Error()),
)
}
if fullSnapshotLeaseErr != nil {
return createBackupConditionResult(
Contributor

There is no real need for the createBackupConditionResult function. You do not save on lines of code (in fact, you end up with more :)), and there is no readability improvement over just creating an instance of the struct.

Contributor Author

conType: druidv1alpha1.ConditionTypeBackupReady is common to all backupReady condition result, so it made sense to pull it out into a separate function, just to avoid adding the conType every single time when returning. Of course, the previous method was to create a default result at the beginning of the function and then simply change the values when returning, but @seshachalam-yv pointed out that it was not the most readable, since one has to check the default result as well as the changed values to figure out the final result being returned.
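
The two styles weighed in this thread can be sketched roughly as follows. This is a standalone illustration rather than the PR's code: the `result` struct is reduced to plain strings, and `checkWithDefault`, `checkWithHelper`, and `newBackupResult` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// result is a simplified stand-in for the condition result in the diff.
// Plain strings replace the druidv1alpha1 types (an assumption for brevity).
type result struct {
	conType string
	status  string
	reason  string
	message string
}

const conTypeBackupReady = "BackupReady"

// checkWithDefault shows the earlier style: build a default result up front
// and overwrite individual fields before returning.
func checkWithDefault(leasesFetched bool) *result {
	res := &result{
		conType: conTypeBackupReady,
		status:  "Unknown",
		reason:  "Unknown",
		message: "Cannot determine etcd backup status",
	}
	if !leasesFetched {
		// The reader has to look back at the default to know what is returned.
		return res
	}
	res.status = "True"
	res.reason = "BackupSucceeded"
	res.message = "Snapshot backup succeeded"
	return res
}

// newBackupResult (hypothetical name) sets conType in one place, so every
// return site spells out status, reason, and message explicitly.
func newBackupResult(status, reason, message string) *result {
	return &result{conType: conTypeBackupReady, status: status, reason: reason, message: message}
}

// checkWithHelper shows the helper-based style used in the new code.
func checkWithHelper(leasesFetched bool) *result {
	if !leasesFetched {
		return newBackupResult("Unknown", "Unknown", "Unable to fetch snapshot leases")
	}
	return newBackupResult("True", "BackupSucceeded", "Snapshot backup succeeded")
}

func main() {
	fmt.Println(*checkWithDefault(false))
	fmt.Println(*checkWithHelper(false))
}
```

Both variants yield the same condition type and status; the difference is only in how much context a reader needs at each return site.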

druidv1alpha1.ConditionUnknown, Unknown,
fmt.Sprintf("Unable to fetch full snap lease. %s", fullSnapshotLeaseErr.Error()),
)
}

if deltaSnapshotLeaseErr != nil {
return createBackupConditionResult(
druidv1alpha1.ConditionUnknown, Unknown,
fmt.Sprintf("Unable to fetch delta snap lease. %s", deltaSnapshotLeaseErr.Error()),
)
}

deltaSnapshotLeaseRenewTime := deltaSnapSnapshotLease.Spec.RenewTime
fullSnapshotLeaseRenewTime := fullSnapshotLease.Spec.RenewTime
fullSnapshotLeaseCreationTime := &fullSnapshotLease.ObjectMeta.CreationTimestamp
fullSnapshotDuration, err := utils.ComputeScheduleDuration(*etcd.Spec.Backup.FullSnapshotSchedule)
if err != nil {
return createBackupConditionResult(
druidv1alpha1.ConditionUnknown, Unknown,
fmt.Sprintf("Unable to compute full snapshot duration from schedule. %v", err.Error()),
)
}

// Both snapshot leases are not yet renewed
if fullSnapshotLeaseRenewTime == nil && deltaSnapshotLeaseRenewTime == nil {
return createBackupConditionResult(
druidv1alpha1.ConditionUnknown, Unknown,
"Snapshotter has not started yet",
)
}

// Both snap leases are renewed, ie, maintained. Both are expected to be renewed periodically
if fullSnapshotLeaseRenewTime != nil && deltaSnapshotLeaseRenewTime != nil {
return handleRenewedSnapshotLeases(
fullSnapshotLease.Spec.RenewTime.Time, fullSnapshotDuration,
deltaSnapSnapshotLease.Spec.RenewTime.Time, 2*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration,
)
}

// Full snapshot lease is renewed, while delta snapshot lease is not.
// Most probable during a startup of a new cluster and only full snapshot has been taken
if fullSnapshotLeaseRenewTime != nil && deltaSnapshotLeaseRenewTime == nil {
return handleOnlyFullSnapshotLeaseRenewal(
fullSnapshotLease.Spec.RenewTime.Time,
5*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration,
Contributor

Can we create functions for the delta and full snapshot grace periods? These can be methods on the etcd resource itself; if you do not like that, they can be standalone functions. That gives one single place to compute them and avoids duplicating the multipliers in more than one place in future code changes.

Contributor Author

If you notice, the grace period isn't the same across functions. The grace period for renewing a delta snapshot lease that was already renewed previously is 2x the delta snapshot period, while the grace period for delta snapshotting to start once a full snapshot has been taken is 5x the delta snapshot period. So it's not achievable with one single function. Also, this is condition-specific information and may not make sense as part of the API, because the API has no knowledge of which conditions druid sets on the etcd status, and it shouldn't.
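
One way to reconcile the two positions, sketched under assumptions: keep the multipliers next to the condition check (not on the API type) but name each of them once. The function and constant names below are hypothetical; only the 2x and 5x values come from the diff.

```go
package main

import (
	"fmt"
	"time"
)

// Multipliers as used in this diff. The constant names are hypothetical.
const (
	deltaRenewalMultiplier = 2 // delta lease was renewed before; next renewal is due
	deltaStartupMultiplier = 5 // only a full snapshot exists; delta snapshotting has yet to begin
)

// deltaSnapshotRenewalGracePeriod returns how long an already-renewed delta
// snapshot lease may go without renewal before it is considered stale.
func deltaSnapshotRenewalGracePeriod(deltaSnapshotPeriod time.Duration) time.Duration {
	return deltaRenewalMultiplier * deltaSnapshotPeriod
}

// deltaSnapshotStartupGracePeriod returns how long to wait for the first
// delta snapshot after a full snapshot has been taken.
func deltaSnapshotStartupGracePeriod(deltaSnapshotPeriod time.Duration) time.Duration {
	return deltaStartupMultiplier * deltaSnapshotPeriod
}

func main() {
	period := 5 * time.Minute
	fmt.Println(deltaSnapshotRenewalGracePeriod(period)) // 10m0s
	fmt.Println(deltaSnapshotStartupGracePeriod(period)) // 25m0s
}
```

Two small functions keep the distinct grace periods separate while still giving each multiplier a single definition.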

)
}

// Delta snapshot lease is renewed, while full snapshot lease is not.
// Most probable during reconcile of existing clusters if fresh leases are created
return handleOnlyDeltaSnapshotLeaseRenewal(
fullSnapshotLeaseCreationTime.Time, fullSnapshotDuration,
deltaSnapshotLeaseRenewTime.Time, 2*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration,
)
}

func (a *backupReadyCheck) fetchLease(ctx context.Context, name string, namespace string) (*coordinationv1.Lease, error) {
lease := &coordinationv1.Lease{}
err := a.cl.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, lease)
return lease, err
}

func createBackupConditionResult(status druidv1alpha1.ConditionStatus, reason string, message string) *result {
return &result{
conType: druidv1alpha1.ConditionTypeBackupReady,
status: status,
reason: reason,
message: message,
}
}

func isLeaseStale(renewTime time.Time, renewalGracePeriod time.Duration) bool {
return time.Since(renewTime) > renewalGracePeriod
}

func wasLeaseCreatedRecently(creationTime time.Time, creationGracePeriod time.Duration) bool {
return time.Since(creationTime) < creationGracePeriod
}

// handleRenewedSnapshotLeases checks whether full and delta snapshot leases
// have been renewed within the required times respectively.
func handleRenewedSnapshotLeases(fullSnapshotLeaseRenewTime time.Time, fullSnapshotLeaseRenewalGracePeriod time.Duration,
deltaSnapshotLeaseRenewTime time.Time, deltaSnapshotLeaseRenewalGracePeriod time.Duration) *result {
isFullSnapshotLeaseStale := isLeaseStale(fullSnapshotLeaseRenewTime, fullSnapshotLeaseRenewalGracePeriod)
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

if isFullSnapshotLeaseStale && !isDeltaSnapshotLeaseStale {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
fmt.Sprintf("Stale full snapshot lease. Not renewed for %v", fullSnapshotLeaseRenewalGracePeriod),
)
}

if !isFullSnapshotLeaseStale && isDeltaSnapshotLeaseStale {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
fmt.Sprintf("Stale delta snapshot lease. Not renewed for %v", deltaSnapshotLeaseRenewalGracePeriod),
)
}

if isFullSnapshotLeaseStale && isDeltaSnapshotLeaseStale {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
fmt.Sprintf("Stale snapshot leases. Full snapshot lease not renewed for %v and delta snapshot lease not renewed for %v",
fullSnapshotLeaseRenewalGracePeriod, deltaSnapshotLeaseRenewalGracePeriod),
)
}

return createBackupConditionResult(
druidv1alpha1.ConditionTrue, BackupSucceeded,
"Snapshot backup succeeded",
)
fullSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getFullSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, fullSnapLease)
incrSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getDeltaSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, deltaSnapLease)

//Set status to Unknown if errors in fetching snapshot leases or lease never renewed
if fullSnapErr != nil || incrSnapErr != nil || (fullSnapLease.Spec.RenewTime == nil && deltaSnapLease.Spec.RenewTime == nil) {
return result
}

deltaLeaseRenewTime := deltaSnapLease.Spec.RenewTime
fullLeaseRenewTime := fullSnapLease.Spec.RenewTime
fullLeaseCreateTime := &fullSnapLease.ObjectMeta.CreationTimestamp

if fullLeaseRenewTime == nil && deltaLeaseRenewTime != nil {
// Most probable during reconcile of existing clusters if fresh leases are created
// Treat backup as succeeded if delta snap lease renewal happens in the required time window and full snap lease is not older than 24h.
if time.Since(deltaLeaseRenewTime.Time) < 2*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration && time.Since(fullLeaseCreateTime.Time) < 24*time.Hour {
result.reason = BackupSucceeded
result.message = "Delta snapshot backup succeeded"
result.status = druidv1alpha1.ConditionTrue
return result
}
} else if deltaLeaseRenewTime == nil && fullLeaseRenewTime != nil {
//Most probable during a startup scenario for new clusters
//Special case. Return Unknown condition for some time to allow delta backups to start up
if time.Since(fullLeaseRenewTime.Time) > 5*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration {
result.message = "Periodic delta snapshots not started yet"
return result
}
} else if deltaLeaseRenewTime != nil && fullLeaseRenewTime != nil {
//Both snap leases are maintained. Both are expected to be renewed periodically
if time.Since(deltaLeaseRenewTime.Time) < 2*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration && time.Since(fullLeaseRenewTime.Time) < 24*time.Hour {
result.reason = BackupSucceeded
result.message = "Snapshot backup succeeded"
result.status = druidv1alpha1.ConditionTrue
return result
}
}

//Cases where snapshot leases are not updated for a long time
//If snapshot leases are present and leases aren't updated, it is safe to assume that backup is not healthy

if etcd.Status.Conditions != nil {
var prevBackupReadyStatus druidv1alpha1.Condition
for _, prevBackupReadyStatus = range etcd.Status.Conditions {
if prevBackupReadyStatus.Type == druidv1alpha1.ConditionTypeBackupReady {
break
}
}

// Transition to "False" state only if present state is "Unknown" or "False"
if deltaLeaseRenewTime != nil && (prevBackupReadyStatus.Status == druidv1alpha1.ConditionUnknown || prevBackupReadyStatus.Status == druidv1alpha1.ConditionFalse) {
if time.Since(deltaLeaseRenewTime.Time) > 3*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration {
result.status = druidv1alpha1.ConditionFalse
result.reason = BackupFailed
result.message = "Stale snapshot leases. Not renewed in a long time"
return result
}
}
}

//Transition to "Unknown" state if we cannot prove a "True" state
return result
}

func getDeltaSnapLeaseName(etcd *druidv1alpha1.Etcd) string {
return fmt.Sprintf("%s-delta-snap", string(etcd.ObjectMeta.Name))
// handleOnlyFullSnapshotLeaseRenewal handles cases where snapshotter has just started,
// so only full snapshot has been taken, while delta snapshot is still not taken.
// Returns `Unknown` condition for some time to allow delta snapshotting to begin, because
// even though the full snapshot may have succeeded within the required time, we must still wait
// for delta snapshotting to begin to consider the backups as healthy, because full snapshot is old now.
func handleOnlyFullSnapshotLeaseRenewal(fullSnapshotLeaseRenewTime time.Time, deltaSnapshotRenewalGracePeriod time.Duration) *result {
if time.Since(fullSnapshotLeaseRenewTime) > deltaSnapshotRenewalGracePeriod {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
Contributor

I am not entirely convinced if this results in BackupFailed. Can you semantically define this term to better gauge if this is the correct reason code? The reason is that you still have a full-snapshot that has been successfully backed-up. Should you call this BackupFailed or DeltaSnapshotBackupFailed?

Contributor Author

That again depends on how granular we want our Reason codes to be. BackupFailed denotes that either of the backups failed, and either case is dangerous. Without a full snapshot taken on time, we risk a longer restoration time and hence a longer RTO. And without a delta snapshot taken on time, we violate SLAs since the RPO is affected. So differentiating between which snapshot failed in the Reason does not provide much benefit to operators. If they want more info on why BackupFailed was set, they can look into the Message.
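
The split described here, a coarse Reason with the detail carried in the Message, can be sketched as below. This is a simplified illustration: `describeStaleness` is a hypothetical helper, the staleness flags are assumed to be computed elsewhere, and the message strings are shortened versions of those in the diff.

```go
package main

import "fmt"

// describeStaleness (hypothetical helper) maps the staleness of the two
// snapshot leases to a single coarse reason plus a detailed message.
func describeStaleness(fullStale, deltaStale bool) (reason, message string) {
	switch {
	case fullStale && deltaStale:
		return "BackupFailed", "Stale snapshot leases"
	case fullStale:
		return "BackupFailed", "Stale full snapshot lease"
	case deltaStale:
		return "BackupFailed", "Stale delta snapshot lease"
	default:
		return "BackupSucceeded", "Snapshot backup succeeded"
	}
}

func main() {
	reason, message := describeStaleness(false, true)
	fmt.Printf("%s: %s\n", reason, message)
}
```

An operator alerting on the reason sees one failure code in all three stale cases and reads the message to find out which lease was affected.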

fmt.Sprintf("Delta snapshot backup failed. Delta snapshot lease not renewed for %v", deltaSnapshotRenewalGracePeriod),
)
}

return createBackupConditionResult(
druidv1alpha1.ConditionUnknown, Unknown,
Contributor

Not clear on why the reason code should be Unknown. Also using the same condition for full and delta is quite confusing to correctly determine the status of the overall backup. If we now see this condition in the etcd status.Conditions then one must always remember that this condition means that a full snapshot was indeed taken and we are waiting for delta snapshot backups to be taken. This is not clear from this condition as we have decided to overload one condition for 2 different snapshots.

Contributor Author

Agreed that something like PartiallyUnknown sounds better than Unknown. But it simply denotes that there is an "unknown" in the cluster that druid does not know about, hence it sets this reason. It provides more detailed information in the message.

"Waiting for delta snapshotting to begin",
)
}

func getFullSnapLeaseName(etcd *druidv1alpha1.Etcd) string {
return fmt.Sprintf("%s-full-snap", string(etcd.ObjectMeta.Name))
// handleOnlyDeltaSnapshotLeaseRenewal handles cases where only delta snapshot lease is renewed,
// while the full snapshot lease is not renewed, possibly since it might have been recreated recently.
// Treats backup as succeeded if delta snapshot lease is renewed within the required time window
// and full snapshot lease object is not older than the computed full snapshot duration.
func handleOnlyDeltaSnapshotLeaseRenewal(fullSnapshotLeaseCreationTime time.Time, fullSnapshotLeaseCreationGracePeriod time.Duration,
deltaSnapshotLeaseRenewTime time.Time, deltaSnapshotLeaseRenewalGracePeriod time.Duration) *result {
// wasFullSnapshotLeaseCreatedRecently indicates whether the full snapshot lease was created within the given grace period. If it was,
// then it can be renewed only upon the next scheduled full snapshot event, so until then we cannot assume the status of the last
// full snapshot. But if it was created before the grace period and has not been renewed, then the full snapshot lease can be considered stale.
wasFullSnapshotLeaseCreatedRecently := wasLeaseCreatedRecently(fullSnapshotLeaseCreationTime, fullSnapshotLeaseCreationGracePeriod)

// isDeltaSnapshotLeaseStale indicates whether the delta snapshot lease has not been renewed within the given grace period.
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

// Delta snapshot lease is stale, while staleness of full snapshot lease cannot be determined yet
if isDeltaSnapshotLeaseStale && wasFullSnapshotLeaseCreatedRecently {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
fmt.Sprintf("Delta snapshot backup failed. Delta snapshot lease not renewed for %v", deltaSnapshotLeaseRenewalGracePeriod),
)
}

// Both delta and full snapshot leases are stale
if isDeltaSnapshotLeaseStale && !wasFullSnapshotLeaseCreatedRecently {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
fmt.Sprintf("Stale snapshot leases. Full snapshot lease not renewed for %v and delta snapshot lease not renewed for %v",
fullSnapshotLeaseCreationGracePeriod, deltaSnapshotLeaseRenewalGracePeriod),
)
}

// Delta snapshot lease is not stale, while full snapshot lease is
if !wasFullSnapshotLeaseCreatedRecently {
return createBackupConditionResult(
druidv1alpha1.ConditionFalse, BackupFailed,
"Full snapshot backup failed. Full snapshot lease created long ago, but not renewed",
)
}

// Delta snapshot lease is not stale, while staleness of full snapshot lease cannot be determined yet,
// hence we explicitly mention success of delta snapshot but do not mention full snapshot.
return createBackupConditionResult(
druidv1alpha1.ConditionTrue, BackupSucceeded,
"Delta snapshot backup succeeded",
)
}

// BackupReadyCheck returns a check for the "BackupReady" condition.