Issue 153: Add Rollback support for Pravega Cluster #255

Merged
merged 18 commits into from
Sep 20, 2019
184 changes: 184 additions & 0 deletions doc/rollback-cluster.md
@@ -0,0 +1,184 @@
# Pravega Cluster Rollback

This document details how manual rollback can be triggered after a Pravega cluster upgrade fails.
Note that a rollback can be triggered only on Upgrade Failure.

## Upgrade Failure

An upgrade can fail for the following reasons:

1. Incorrect configuration (wrong quota, permissions, limit ranges)
2. Network issues (ImagePullError)
3. Kubernetes cluster issues
4. Application issues (application runtime misconfiguration or code bugs)

An upgrade failure can manifest as a Pod that stays in the `Pending` state indefinitely, or one that continuously restarts or crashes (`CrashLoopBackOff`).
A component deployment failure needs to be tracked and mapped to an "Upgrade Failure" for the Pravega cluster.
Here we try to fail fast by explicitly checking for common causes of deployment failure, such as image pull errors or the `CrashLoopBackOff` state, and failing the upgrade if any pod runs into one of these states during the upgrade.
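
The fail-fast check described above can be sketched as follows. This is a minimal Python illustration, not the operator's actual code: the function name `pod_failure_reason` and the exact set of reasons are assumptions, and the pod dictionary mimics the shape of `kubectl get pod -o json` output.

```python
# Hypothetical fail-fast check; the names and the reason set are illustrative assumptions.
FAIL_FAST_REASONS = {"ErrImagePull", "ImagePullBackOff", "CrashLoopBackOff"}


def pod_failure_reason(pod):
    """Return the first fail-fast waiting reason found in a pod dict, or None."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for container_status in statuses:
        # A container that is not waiting has no "waiting" entry in its state.
        waiting = container_status.get("state", {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in FAIL_FAST_REASONS:
            return reason
    return None


healthy = {"status": {"containerStatuses": [{"state": {"running": {}}}]}}
stuck = {"status": {"containerStatuses": [
    {"state": {"waiting": {"reason": "ImagePullBackOff"}}}]}}
```

A pod that returns a non-`None` reason here would be the kind of pod that causes the upgrade to be marked as failed.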

The following Pravega Cluster Status Condition indicates a Failed Upgrade:

```
ClusterConditionType: Error
Status: True
Reason: UpgradeFailed
Message: <Details of exception/cause of failure>
```
After an Upgrade Failure the output of `kubectl describe pravegacluster pravega` would look like this:

```
$> kubectl describe pravegacluster pravega
. . .
Spec:
. . .
Version: 0.6.0-2252.b6f6512
. . .
Status:
. . .
Conditions:
Last Transition Time: 2019-09-06T09:00:13Z
Last Update Time: 2019-09-06T09:00:13Z
Status: False
Type: Upgrading
Last Transition Time: 2019-09-06T08:58:40Z
Last Update Time: 2019-09-06T08:58:40Z
Status: False
Type: PodsReady
Last Transition Time: 2019-09-06T09:00:13Z
Last Update Time: 2019-09-06T09:00:13Z
Message: failed to sync segmentstore version. pod pravega-pravega-segmentstore-0 update failed because of ImagePullBackOff
Reason: UpgradeFailed
Status: True
Type: Error
. . .
Current Version: 0.6.0-2239.6e24df7
. . .
Version History:
0.6.0-2239.6e24df7
```
where `0.6.0-2252.b6f6512` is the version we tried upgrading to and `0.6.0-2239.6e24df7` is the cluster version prior to triggering the upgrade.
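
A script that watches for this condition could look roughly like the sketch below. It assumes the conditions have already been loaded into a list of dictionaries with lower-case keys (`type`, `status`, `reason`, `message`); those key names and the function name `failed_upgrade_message` are assumptions made for illustration.

```python
def failed_upgrade_message(conditions):
    """Return the failure message if the Error condition reports UpgradeFailed, else None."""
    for cond in conditions:
        if (cond.get("type") == "Error"
                and cond.get("status") == "True"
                and cond.get("reason") == "UpgradeFailed"):
            return cond.get("message", "")
    return None


# Conditions mirroring the `kubectl describe` output above.
conditions = [
    {"type": "Upgrading", "status": "False"},
    {"type": "PodsReady", "status": "False"},
    {"type": "Error", "status": "True", "reason": "UpgradeFailed",
     "message": "failed to sync segmentstore version. pod "
                "pravega-pravega-segmentstore-0 update failed because of ImagePullBackOff"},
]
```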

## Manual Rollback Trigger

A Rollback is triggered when a Pravega Cluster is in the `UpgradeFailed` Error State and a user manually updates the `version` field in the PravegaCluster spec to point to the last stable cluster version.

A Rollback involves moving all components in the cluster back to the last stable cluster version. As with upgrades, the operator rolls back one component at a time and one pod at a time to preserve high-availability.

Note:
1. A Rollback to only the last stable cluster version is supported at this point.
2. Changing the cluster spec version to the previous cluster version when the cluster is not in the `UpgradeFailed` state will not trigger a rollback.
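
The two rules above can be expressed as a small predicate. The sketch below is a hypothetical model, not the operator's code: it assumes the last entry of the version history is the last stable version, and the function and key names are made up for illustration.

```python
def is_rollback_request(conditions, version_history, requested_version):
    """Return True only if the cluster is in UpgradeFailed and the requested
    version is the last stable one (assumed: the last version history entry)."""
    in_upgrade_failed = any(
        c.get("type") == "Error"
        and c.get("status") == "True"
        and c.get("reason") == "UpgradeFailed"
        for c in conditions
    )
    if not in_upgrade_failed or not version_history:
        return False
    return requested_version == version_history[-1]


failed = [{"type": "Error", "status": "True", "reason": "UpgradeFailed"}]
healthy = [{"type": "PodsReady", "status": "True"},
           {"type": "Error", "status": "False"}]
history = ["0.6.0-2239.6e24df7"]
```

With these inputs, only the combination of `failed` and `0.6.0-2239.6e24df7` counts as a rollback request.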

## Rollback Implementation

When Rollback is triggered the cluster moves into ClusterCondition `RollbackInProgress`.
Once the Rollback completes, this condition is set to false.

During a Rollback, the Cluster Status should look something like:
```
$> kubectl describe pravegacluster pravega
. . .
Status:
Conditions:
Last Transition Time: 2019-09-20T10:41:10Z
Last Update Time: 2019-09-20T10:41:10Z
Status: False
Type: Upgrading
Last Transition Time: 2019-09-20T10:45:12Z
Last Update Time: 2019-09-20T10:45:12Z
Status: True
Type: PodsReady
Last Transition Time: 2019-09-20T10:41:10Z
Last Update Time: 2019-09-20T10:41:10Z
Message: failed to sync segmentstore version. pod pravega-pravega-segmentstore-0 update failed because of ImagePullBackOff
Reason: UpgradeFailed
Status: True
Type: Error
Last Update Time: 2019-09-20T10:45:12Z
Message: 1
Reason: Updating Bookkeeper
Status: True
Type: RollbackInProgress
. . .
```
Here the `RollbackInProgress` condition being `true` indicates that a Rollback is in progress.
The `Reason` and `Message` fields of this condition indicate the component being rolled back and the number of updated replicas, respectively.
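
As a sketch of how these two fields could be read programmatically, the function below extracts the component name and the replica count. It assumes the `Reason` value always has the form `Updating <component>` as in the example above; the function name and the lower-case dictionary keys are illustrative assumptions.

```python
def rollback_progress(conditions):
    """Return (component, updated_replicas) while a rollback is running, else None."""
    for cond in conditions:
        if cond.get("type") == "RollbackInProgress" and cond.get("status") == "True":
            reason = cond.get("reason", "")
            prefix = "Updating "
            # Strip the assumed "Updating " prefix to get the component name.
            component = reason[len(prefix):] if reason.startswith(prefix) else reason
            message = cond.get("message", "")
            # The message is expected to be a replica count; fall back to None otherwise.
            updated = int(message) if message.isdigit() else None
            return component, updated
    return None


status = [
    {"type": "Error", "status": "True", "reason": "UpgradeFailed"},
    {"type": "RollbackInProgress", "status": "True",
     "reason": "Updating Bookkeeper", "message": "1"},
]
```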

The operator rolls back components in the reverse of the upgrade order:

1. Pravega Controller
2. Pravega Segment Store
3. BookKeeper
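
The ordering above, combined with the one-pod-at-a-time behaviour, can be illustrated with a short sketch. The upgrade order used here is inferred as the reverse of the list above, the replica counts are made up, and the pod ordering within a component is an assumption.

```python
# Assumed upgrade order, inferred as the reverse of the rollback order listed above.
UPGRADE_ORDER = ["BookKeeper", "Pravega Segment Store", "Pravega Controller"]


def rollback_plan(replicas):
    """Yield (component, pod_index) steps, one pod at a time, in reverse upgrade order."""
    for component in reversed(UPGRADE_ORDER):
        # Pod ordering inside a component is an assumption of this sketch.
        for index in range(replicas.get(component, 0)):
            yield component, index


# Hypothetical replica counts, for illustration only.
plan = list(rollback_plan({
    "BookKeeper": 2,
    "Pravega Segment Store": 1,
    "Pravega Controller": 1,
}))
```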

A `versionHistory` field maintains the history of upgrades. It appears as `Version History` under `Status` in the `kubectl describe` output shown earlier.

## Rollback Outcome

### Success
If the Rollback completes successfully, the cluster returns to the `PodsReady` condition, which means the cluster is now in a stable state. All other conditions should be `false`.
```
Last Transition Time: 2019-09-20T09:49:26Z
Last Update Time: 2019-09-20T09:49:26Z
Status: True
Type: PodsReady

```

Example:
```
Status:
Conditions:
Last Transition Time: 2019-09-20T10:12:04Z
Last Update Time: 2019-09-20T10:12:04Z
Status: False
Type: Upgrading
Last Transition Time: 2019-09-20T10:11:34Z
Last Update Time: 2019-09-20T10:11:34Z
Status: True
Type: PodsReady
Last Transition Time: 2019-09-20T10:07:19Z
Last Update Time: 2019-09-20T10:07:19Z
Status: False
Type: Error
Last Transition Time: 2019-09-20T09:50:57Z
Last Update Time: 2019-09-20T09:50:57Z
Status: False
Type: RollbackInProgress
```
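
The success criterion, `PodsReady` being true and every other condition being false, can be checked with a few lines of Python. This is an illustrative sketch; the function name `is_stable` and the lower-case keys are assumptions.

```python
def is_stable(conditions):
    """Return True if PodsReady is True and every other condition is False."""
    by_type = {c["type"]: c["status"] for c in conditions}
    if by_type.get("PodsReady") != "True":
        return False
    return all(status == "False"
               for ctype, status in by_type.items()
               if ctype != "PodsReady")


# Conditions mirroring the successful rollback example above.
after_rollback = [
    {"type": "Upgrading", "status": "False"},
    {"type": "PodsReady", "status": "True"},
    {"type": "Error", "status": "False"},
    {"type": "RollbackInProgress", "status": "False"},
]
```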

### Failure
If the Rollback fails, the cluster moves to the `Error` state, indicated by this cluster condition:
```
ClusterConditionType: Error
Status: True
Reason: RollbackFailed
Message: <Details of exception/cause of failure>
```

Example:
```
Status:
Conditions:
Last Transition Time: 2019-09-20T09:46:24Z
Last Update Time: 2019-09-20T09:46:24Z
Status: False
Type: Upgrading
Last Transition Time: 2019-09-20T09:49:26Z
Last Update Time: 2019-09-20T09:49:26Z
Status: False
Type: PodsReady
Last Transition Time: 2019-09-20T09:46:24Z
Last Update Time: 2019-09-20T09:50:57Z
Message: failed to sync bookkeeper version. pod pravega-bookie-0 update failed because of ImagePullBackOff
Reason: RollbackFailed
Status: True
Type: Error
Last Transition Time: 2019-09-20T09:50:57Z
Last Update Time: 2019-09-20T09:50:57Z
Status: False
Type: RollbackInProgress
```

When a rollback fails, manual intervention is required to resolve it.
After finding and fixing the root cause of the failure, a user can bring the cluster back to a stable state by upgrading to either:
1. The version the user originally intended to upgrade to (when the upgrade failure was noticed).
2. Any other supported version, based on the versions of all pods in the cluster.
47 changes: 40 additions & 7 deletions doc/upgrade-cluster.md
@@ -20,8 +20,6 @@ Check out [Pravega documentation](http://pravega.io/docs/latest/) for more infor

## Pending tasks

- The rollback mechanism is on the roadmap but not implemented yet. Check out [this issue](https://github.com/pravega/pravega-operator/issues/153).
- Manual recovery from an upgrade is possible but it has not been defined yet. Check out [this issue](https://github.com/pravega/pravega-operator/issues/157).
- There is no validation of the configured desired version. Check out [this issue](https://github.com/pravega/pravega-operator/issues/156)


@@ -35,6 +33,19 @@ NAME VERSION DESIRED MEMBERS READY MEMBERS AGE
example 0.4.0 7 7 11m
```

## Upgrade Path Matrix

| BASE VERSION | TARGET VERSION |
| ------------ | ---------------- |
| 0.1.0 | 0.1.0 |
| 0.2.0 | 0.2.0 |
| 0.3.0 | 0.3.0, 0.3.1, 0.3.2|
| 0.3.1 | 0.3.1, 0.3.2 |
| 0.3.2 | 0.3.2 |
| 0.4.0 | 0.4.0 |
| 0.5.0 | 0.5.0, 0.6.0 |
| 0.6.0 | 0.6.0 |
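
Since the operator does not yet validate the desired version, a client-side check against this matrix could be sketched as below. The dictionary simply transcribes the table; the function name `is_supported_upgrade` is hypothetical and not part of the operator.

```python
# Transcription of the upgrade path matrix above: base version -> allowed targets.
UPGRADE_PATHS = {
    "0.1.0": ["0.1.0"],
    "0.2.0": ["0.2.0"],
    "0.3.0": ["0.3.0", "0.3.1", "0.3.2"],
    "0.3.1": ["0.3.1", "0.3.2"],
    "0.3.2": ["0.3.2"],
    "0.4.0": ["0.4.0"],
    "0.5.0": ["0.5.0", "0.6.0"],
    "0.6.0": ["0.6.0"],
}


def is_supported_upgrade(base, target):
    """Return True if the matrix lists `target` as a valid target for `base`."""
    return target in UPGRADE_PATHS.get(base, [])
```

For example, `is_supported_upgrade("0.5.0", "0.6.0")` is true, while an unknown base version always yields false.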

## Trigger an upgrade

To initiate an upgrade process, a user has to update the `spec.version` field on the `PravegaCluster` custom resource. This can be done in three different ways using the `kubectl` command.
@@ -103,8 +114,7 @@ Segment Store instances need access to a persistent volume to store the cache. L

Also, Segment Store pods need to be individually accessed by clients, so having a stable network identifier provided by the Statefulset and a headless service is very convenient.

-Same as Bookkeeper, we use `OnDelete` strategy for Segment Store. The reason that we don't use `RollingUpdate` strategy here is that we found it convenient to manage the upgrade
-and rollback in the same fashion. Using `RollingUpdate` will introduce Kubernetes rollback mechanism which will cause trouble to our implementation.
+Same as Bookkeeper, we use `OnDelete` strategy for Segment Store. The reason that we don't use `RollingUpdate` strategy here is that we found it convenient to manage the upgrade and rollback in the same fashion. Using `RollingUpdate` will introduce Kubernetes rollback mechanism which will cause trouble to our implementation.

### Pravega Controller upgrade

@@ -131,7 +141,30 @@ NAME VERSION DESIRED MEMBERS READY MEMBERS AGE
example 0.5.0 8 8 1h
```

If your upgrade has failed, you can describe the status section of your Pravega cluster to discover why.
The command `kubectl describe` can be used to track progress of the upgrade.
```
$ kubectl describe PravegaCluster example
...
Status:
Conditions:
Status: True
Type: Upgrading
Reason: Updating BookKeeper
Message: 1
Last Transition Time: 2019-04-01T19:42:37+02:00
Last Update Time: 2019-04-01T19:42:37+02:00
Status: False
Type: PodsReady
Last Transition Time: 2019-04-01T19:43:08+02:00
Last Update Time: 2019-04-01T19:43:08+02:00
Status: False
Type: Error
...

```
The `Reason` field of the `Upgrading` condition shows the component currently being upgraded, and the `Message` field reflects the number of successfully upgraded replicas in that component.

If the upgrade has failed, check the `Status` section to understand the reason for the failure.

```
$ kubectl describe PravegaCluster example
@@ -181,10 +214,10 @@ INFO[5899] Reconciling PravegaCluster default/example
INFO[5900] statefulset (example-bookie) status: 1 updated, 2 ready, 3 target
INFO[5929] Reconciling PravegaCluster default/example
INFO[5930] statefulset (example-bookie) status: 1 updated, 2 ready, 3 target
-INFO[5930] error syncing cluster version, need manual intervention. failed to sync bookkeeper version. pod example-bookie-0 is restarting
+INFO[5930] error syncing cluster version, upgrade failed. failed to sync bookkeeper version. pod example-bookie-0 is restarting
...
```

### Recovering from a failed upgrade

-Not defined yet. Check [this issue](https://github.com/pravega/pravega-operator/issues/157) for tracking.
+See [Rollback](rollback-cluster.md)