[Docs] upgrade/chain halt recovery #837

Open
wants to merge 18 commits into
base: main
Changes from 14 commits
5 changes: 5 additions & 0 deletions Dockerfile.release
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,11 @@
FROM debian:bookworm
ARG TARGETARCH

# Install necessary packages.
RUN apt-get update && \
apt-get install -y --no-install-recommends ca-certificates && \
rm -rf /var/lib/apt/lists/*

# Use `1025` G/UID so users can switch between this and `heighliner` image without a need to chown the files.
RUN groupadd -g 1025 pocket && useradd -u 1025 -g pocket -m -s /sbin/nologin pocket

Expand Down
2 changes: 1 addition & 1 deletion api/poktroll/application/event.pulsar.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions app/upgrades.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import (
// so `cosmovisor` can automatically pull the binary from GitHub.
var allUpgrades = []upgrades.Upgrade{
upgrades.Upgrade_0_0_4,
upgrades.Upgrade_0_0_9,
}

// setUpgrades sets upgrade handlers for all upgrades and executes KVStore migration if an upgrade plan file exists.
Expand Down
9 changes: 9 additions & 0 deletions app/upgrades/historical.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ package upgrades

import (
"context"
"fmt"

storetypes "cosmossdk.io/store/types"
upgradetypes "cosmossdk.io/x/upgrade/types"
Expand All @@ -29,6 +30,7 @@ func defaultUpgradeHandler(
configurator module.Configurator,
) upgradetypes.UpgradeHandler {
return func(ctx context.Context, plan upgradetypes.Plan, vm module.VersionMap) (module.VersionMap, error) {
fmt.Println("Starting the migration in defaultUpgradeHandler.")
return mm.RunMigrations(ctx, configurator, vm)
}
}
Expand Down Expand Up @@ -87,3 +89,10 @@ var Upgrade_0_0_4 = Upgrade{
// No changes to the KVStore in this upgrade.
StoreUpgrades: storetypes.StoreUpgrades{},
}

// Upgrade_0_0_9 is a small upgrade on TestNet.
var Upgrade_0_0_9 = Upgrade{
PlanName: "v0.0.9",
CreateUpgradeHandler: defaultUpgradeHandler,
StoreUpgrades: storetypes.StoreUpgrades{},
}
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,15 @@ title: Chain Halt Troubleshooting
- [Understanding Chain Halts](#understanding-chain-halts)
- [Definition and Causes](#definition-and-causes)
- [Impact on Network](#impact-on-network)
- [Troubleshooting Process](#troubleshooting-process)
- [Troubleshooting `wrong Block.Header.AppHash`](#troubleshooting-wrong-blockheaderapphash)
- [Step 1: Identifying the Issue](#step-1-identifying-the-issue)
- [Step 2: Collecting Node Data](#step-2-collecting-node-data)
- [Step 3: Analyzing Discrepancies](#step-3-analyzing-discrepancies)
- [Step 4: Decoding and Interpreting Data](#step-4-decoding-and-interpreting-data)
- [Step 5: Comparing Records](#step-5-comparing-records)
- [Step 6: Investigation and Resolution](#step-6-investigation-and-resolution)
- [Troubleshooting `wrong Block.Header.LastResultsHash`](#troubleshooting-wrong-blockheaderlastresultshash)
- [Syncing from genesis](#syncing-from-genesis)

## Understanding Chain Halts

Expand All @@ -40,7 +42,7 @@ Chain halts can have severe consequences for the network:

Given these impacts, swift and effective troubleshooting is crucial to maintain network health and user trust.

## Troubleshooting Process
## Troubleshooting `wrong Block.Header.AppHash`

### Step 1: Identifying the Issue

Expand Down Expand Up @@ -94,3 +96,20 @@ Based on the identified discrepancies:
2. Develop a fix or patch to address the issue.
3. If necessary, initiate discussions with the validator community to reach social consensus on how to proceed.
4. Implement the agreed-upon solution and monitor the network closely during and after the fix.

## Troubleshooting `wrong Block.Header.LastResultsHash`

Errors like the following can occur from using the incorrect binary version at a certain height.

```bash
reactor validation error: wrong Block.Header.LastResultsHash.
```

The solution is to use the correct binary version to sync the full node at the correct height.

Tools like [cosmovisor](https://docs.cosmos.network/v0.45/run-node/cosmovisor.html) make it easier
to sync a node from genesis, using the appropriate binary for each range of block heights.

## Syncing from genesis

If you encounter any of the errors mentioned above while syncing historical blocks, make sure you are running the correct version of the binary for each range of heights, as listed in the [Upgrade List](../../protocol/upgrades/upgrade_list.md).
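
As a rough illustration of choosing a binary per height range, the Go sketch below looks up the version for a given height in an upgrade table. The `upgrade` type and `versionForHeight` function are hypothetical names and are not part of `poktroll`. The sample data mirrors the Alpha TestNet rows of the Upgrade List, and the sketch assumes that the block at an upgrade height is already processed by the new binary.

```go
package main

import (
	"fmt"
	"sort"
)

// upgrade is a hypothetical record: the height at which a version takes over.
type upgrade struct {
	Height  int64
	Version string
}

// versionForHeight returns the binary version expected to process the given height.
// Assumption: the block at an upgrade height is processed by the new version.
func versionForHeight(genesisVersion string, upgrades []upgrade, height int64) string {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]upgrade(nil), upgrades...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Height < sorted[j].Height })

	version := genesisVersion
	for _, u := range sorted {
		if height < u.Height {
			break
		}
		version = u.Version
	}
	return version
}

func main() {
	// Values taken from the Alpha TestNet table in the Upgrade List.
	alphaTestNet := []upgrade{{Height: 17102, Version: "v0.0.9-3"}}
	for _, h := range []int64{1, 17101, 17102} {
		fmt.Printf("height %d -> %s\n", h, versionForHeight("v0.0.9", alphaTestNet, h))
	}
}
```

In practice `cosmovisor` performs this switch for you; the sketch only makes the lookup rule explicit.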
109 changes: 109 additions & 0 deletions docusaurus/docs/develop/developer_guide/recovery_from_chain_halt.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
---
sidebar_position: 7
title: Chain Halt Recovery
---

## Chain Halt Recovery <!-- omit in toc -->

This document describes how to recover from a chain halt. It assumes the cause of
the chain halt has been identified and that a new release has been created and
verified to function correctly.

:::tip
See [Chain Halt Troubleshooting](./chain_halt_troubleshooting.md) for more information on identifying the cause of a chain halt.
:::

- [Background](#background)
- [Resolving halts during a network upgrade](#resolving-halts-during-a-network-upgrade)
- [Manual binary replacement (preferred)](#manual-binary-replacement-preferred)
- [Rollback, fork and upgrade](#rollback-fork-and-upgrade)
- [Step 5: Data rollback - retrieving snapshot at a specific height](#step-5-data-rollback---retrieving-snapshot-at-a-specific-height)
- [Step 6: Validator Isolation - risk mitigation](#step-6-validator-isolation---risk-mitigation)

## Background

Pocket Network is built on top of `cosmos-sdk`, which uses the CometBFT consensus engine.
CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires that **more than** 2/3 of
Validators (by voting power) are online and voting for the same block to reach consensus. To maintain liveness
and avoid a chain halt, we need a supermajority (> 2/3) of Validators to participate
and run the same version of the software.
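
The threshold can be made concrete with a small Go sketch. `hasSupermajority` is a hypothetical helper, not a CometBFT or `poktroll` function; it only illustrates that exactly 2/3 of the voting power is not enough and that the comparison is strict.

```go
package main

import "fmt"

// hasSupermajority reports whether onlinePower is strictly greater than 2/3 of totalPower.
// Integer arithmetic avoids floating-point rounding at the boundary.
func hasSupermajority(onlinePower, totalPower int64) bool {
	if totalPower <= 0 || onlinePower < 0 {
		return false
	}
	return 3*onlinePower > 2*totalPower
}

func main() {
	fmt.Println(hasSupermajority(66, 99)) // exactly 2/3 of the power: not enough
	fmt.Println(hasSupermajority(67, 99)) // strictly more than 2/3: enough
}
```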

## Resolving halts during a network upgrade

If the halt is caused by a network upgrade, the solution can be as simple as
skipping the upgrade (i.e. `--unsafe-skip-upgrades`) and creating a new (fixed) upgrade.

Read more about [upgrade contingency plans](../../protocol/upgrades/contigency_plans.md).

### Manual binary replacement (preferred)
Member

Please update this section w/ links to the binaries for easier access.

Member Author

There are no binaries to point to. Rephrased.


:::note

This is the preferred way of resolving consensus-breaking issues.

:::

Since the chain is not moving, **it is impossible** to issue an automatic upgrade with an upgrade plan.

Instead, we need **social consensus** to manually replace the binary and get the chain moving.

Currently, this breaks the ability to sync the network from genesis without human intervention, but there are plans to make the process less painful in the future.
Member

I don't fully understand what you're trying to say with this sentence. #PUC

Member Author

@okdas okdas Oct 23, 2024

LOL that sentence makes no sense. The original was more comprehensible.


<!-- TODO_IMPROVE(@okdas): add links to Cosmovisor documentation how the new UX can be used to automate syncing from genesis without human input. -->

### Rollback, fork and upgrade
okdas marked this conversation as resolved.

:::info

These instructions are only relevant to Pocket Network's Shannon release.

We do not currently use `x/gov` and on-chain voting for upgrades.

Instead, our DAO votes on upgrades off-chain and the Foundation executes
transactions on their behalf.

:::

**Performing a rollback is analogous to forking the network at the older height.**

This should be avoided unless absolutely necessary.

However, if necessary, the instructions to follow are:

1. Prepare & verify a new binary that addresses the consensus-breaking issue.
2. [Create a release](../../protocol/upgrades/release_process.md).
3. [Prepare an upgrade transaction](../../protocol/upgrades/upgrade_procedure.md#writing-an-upgrade-transaction) to the new version.
4. Get the Validator set off the network **3 blocks** prior to the height of the chain halt. For example:
- Assume an issue at height `103`
- Get the validator set at height `100`
- Submit an upgrade transaction at `101`
- Upgrade the chain at height `102`
- Avoid the issue at height `103`
5. Ensure all validators roll back to the same height and use the same snapshot.
- The snapshot should be imported into each Validator's data directory.
- This is necessary to ensure data continuity and prevent forks.
6. Isolate the validator set from full nodes.
- This is necessary to prevent full nodes from gossiping blocks that have been rolled back.
- This may require using a firewall or a private network.
- Validators should only gossip blocks amongst themselves.
7. Start the network and perform the upgrade. For example, reiterating the process above:
- Start all Validators at height `100`
- On block `101`, submit the `MsgSoftwareUpgrade` transaction with a `Plan.height` set to `102`.
- `x/upgrade` will perform the upgrade at the start of block `102` (in the module's `PreBlocker` as of Cosmos SDK v0.50)
- If using `cosmovisor`, the node will wait to replace the binary
8. Wait for the network to reach the height of the previous ledger (`104`+)
9. Allow validators to open their network to full nodes again.
- Note that full nodes will need to perform the rollback or use a snapshot as well.
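
The height arithmetic used in the example above (halt at `103`, roll back to `100`) can be sketched in Go as follows. `rollbackPlan` and `planRollback` are hypothetical helpers and are not part of `poktrolld`; the three-block margin is the one described in step 4.

```go
package main

import (
	"errors"
	"fmt"
)

// rollbackPlan is a hypothetical summary of the heights involved in a rollback.
type rollbackPlan struct {
	RollbackHeight  int64 // height every validator restarts from
	UpgradeTxHeight int64 // block in which the upgrade transaction is submitted
	UpgradeHeight   int64 // height set in the upgrade plan
	HaltHeight      int64 // height at which the chain originally halted
}

// planRollback derives the heights from the halt height using a three-block margin.
func planRollback(haltHeight int64) (rollbackPlan, error) {
	if haltHeight < 4 {
		return rollbackPlan{}, errors.New("halt height too low to roll back three blocks")
	}
	return rollbackPlan{
		RollbackHeight:  haltHeight - 3,
		UpgradeTxHeight: haltHeight - 2,
		UpgradeHeight:   haltHeight - 1,
		HaltHeight:      haltHeight,
	}, nil
}

func main() {
	plan, err := planRollback(103)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", plan)
}
```

Agreeing on these numbers ahead of time helps every validator operator target the same heights during the coordinated restart.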

#### Step 5: Data rollback - retrieving snapshot at a specific height

There are two ways to get a snapshot from a prior height:

1. Use `poktrolld rollback --hard` repeatedly until the command responds with the desired block number.
Member

Create a ```bash block so it's easier to copy paste

Member Author

Added, but honestly, I prefer one ` here.

Member

We can change it back if need be.

You know how they say "the customer is always right"?

I'm the customer of this document.

2. Use a snapshot and start the node with the `--halt-height=100` flag so that it only syncs up to that height and then shuts down gracefully.
okdas marked this conversation as resolved.

#### Step 6: Validator Isolation - risk mitigation

- Having even one node with knowledge of the forked ledger can jeopardize the whole process. In particular, the following errors are a sign that nodes are propagating existing blocks:
Member

I copy-pasted this here so the section above is easier to read (i.e. less cognitive overhead). Please clean up

- `found conflicting vote from ourselves; did you unsafe_reset a validator?`
- `conflicting votes from validator`
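
A minimal Go sketch for spotting these messages in node logs, assuming they appear verbatim within a log line. `isConflictingVoteLog` is a hypothetical helper, and the sample log lines in `main` are made up for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// conflictingVoteMarkers holds the error fragments listed in this section.
var conflictingVoteMarkers = []string{
	"found conflicting vote from ourselves",
	"conflicting votes from validator",
}

// isConflictingVoteLog reports whether a log line contains one of the markers.
func isConflictingVoteLog(line string) bool {
	for _, marker := range conflictingVoteMarkers {
		if strings.Contains(line, marker) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample lines; real log formatting may differ.
	samples := []string{
		"ERR found conflicting vote from ourselves; did you unsafe_reset a validator?",
		"INF committed state height=101",
	}
	for _, line := range samples {
		fmt.Printf("%t: %s\n", isConflictingVoteLog(line), line)
	}
}
```

A check like this could feed an alert while validators are isolated, so that a leaking peer is noticed early.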
49 changes: 49 additions & 0 deletions docusaurus/docs/protocol/upgrades/contigency_plans.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
---
title: Failed upgrade contingency plan
sidebar_position: 5
---

# Contingency plans <!-- omit in toc -->

There's always a chance the upgrade will fail. We have prepared some contingency plans, so we can recover without significant downtime.

:::tip

This documentation covers failed upgrade contingency for `poktroll` - a `cosmos-sdk` based chain. While this can be helpful for other blockchain networks, it is not guaranteed to work for other chains.

:::

- [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached)
- [Option 1: The upgrade height is reached and the migration didn't start](#option-1-the-upgrade-height-is-reached-and-the-migration-didnt-start)
- [Option 2: The migration is stuck](#option-2-the-migration-is-stuck)
- [Option 3: The network is stuck at the future height after the upgrade](#option-3-the-network-is-stuck-at-the-future-height-after-the-upgrade)


## Option 0: The bug is discovered before the upgrade height is reached

Cancel the upgrade plan: [how](./upgrade_procedure.md#cancelling-the-upgrade-plan).

Check warning on line 24 in docusaurus/docs/protocol/upgrades/contigency_plans.md

GitHub Actions / misspell

[misspell] docusaurus/docs/protocol/upgrades/contigency_plans.md#L24

"cancelling" is a misspelling of "canceling"

## Option 1: The upgrade height is reached and the migration didn't start

If the nodes on the network stopped at the upgrade height and the migration has not started (i.e. there are no logs indicating that the upgrade handler and store migrations are being executed), we should gather social consensus to restart validators with the `--unsafe-skip-upgrades=$upgradeHeightNumber` flag. This skips the upgrade process, allowing the chain to continue and the protocol team to plan another release.

`--unsafe-skip-upgrades` simply skips the upgrade handler and store migrations, and the chain continues as if the upgrade plan was never set. The upgrade needs to be fixed, and then a new plan needs to be submitted to the network.

:::caution
`--unsafe-skip-upgrades` needs to be documented and added to the scripts so that the next time somebody syncs the network from genesis, the failed upgrade is skipped automatically.

<!-- TODO: new cosmovisor UX can simplify this -->
:::

## Option 2: The migration is stuck

If the migration is stuck, there's always a chance the state has been mutated for the upgrade but the migration didn't complete. In such a case, we need to:

- Roll back validators to the backup (a snapshot is taken by `cosmovisor` automatically prior to upgrade, if `UNSAFE_SKIP_BACKUP` is set to `false`).
- Skip the upgrade handler and store migrations with `--unsafe-skip-upgrades=$upgradeHeightNumber`.
Member

Please show full command (or link to script).

I don't know what this means

Member Author

Just like with --halt-height= there's no command that will work for all use cases. I'll add an example, but it's not very usable.

Member

#PUC

  • Bullet points
  • Consider making it a separate section below

- Document and add `--unsafe-skip-upgrades=$upgradeHeightNumber` to the scripts so that the next time somebody syncs the network from genesis, the failed upgrade is skipped automatically.
- Resolve the issue with an upgrade and schedule another plan.

## Option 3: The network is stuck at the future height after the upgrade

This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues.
13 changes: 6 additions & 7 deletions docusaurus/docs/protocol/upgrades/release_process.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,13 +16,6 @@ sidebar_position: 4
This document is for the Pocket Network protocol team's internal use only.
:::

- [1. Determine if the Release is Consensus-Breaking](#1-determine-if-the-release-is-consensus-breaking)
- [2. Create a GitHub Release](#2-create-a-github-release)
- [Legend](#legend)
- [3. Write an Upgrade Plan](#3-write-an-upgrade-plan)
- [4. Issue Upgrade on TestNet](#4-issue-upgrade-on-testnet)
- [5. Issue Upgrade on MainNet](#5-issue-upgrade-on-mainnet)

### 1. Determine if the Release is Consensus-Breaking

:::note
Expand Down Expand Up @@ -59,12 +52,18 @@ You can find an example [here](https://github.com/pokt-network/poktroll/releases
```text
## Protocol Upgrades

<!--
IMPORTANT: If this release will be used to issue an upgrade on the network, add a link to the upgrade code
such as https://github.com/pokt-network/poktroll/blob/main/app/upgrades/historical.go#L51.
-->

- **Planned Upgrade:** ❌ Not applicable for this release.
- **Breaking Change:** ❌ Not applicable for this release.
- **Manual Intervention Required:** ✅ Yes, but only for Alpha TestNet participants. If you are participating, please follow the [instructions provided here](https://dev.poktroll.com/operate/quickstart/docker_compose_walkthrough#restarting-a-full-node-after-re-genesis-) for restarting your full node after re-genesis.
- **Upgrade Height:** ❌ Not applicable for this release.

## What's Changed

<!-- GitHub Release Notes continue here -->
```

Expand Down
19 changes: 7 additions & 12 deletions docusaurus/docs/protocol/upgrades/upgrade_list.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ sidebar_position: 1
The tables below provide a list of past and upcoming protocol upgrades. For more detailed information about what upgrades are, how they work, and what changes they bring to the protocol, please refer to our [upgrade overview page](./protocol_upgrades.md).

- [Legend](#legend)
- [TestNet](#testnet)
- [Alpha TestNet](#alpha-testnet)
- [MainNet](#mainnet)

## Legend
Expand All @@ -18,20 +18,15 @@ The tables below provide a list of past and upcoming protocol upgrades. For more
- ❓ - Unknown/To Be Determined
- ⚠️ - Warning/Caution Required

## TestNet

:::warning
This table is currently incomplete and does not include all protocol upgrades. Our recent TestNet upgrades, which were performed via a regenesis, are not listed here.
:::
## Alpha TestNet

<!-- DEVELOPER: if important information about the release changes (e.g. the upgrade height), make sure to update the information in the GitHub release as well. -->

| Version | Planned | Breaking | Requires Manual Intervention | Upgrade Height |
| ------------------------------------------------------------------------ | :-----: | :------: | :---------------------------------: | -------------- |
| [`v0.0.7`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.7) | ❓ | ❓ | ✅ (Alpha TestNet Participants Only) | ❓ |
| [`v0.0.6`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.6) | ❓ | ❓ | ✅ (Alpha TestNet Participants Only) | ❓ |
| [`v0.0.5`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.5) | ❓ | ❓ | ✅ (Alpha TestNet Participants Only) | ❓ |
| [`v0.0.4`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.4) | ❓ | ❓ | ✅ (Alpha TestNet Participants Only) | ❓ |
| Version | Planned | Breaking | Requires Manual Intervention | Upgrade Height |
| ---------------------------------------------------------------------------- | :-----: | :------: | :-------------------------------: | -------------- |
| [`v0.0.9-3`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.9-3) | ❌ | ✅ | ⚠️ Alpha TestNet Participants Only | `17102` |
| [`v0.0.9`](https://github.com/pokt-network/poktroll/releases/tag/v0.0.9) | ❓ | ❓ | N/A: genesis version | ❓ |


## MainNet

Expand Down