Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] upgrade/chain halt recovery #837

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

[Docs] upgrade/chain halt recovery #837

wants to merge 18 commits into from

Conversation

okdas
Copy link
Member

@okdas okdas commented Sep 24, 2024

Summary

Performed the first upgrade on the Alpha TestNet. Add some documentation changes to prevent some issues in the future.

Issue

N/A

Type of change

Select one or more from the following:

Testing

  • Documentation: make docusaurus_start; only needed if you make doc changes
  • Unit Tests: make go_develop_and_test
  • LocalNet E2E Tests: make test_e2e
  • DevNet E2E Tests: Add the devnet-test-e2e label to the PR.

Sanity Checklist

  • I have tested my changes using the available tooling
  • I have commented my code
  • I have performed a self-review of my own code; both comments & source code
  • I create and reference any new tickets, if applicable
  • I have left TODOs throughout the codebase, if applicable

@okdas okdas self-assigned this Sep 24, 2024
@okdas okdas added the documentation Improvements or additions to documentation label Sep 24, 2024
@okdas okdas added this to the Shannon Beta TestNet Launch milestone Sep 24, 2024
@okdas okdas requested a review from Olshansk September 24, 2024 23:50
@okdas okdas changed the title [Upgrades] Document learnings from the past upgrade [Upgrades] Document learnings from the last upgrade Sep 25, 2024
@okdas
Copy link
Member Author

okdas commented Sep 26, 2024

Items to add:

  • Configure cosmovisor to create backups before upgrade
  • Have more bump validators to 5.
  • Instructions on how to test the upgrade locally. Thinking a script w/o LocalNet.

@Olshansk
Copy link
Member

@okdas Let's add the details here as well: https://x.com/olshansky/status/1846211059989778741

@okdas okdas marked this pull request as ready for review October 17, 2024 01:18
@okdas
Copy link
Member Author

okdas commented Oct 17, 2024

@Olshansk can I get a review on this PR, please?

I think I want to investigate/add information about the new Cosmovisor feature mentioned in cosmos-sdk slack, but I'll look into that later. :)

@okdas okdas changed the title [Upgrades] Document learnings from the last upgrade [Upgrades] Document upgrade/chain halt recovery Oct 17, 2024
@okdas okdas changed the title [Upgrades] Document upgrade/chain halt recovery [Docs] upgrade/chain halt recovery Oct 17, 2024
Copy link

The CI will now also run the e2e tests on devnet, which increases the time it takes to complete all CI checks.

You may need to run make trigger_ci to submit an empty commit that'll trigger the tests.

GCP workloads (requires changing the namespace to 837)
Grafana network dashboard for devnet-issue-837

@github-actions github-actions bot added devnet push-image CI related - pushes images to ghcr.io labels Oct 17, 2024
# Use the direct download link for the latest release
LATEST_RELEASE_URL="https://github.com/pokt-network/poktroll/releases/latest/download/poktroll_linux_${ARCH}.tar.gz"
# Get the version genesis started from
POKTROLLD_VERSION=$(curl -s https://raw.githubusercontent.com/pokt-network/pocket-network-genesis/master/poktrolld/testnet-validated.init-version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always going to point github?

#PUC or add TODO to answer the question.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased this part, but I don't fully understand what you're asking.
If you think that there's a better way, please let me know. I thought we could use the app_version from genesis.json (looks like this: "app_version": "0.0.9") but decided I'd rather store this information in a separate file.


Read more about [upgrade contingency plans](../../protocol/upgrades/contigency_plans.md).

### Manual binary replacement (preferred)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this section w/ links to the binaries for easier access.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no binaries to point to. Rephrased.


Instead, we need **social consensus** to manually replace the binary and get the chain moving.

Currently this involves synching the network from genesis breaking a way to sync the network from genesis without human interaction, but there are some plans to make the process less painful in the future.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what you're trying to say with this sentence. #PUC

Copy link
Member Author

@okdas okdas Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL that sentence makes no sense. The original was more comprehensible.

In such a case, we need to:

- Roll back validators to the backup (a snapshot is taken by `cosmovisor` automatically prior to upgrade, if `UNSAFE_SKIP_BACKUP` is set to `false`).
- Skip the upgrade handler and store migrations with `--unsafe-skip-upgrade=$upgradeHeightNumber`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please show full command (or link to script).

I don't know what this means

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like with --halt-height= there's no command that will work for all use cases. I'll add an example, but it's not very usable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#PUC

  • Bullet points
  • Consider making it a separate section below

docusaurus/docs/protocol/upgrades/contigency_plans.md Outdated Show resolved Hide resolved

- The old binary should be compiled to work before the upgrade.
- The new binary should contain the upgrade logic to be executed immediately after the node is started using the new binary.
9. Wait until the height is reached and the old node dies due to the error: `ERR UPGRADE "v0.0.9-2" NEEDED at height`, which is expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little bit of partial cleanup, but can you do another look through this and make it so copy-pastable that any idiot can follow?

For example, instead of "from the new version", just add the appropriate cd ../... in there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I know what the new version path on the developer's system? It might be a primary poktroll repo on the developer machine, it might not be. ¯\_(ツ)_/¯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an instruction where they run poktroll version to ensure it's correct?

It's one of the steps in the upgrade playbook for Morse:

Screenshot 2024-10-24 at 12 18 01 PM

docusaurus/docs/protocol/upgrades/upgrade_procedure.md Outdated Show resolved Hide resolved

- The old binary should be compiled to work before the upgrade.
- The new binary should contain the upgrade logic to be executed immediately after the node is started using the new binary.
9. Wait until the height is reached and the old node dies due to the error: `ERR UPGRADE "v0.0.9-2" NEEDED at height`, which is expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I observe the height?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all on the screen when you run the command

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#PUC? Screenshot? Video?

@Olshansk
Copy link
Member

Olshansk commented Nov 6, 2024

@okdas Friendly reminder on this when you're back.

Making it easier to inspect/find/identify chain halts on testnet would be great as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
devnet devnet-test-e2e documentation Improvements or additions to documentation push-image CI related - pushes images to ghcr.io
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

2 participants