Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(upgrader): handle unstoppable station in disaster recovery #389

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

mraszyk
Copy link
Collaborator

@mraszyk mraszyk commented Oct 18, 2024

This PR handles the case of an unstoppable station during disaster recovery:

  • the station canister must always be stopped before installing code (otherwise, it is unsafe to install new code due to open call contexts that might trigger executing unrelated callbacks on the new wasm) and if the station canister cannot be stopped (the ICP has a timeout of 5 mins for stopping), then disaster recovery fails
  • there's a new Boolean option force of disaster recovery to force disaster recovery of an unstoppable canister by taking a snapshot, uninstalling the canister (marking all open call contexts as deleted), restoring the snapshot, and resuming with installing new code (it is safe to not actually stop the canister in this case because all open call contexts are marked as deleted when uninstalling code); this makes sense for an unstoppable Orbit station because a station has all its data in stable memory that can be fully restored from a snapshot

@mraszyk mraszyk marked this pull request as ready for review November 15, 2024 17:23
@mraszyk mraszyk requested a review from a team as a code owner November 15, 2024 17:23
let system_info = get_system_info(&env, WALLET_ADMIN_USER, canister_ids.station);
assert_eq!(system_info.name, new_name);

// the call request will be "Processing" forever since we deleted its call context during disaster recovery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but we should ideally have something in place to perform some cleanup on requests that are forever processing since they are actually not processing, it's just dangling state

Comment on lines +285 to +306
let snapshot = take_canister_snapshot(TakeCanisterSnapshotArgs {
canister_id: station_canister_id,
replace_snapshot: existing_snapshots
.into_iter()
.next()
.map(|snapshot| snapshot.id),
})
.await
.map_err(|(_code, msg)| msg)?
.0;
uninstall_code(CanisterIdRecord {
canister_id: station_canister_id,
})
.await
.map_err(|(_code, msg)| msg)?;
load_canister_snapshot(LoadCanisterSnapshotArgs {
canister_id: station_canister_id,
snapshot_id: snapshot.id,
sender_canister_version: None,
})
.await
.map_err(|(_code, msg)| msg)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this would fail for some reason (e.g. because maybe the target canister would not have enough cycles), then the disaster recovery committee would need to call this endpoint again, however, because we replace that snapshot, then the second try would take a snapshot of an empty canister because we would have already uninstalled it.

To account for such cases, maybe we should store a flag in the upgrader that makes it reuse the same snapshot id on a retry, wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I'd suggest the following:

  • support taking and restoring snapshots manually in disaster recovery
  • never take a snapshot automatically: in the force mode a manually created snapshot must be specified, i.e., force: bool becomes force: Option<SnapshotId>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both look like a safer approach, and in the case of force: Option<SnapshotId>, if provided as None we would then create the snapshot? as it is the behaviour for force=true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force: None would correspond to force: false and force: Some(snapshot_id) would correspond to force: true

@mraszyk mraszyk marked this pull request as draft November 18, 2024 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants