chore(upgrader): handle unstoppable station in disaster recovery #389
base: main
Conversation
let system_info = get_system_info(&env, WALLET_ADMIN_USER, canister_ids.station);
assert_eq!(system_info.name, new_name);

// the call request will be "Processing" forever since we deleted its call context during disaster recovery
Not related to this PR, but we should ideally have something in place to clean up requests that are stuck in "Processing" forever: they are not actually processing, it's just dangling state.
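As a rough illustration of what such a cleanup could look like, here is a minimal sketch; the `Request`/`RequestStatus` types, the retention window, and the function name are hypothetical and not the station's actual request model:

```rust
// Hypothetical sketch of a cleanup for requests stuck in "Processing"
// (illustrative types only; not the station's actual request model).
enum RequestStatus {
    Processing { started_at_ns: u64 },
    Failed { reason: String },
}

struct Request {
    status: RequestStatus,
}

const PROCESSING_RETENTION_NS: u64 = 7 * 24 * 60 * 60 * 1_000_000_000; // 7 days

fn cleanup_dangling_requests(now_ns: u64, requests: &mut [Request]) {
    for request in requests.iter_mut() {
        if let RequestStatus::Processing { started_at_ns } = request.status {
            // If the call context was deleted (e.g. by uninstall_code during disaster
            // recovery), the reply will never arrive; fail the request instead of
            // leaving dangling state behind.
            if now_ns.saturating_sub(started_at_ns) > PROCESSING_RETENTION_NS {
                request.status = RequestStatus::Failed {
                    reason: "call context lost during disaster recovery".to_string(),
                };
            }
        }
    }
}
```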
// take a snapshot of the station, replacing the existing snapshot (if any)
let snapshot = take_canister_snapshot(TakeCanisterSnapshotArgs {
    canister_id: station_canister_id,
    replace_snapshot: existing_snapshots
        .into_iter()
        .next()
        .map(|snapshot| snapshot.id),
})
.await
.map_err(|(_code, msg)| msg)?
.0;
// uninstall the station code; this marks all open call contexts as deleted
uninstall_code(CanisterIdRecord {
    canister_id: station_canister_id,
})
.await
.map_err(|(_code, msg)| msg)?;
// restore the station state from the snapshot taken above
load_canister_snapshot(LoadCanisterSnapshotArgs {
    canister_id: station_canister_id,
    snapshot_id: snapshot.id,
    sender_canister_version: None,
})
.await
.map_err(|(_code, msg)| msg)?;
If this failed for some reason (e.g. because the target canister does not have enough cycles), the disaster recovery committee would need to call this endpoint again. However, because we replace that snapshot, the second try would take a snapshot of an empty canister, since we would have already uninstalled it.
To account for such cases, maybe we should store a flag in the upgrader that makes it reuse the same snapshot id on a retry, wdyt?
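As a sketch of that suggestion (the state type and function names are hypothetical, not the upgrader's actual code), the first attempt would record the snapshot id and any retry would reuse it instead of snapshotting the already uninstalled canister:

```rust
// Hypothetical sketch of the suggested retry flag (illustrative names only).
type SnapshotId = Vec<u8>;

#[derive(Default)]
struct ForcedRecoveryState {
    // Snapshot taken before the station was uninstalled on a previous attempt.
    snapshot_id: Option<SnapshotId>,
}

impl ForcedRecoveryState {
    /// Returns the snapshot id to restore from: reuses the one recorded on a previous
    /// attempt, otherwise records the freshly taken one. This avoids snapshotting an
    /// already uninstalled (empty) canister on a retry.
    fn snapshot_for_attempt(
        &mut self,
        take_fresh_snapshot: impl FnOnce() -> SnapshotId,
    ) -> SnapshotId {
        self.snapshot_id
            .get_or_insert_with(take_fresh_snapshot)
            .clone()
    }

    /// Clear the flag once disaster recovery has completed successfully.
    fn clear(&mut self) {
        self.snapshot_id = None;
    }
}
```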
Great catch! I'd suggest the following:
- support taking and restoring snapshots manually in disaster recovery
- never take a snapshot automatically: in the force mode, a manually created snapshot must be specified, i.e., `force: bool` becomes `force: Option<SnapshotId>` (see the sketch below)
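For illustration, the change could look roughly like this (the request type and field names are hypothetical, not the upgrader's actual Candid/Rust types):

```rust
// Hypothetical recovery request type, for illustration only.
type SnapshotId = Vec<u8>;

struct RecoverRequest {
    wasm_module: Vec<u8>,
    arg: Vec<u8>,
    // Before: `force: bool`, where `true` took a snapshot automatically.
    // After: forcing recovery requires naming a manually created snapshot to
    // restore from, so no snapshot is ever taken automatically.
    force: Option<SnapshotId>,
}
```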
Both look like a safer approach. In the case of `force: Option<SnapshotId>`, if it is provided as `None`, would we then create the snapshot, as is the behaviour for `force=true`?
`force: None` would correspond to `force: false` and `force: Some(snapshot_id)` would correspond to `force: true`, so no snapshot would ever be created automatically.
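In other words, roughly the following mapping (the enum and function names are illustrative, not the upgrader's actual code):

```rust
// Illustrative mapping of the proposed `force: Option<SnapshotId>` onto the old
// boolean semantics.
type SnapshotId = Vec<u8>;

enum RecoveryMode {
    // Old `force: false`: stop the station and install as usual.
    Regular,
    // Old `force: true`: uninstall the station and restore the given,
    // manually created snapshot before installing.
    Forced { snapshot_id: SnapshotId },
}

fn recovery_mode(force: Option<SnapshotId>) -> RecoveryMode {
    match force {
        None => RecoveryMode::Regular,
        Some(snapshot_id) => RecoveryMode::Forced { snapshot_id },
    }
}
```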
This PR handles the case of an unstoppable station during disaster recovery: it adds a `force` flag to disaster recovery that forces disaster recovery of an unstoppable canister by taking a snapshot, uninstalling the canister (which marks all open call contexts as deleted), restoring the snapshot, and resuming with installing the new code. It is safe to not actually stop the canister in this case because all open call contexts are marked as deleted when uninstalling code. This makes sense for an unstoppable Orbit station because a station keeps all its data in stable memory, which can be fully restored from a snapshot.
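For reference, a minimal sketch of that forced path end to end, assuming ic-cdk's management canister bindings (the module path, the function signature, and the error handling are assumptions; the upgrader's actual code is structured differently):

```rust
use candid::Principal;
use ic_cdk::api::management_canister::main::{
    install_code, load_canister_snapshot, take_canister_snapshot, uninstall_code,
    CanisterIdRecord, CanisterInstallMode, InstallCodeArgument, LoadCanisterSnapshotArgs,
    Snapshot, TakeCanisterSnapshotArgs,
};

async fn force_recover_station(
    station_canister_id: Principal,
    existing_snapshots: Vec<Snapshot>,
    mode: CanisterInstallMode,
    wasm_module: Vec<u8>,
    arg: Vec<u8>,
) -> Result<(), String> {
    // 1. Take a snapshot of the (unstoppable, still running) station, replacing an
    //    existing snapshot if there is one.
    let snapshot = take_canister_snapshot(TakeCanisterSnapshotArgs {
        canister_id: station_canister_id,
        replace_snapshot: existing_snapshots.into_iter().next().map(|s| s.id),
    })
    .await
    .map_err(|(_code, msg)| msg)?
    .0;

    // 2. Uninstall the station code; this marks all open call contexts as deleted,
    //    which is why skipping the stop is safe here.
    uninstall_code(CanisterIdRecord {
        canister_id: station_canister_id,
    })
    .await
    .map_err(|(_code, msg)| msg)?;

    // 3. Restore the station's state (including stable memory) from the snapshot.
    load_canister_snapshot(LoadCanisterSnapshotArgs {
        canister_id: station_canister_id,
        snapshot_id: snapshot.id,
        sender_canister_version: None,
    })
    .await
    .map_err(|(_code, msg)| msg)?;

    // 4. Resume with installing the new station code on top of the restored state,
    //    using whatever install mode the recovery request specified.
    install_code(InstallCodeArgument {
        mode,
        canister_id: station_canister_id,
        wasm_module,
        arg,
    })
    .await
    .map_err(|(_code, msg)| msg)
}
```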