chore(upgrader): handle unstoppable station in disaster recovery #389

mraszyk · 2024-10-18T10:53:19Z

This PR handles the case of an unstoppable station during disaster recovery:

The station canister must always be stopped before installing code. Otherwise, open call contexts might trigger unrelated callbacks on the new wasm, which makes installing new code unsafe. If the station canister cannot be stopped (ICP applies a timeout of 5 minutes to stopping), disaster recovery fails.
A new Boolean option force of disaster recovery forces recovery of an unstoppable canister by taking a snapshot, uninstalling the canister (which marks all open call contexts as deleted), restoring the snapshot, and then installing the new code. Not actually stopping the canister is safe in this case because all open call contexts are marked as deleted when code is uninstalled. This makes sense for an unstoppable Orbit station because a station keeps all its data in stable memory, which can be fully restored from a snapshot.
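The two cases above can be summarized as a small decision function. This is only a sketch of the control flow as described here, not the upgrader's actual code: the `Step` and `Plan` types and the `recovery_plan` function are hypothetical names, and it assumes the stop is always attempted first and that force only matters once the stop has failed.

```rust
/// Steps performed after the attempt to stop the station (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    TakeSnapshot,
    UninstallCode,
    LoadSnapshot,
    InstallCode,
}

/// Outcome of planning a disaster recovery run (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Plan {
    Proceed(Vec<Step>),
    Fail(&'static str),
}

/// Decide what to do once the stop attempt has either succeeded or timed out.
fn recovery_plan(stop_succeeded: bool, force: bool) -> Plan {
    match (stop_succeeded, force) {
        // Stopped cleanly: no open call contexts remain, so installing is safe.
        (true, _) => Plan::Proceed(vec![Step::InstallCode]),
        // Unstoppable but forced: uninstalling deletes the open call contexts,
        // and the snapshot brings the data back before the install.
        (false, true) => Plan::Proceed(vec![
            Step::TakeSnapshot,
            Step::UninstallCode,
            Step::LoadSnapshot,
            Step::InstallCode,
        ]),
        // Unstoppable and not forced: disaster recovery fails.
        (false, false) => Plan::Fail("station could not be stopped"),
    }
}
```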

keplervital · 2024-11-18T09:52:40Z

tests/integration/src/disaster_recovery_tests.rs

+        let system_info = get_system_info(&env, WALLET_ADMIN_USER, canister_ids.station);
+        assert_eq!(system_info.name, new_name);
+
+        // the call request will be "Processing" forever since we deleted its call context during disaster recovery


Not related to this PR, but ideally we should have something in place to clean up requests that stay in Processing forever: they are not actually processing, it's just dangling state.

keplervital · 2024-11-18T09:57:59Z

core/upgrader/impl/src/services/disaster_recovery.rs

+                let snapshot = take_canister_snapshot(TakeCanisterSnapshotArgs {
+                    canister_id: station_canister_id,
+                    replace_snapshot: existing_snapshots
+                        .into_iter()
+                        .next()
+                        .map(|snapshot| snapshot.id),
+                })
+                .await
+                .map_err(|(_code, msg)| msg)?
+                .0;
+                uninstall_code(CanisterIdRecord {
+                    canister_id: station_canister_id,
+                })
+                .await
+                .map_err(|(_code, msg)| msg)?;
+                load_canister_snapshot(LoadCanisterSnapshotArgs {
+                    canister_id: station_canister_id,
+                    snapshot_id: snapshot.id,
+                    sender_canister_version: None,
+                })
+                .await
+                .map_err(|(_code, msg)| msg)?;


If this were to fail for some reason (e.g. because the target canister does not have enough cycles), the disaster recovery committee would need to call this endpoint again. However, because we replace that snapshot, the second try would take a snapshot of an empty canister, since we would have already uninstalled it.

To account for such cases, maybe we should store a flag in the upgrader that makes it reuse the same snapshot id on a retry, wdyt?
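The failure mode described in this comment can be reproduced with a toy model. This is a sketch under stated assumptions: `ToyCanister` and its methods are hypothetical and only mimic the sequence in the diff (take a snapshot that replaces the previous one, uninstall, load). It does not call or reflect the real management canister API.

```rust
/// Toy canister: `state == None` means the code and data were uninstalled.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ToyCanister {
    state: Option<String>,
    /// Single snapshot slot, overwritten on every take (mimics replace_snapshot).
    snapshot: Option<Option<String>>,
}

impl ToyCanister {
    fn take_snapshot(&mut self) {
        self.snapshot = Some(self.state.clone());
    }

    fn uninstall(&mut self) {
        self.state = None;
    }

    fn load_snapshot(&mut self) -> Result<(), String> {
        match self.snapshot.clone() {
            Some(saved) => {
                self.state = saved;
                Ok(())
            }
            None => Err("no snapshot".to_string()),
        }
    }
}

/// One forced recovery attempt; `load_fails` simulates an error while
/// loading the snapshot (for example, not enough cycles).
fn forced_recovery(c: &mut ToyCanister, load_fails: bool) -> Result<(), String> {
    c.take_snapshot();
    c.uninstall();
    if load_fails {
        return Err("loading the snapshot failed".to_string());
    }
    c.load_snapshot()
}

/// First attempt fails after the uninstall; the retry then snapshots the
/// already empty canister and overwrites the only good snapshot.
fn retry_after_failed_load() -> ToyCanister {
    let mut c = ToyCanister {
        state: Some("station data".to_string()),
        snapshot: None,
    };
    assert!(forced_recovery(&mut c, true).is_err());
    let _ = forced_recovery(&mut c, false);
    c
}
```

In this model the retry reports success, yet the canister ends up empty because the second snapshot captured the uninstalled state.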

mraszyk · 2024-11-18

Great catch! I'd suggest the following:

support taking and restoring snapshots manually in disaster recovery

never take a snapshot automatically: in the force mode a manually created snapshot must be specified, i.e., force: bool becomes force: Option<SnapshotId>

keplervital · 2024-11-18

Both look like a safer approach. In the case of force: Option<SnapshotId>, if it is provided as None, would we then create the snapshot, as is the behaviour for force=true?

mraszyk · 2024-11-18

force: None would correspond to force: false and force: Some(snapshot_id) would correspond to force: true
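The proposed semantics can be written down as a small mapping. This is a minimal sketch, not code from the PR: `SnapshotId` is aliased to a byte vector purely for illustration, and `OnUnstoppable` and `on_unstoppable` are hypothetical names.

```rust
/// Hypothetical stand-in for the snapshot identifier type.
type SnapshotId = Vec<u8>;

/// What disaster recovery does when the station cannot be stopped.
#[derive(Debug, PartialEq, Eq)]
enum OnUnstoppable {
    /// Equivalent to the old force: false.
    Fail,
    /// Equivalent to the old force: true, but restoring a snapshot that was
    /// created manually beforehand instead of taking one automatically.
    RestoreFrom(SnapshotId),
}

fn on_unstoppable(force: Option<SnapshotId>) -> OnUnstoppable {
    match force {
        None => OnUnstoppable::Fail,
        Some(snapshot_id) => OnUnstoppable::RestoreFrom(snapshot_id),
    }
}
```

Because the snapshot is supplied by the caller, a retry after a partial failure would restore the same snapshot rather than overwrite it.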

mraszyk added 12 commits October 18, 2024 12:52

chore(upgrader): handle unstoppable station in disaster recovery

d7b4aa9

.

0b1a8da

do not stop after uninstall code

0eeb4fc

fixes

9a2e9e2

Merge branch 'main' into mraszyk/unstoppable-disaster-recovery

9809d50

harden tests

a1aeb34

Merge branch 'main' into mraszyk/unstoppable-disaster-recovery

27335fc

marge

5734426

Merge branch 'main' into mraszyk/unstoppable-disaster-recovery

4465b0d

fix tests

e5f09d4

rename force_stop

36e5f1a

Merge branch 'main' into mraszyk/unstoppable-disaster-recovery

b5094d5

mraszyk marked this pull request as ready for review November 15, 2024 17:23

mraszyk requested a review from a team as a code owner November 15, 2024 17:23

keplervital reviewed Nov 18, 2024

mraszyk marked this pull request as draft November 18, 2024 13:38
