sled-agent: don't block during instance creation request from nexus #4691
Conversation
lifning commented Dec 14, 2023 (edited)
- alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out #3927), by making the zone installation not happen during a request handler, and instead having sled-agent report the result of instance creation back to nexus with an internal cpapi call (see the description of Add some unit tests for sled-agent Instance creation #4489, which is now rolled into this PR, for more detail)
- in this world where instance create calls to nexus no longer block on sled-agent's propolis zone installation finishing, care must be taken in e.g. integ tests not to assume you can, say, connect to the instance as soon as the instance create call returns. accordingly, integ tests have been changed to poll until the instance is ready (a rough sketch of that kind of polling follows below).
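(For illustration only: a minimal sketch of the kind of polling an integration test now needs. wait_until_ready, the one-second interval, and the closure are invented here, not the actual helpers this PR adds.)

use std::time::Duration;

/// Poll `check` until it returns true or `max_wait` elapses.
/// `check` stands in for e.g. "fetch the instance and see if it is Running yet".
async fn wait_until_ready<F, Fut>(mut check: F, max_wait: Duration) -> Result<(), String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = bool>,
{
    let deadline = tokio::time::Instant::now() + max_wait;
    while tokio::time::Instant::now() < deadline {
        if check().await {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    Err(format!("instance not ready after {max_wait:?}"))
}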
Thanks for putting this together! I have a few implementation-specific comments and one larger observation:
On my reading, the new code preserves Nexus's existing rule that if a Propolis zone hasn't come up within X seconds, then the instance should be marked as Failed. The differences are mostly in the value of X (120 seconds in the PR vs. 60 now) and in the mechanism by which the Failed state is applied (HTTP request timeout now vs. expiration of a sleep on a Nexus task in the PR). The biggest additional difference that I can see is that in the PR, if the instance has Failed by the time sled agent reports its status, sled agent will explicitly clean up the zone it created (though this unfortunately doesn't free up space on the sled from Nexus's perspective because of #4226).
I might just be missing something about how this works, but if I have it right, I'm worried that the new behavioral juice isn't worth the squeeze of the complexity this PR adds (in the form of an extra sled agent/Nexus round trip whose synchronization and error semantics we have to think about and in the form of extra constraints on when and how instance generation numbers must change).
What would you think of changing Nexus's behavior so that it doesn't enforce a timeout on Propolis zone startup at all? That is, on instance start:
- Nexus moves the relevant Instance to Starting and sets its VMM ID
- Nexus calls the relevant sled's "ensure registered" endpoint (this already returns quickly)
- Nexus calls the sled's "ensure state" endpoint to start the instance; this should now return right away without blocking on a response from the sled agent instance task
- Later, when the sled's Propolis zone comes up and Propolis starts, it will publish a transition to Running that will be handled by the normal means (sled agent will call cpapi_instances_put to change the VMM state, and the instance's effective state will change accordingly)
I think this is more or less what the sled agent portion of the PR does today, but without the need for an additional Nexus task to enforce the start timeout or extra internal APIs to receive status updates from sled agents.
The main downside of doing this is of course that if something does go wrong in instance startup (e.g., the sled bounces), the instance will get stuck in Starting instead of going to Failed, and it's harder (read: next to impossible) to recover an instance that's stuck in Starting. I think that may be justifiable here, though, on the following reasoning:
- Having fewer instances go to Failed outweighs the marginal risk of an instance that's stuck in Starting:
  - Failed is better than Starting, but Failed instances are still pretty painful to deal with; you can delete them, but that still leaks resources (per the discussion in and TODOs linked to "Revisit moving instances to Failed in handle_instance_put_result" #4226 and "need a way to trigger cleanup and next steps for vanished instances" #4872)
  - Sled/sled agent bounces are much rarer than long-running zone startup commands, so the pain we save from having fewer instances spuriously go to Failed outweighs the risk of them getting stuck in Starting
- Simplifying this PR frees up some complexity budget that we can spend on "Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state" #4004, which if successful would give users a way to fix all "stuck in Starting"/"stuck in Stopping" problems (not just those caused by zone startup timeouts)
WDYT of this approach?
that certainly sounds like a better compromise! and later, if we have a nexus background task monitoring all the known instances for any who've been in Starting too long, we can act on it then with the existing calls. but yeah, i'm good with throwing out the return-trip of this as long as the resulting behavioral change of "we simply don't time out instance creation at all any more" is acceptable.
sled-agent/src/instance_manager.rs
match target {
    // these may involve a long-running zone creation, so avoid HTTP
    // request timeouts by decoupling the response
    // (see InstanceRunner::put_state)
    InstanceStateRequested::MigrationTarget(_)
    | InstanceStateRequested::Running => {
        // don't error on channel being closed
        tokio::spawn(rx);
        Ok(InstancePutStateResponse { updated_runtime: None })
    }
    InstanceStateRequested::Stopped
    | InstanceStateRequested::Reboot => rx.await?,
}
with the nexus-side out-of-band timeout stuff removed (and thanks to the recent restructure of InstanceManager to have its own worker task), the crux of this change is now just this block. (the rest being support for unit-testing sled-agent's Instance code at all, and changes to tests to accommodate the different behavior)
((i went ahead and "resolved" the comments that were about code that's now been reverted))
@lifning This looks really good. The tests are super thorough. I appreciate the hassle you had to go through to set up all the params to pass into the Instance constructor!
I only had a few minor quibbles. @smklein should take a quick look at the ones that reference him. If he's fine leaving them, I think this is good to merge once my other comments are resolved.
InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // don't error on channel being closed
    tokio::spawn(rx);
That's a clever way to do this!
Nit: I would probably enhance the comment to make it a bit clearer, such as: "We don't want the sending side of the channel to see an error if we drop rx without awaiting it. Since we don't care about the response here, we go ahead and spawn rx into a task which will await it for us in the background."
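(For concreteness, the arm above might read something like this with the suggested comment folded in; this is only a sketch of the wording, not a committed change.)

InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // We don't want the sending side of the channel to see an error if we
    // drop rx without awaiting it. Since we don't care about the response
    // here, spawn rx into a task that will await it for us in the background.
    tokio::spawn(rx);
    Ok(InstancePutStateResponse { updated_runtime: None })
}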
@@ -1193,29 +1193,55 @@ impl ZoneBuilderFactory {
/// Created by [ZoneBuilderFactory].
#[derive(Default)]
pub struct ZoneBuilder<'a> {
    /// Logger to which status messages are written during zone installation.
Thanks for all the additional info!
@@ -119,6 +119,19 @@ impl VmmReservoirManagerHandle {
        }
        rx.await.map_err(|_| Error::ReplySenderDropped)?
    }

    #[cfg(test)]
    pub fn stub_for_test() -> Self {
I like it :)
};
use omicron_common::api::internal::nexus::InstanceProperties;
use sled_storage::disk::{RawDisk, SyntheticDisk};
use sled_storage::manager::FakeStorageManager;
let (put_tx, put_rx) = oneshot::channel();

inst.put_state(put_tx, InstanceStateRequested::Running)
This appears to be the real start of the code, after all the setup. I think this test would be clearer if you indicated what was going on here. I believe this call to inst.put_state is actually sending a request to start the instance to a fake propolis server, which eventually acks that the instance is running by replying on put_rx. A comment to that effect would help readers of the code.
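(Roughly the kind of comment being asked for, as a sketch only:)

// Ask the instance to start. The fake propolis server set up above will
// eventually acknowledge that the instance is running by replying on put_rx.
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");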
use illumos_utils::opte::params::DhcpConfig;
use illumos_utils::svc::__wait_for_service::Context as MockWaitForServiceContext;
use illumos_utils::zone::MockZones;
use illumos_utils::zone::__mock_MockZones::__boot::Context as MockZonesBootContext;
I believe we are largely trying to get rid of mocks, as they introduce global state across tests. I'm not sure if there is a better way to do this other than some direct fake trickery, though. CC @smklein
right, i think ideally we eventually replace all the uses of mockall in illumos_utils with fakes that we can plumb through anything that uses them - which may be a bit cumbersome, as currently a great deal of that functionality (including what's used here) consists of static methods.
though if i understand correctly, nextest runs each test case in its own process, and if so these simple stubs returning a static value shouldn't be stepping on anything substantial
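(A self-contained sketch of the fake-based approach being discussed, making no assumptions about the real illumos_utils API: ZoneApi and FakeZones are invented names for illustration. The point is that the fake is plumbed through as a value, so there is no global mock state shared across tests.)

/// Hypothetical trait abstracting the static zone operations that mockall
/// currently mocks.
trait ZoneApi: Send + Sync {
    fn boot(&self, zone: &str) -> Result<(), String>;
}

/// A fake implementation that each test constructs for itself.
#[derive(Default)]
struct FakeZones {
    booted: std::sync::Mutex<Vec<String>>,
}

impl ZoneApi for FakeZones {
    fn boot(&self, zone: &str) -> Result<(), String> {
        self.booted.lock().unwrap().push(zone.to_string());
        Ok(())
    }
}

fn main() {
    let zones = FakeZones::default();
    let api: &dyn ZoneApi = &zones;
    api.boot("oxz_propolis-server_test").unwrap();
    assert_eq!(zones.booted.lock().unwrap().len(), 1);
}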
The problem that I ran into, which was somewhat alluded to, was #4481. It doesn't appear to be a problem specifically here, though.
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");
Can we call tokio::time::advance here, so that we don't have to wait for the full TIMEOUT_DURATION? This will also allow us to increase that duration without making the success case necessarily take longer, while still giving plenty of time so we don't get flaky runs in CI.
I think to do this, you'll have to create the timeout future, then advance the time, and then await the future.
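(A minimal, self-contained sketch of that pattern, assuming tokio's "test-util" feature is enabled; TIMEOUT_DURATION here is a stand-in for the constant used in the real test.)

use std::time::Duration;
use tokio::time::{advance, pause, timeout};

#[tokio::test]
async fn times_out_without_sleeping_in_real_time() {
    // Requires tokio's "test-util" feature. Freeze the runtime clock so the
    // test never waits in wall-clock time.
    pause();

    const TIMEOUT_DURATION: Duration = Duration::from_secs(300);
    let (_put_tx, put_rx) = tokio::sync::oneshot::channel::<()>();

    // Create the timeout future first, then advance the clock, then await it.
    let fut = timeout(TIMEOUT_DURATION, put_rx);
    advance(TIMEOUT_DURATION + Duration::from_millis(1)).await;
    assert!(fut.await.is_err(), "expected the receiver to hit the timeout");
}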
sled-agent/src/instance.rs
    .await
    .expect("failed to send Instance::put_state");

timeout(TIMEOUT_DURATION, put_rx)
Same as above, let's use tokio::time::advance.
At time of writing, instance creation roughly looks like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
    - `RunningZone::install` and `Zones::boot`
    - `illumos_utils::svc::wait_for_service`
    - `self::wait_for_http_server` for propolis-server itself
  - sled-agent: `Instance::ensure_propolis_and_tasks`
  - sled-agent: spawn `Instance::monitor_state_task`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: return ok result
- nexus: `handle_instance_put_result`

Or at least, it does in the happy path. omicron#3927 saw propolis zone creation take longer than the minute allowed for nexus's call to sled-agent's `instance_put_state`. That might've looked something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
    - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
- sled-agent: [...] return... oh, they hung up. :(

To avoid this timeout being implicit at the *Dropshot configuration* layer (that is to say, we should still have *some* timeout), we could consider a small refactor to make `instance_put_state` not a blocking call -- especially since it's already sending nexus updates on its progress via out-of-band `cpapi_instances_put` calls! That might look something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: spawn {
  - sled-agent: `Instance::propolis_ensure`
    - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (blocking!)
  - sled-agent: `Instance::ensure_propolis_and_tasks`
  - sled-agent: spawn `Instance::monitor_state_task`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call
- }

(With a way for nexus to cancel an instance creation by ID, and a timeout in sled-agent itself for terminating the attempt and reporting the failure back to nexus, and a shorter threshold for logging the event of an instance creation taking a long time.)

Before such a change, though, we should really have some more tests around sled-agent's instance creation code at all! So here's a few.
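(To make the decoupled shape above concrete, here is a generic, self-contained sketch of the pattern: the request handler spawns the slow work and returns immediately, and the outcome is reported out-of-band on a channel. All names below are illustrative stand-ins, not the actual sled-agent or nexus APIs.)

use std::time::Duration;
use tokio::sync::mpsc;

/// Stand-in for the slow propolis-zone installation step.
async fn install_zone() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Ok(())
}

/// The request handler: spawn the slow work and return right away, the way
/// the real code returns an HTTP response without waiting for the zone.
async fn handle_put_state(report_tx: mpsc::Sender<Result<(), String>>) {
    tokio::spawn(async move {
        let result = install_zone().await;
        // Stand-in for reporting the outcome back to nexus (e.g. a cpapi call).
        let _ = report_tx.send(result).await;
    });
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(1);
    handle_put_state(tx).await;
    // The "nexus" side later observes the outcome out-of-band.
    println!("installation result: {:?}", rx.recv().await);
}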
Alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927) by making the zone installation not happen during a request handler. Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed, e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
Restore instance tests added in #4691 so that they work with the latest updates in #5172. The tests now use the new `StorageManagerTestHarness` from #5172 instead of the since removed `FakeStorageManager`. Additionally, the dns server code from `sled-agent/src/fakes/nexus` is used rather than the reimplementation in #4691. This shrinks the amount of code added with these tests.