
sled-agent: don't block during instance creation request from nexus #4691

Merged

Conversation

@lifning (Contributor) commented Dec 14, 2023

  • Alleviates request timeouts that occur when Propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out #3927) by making the zone installation happen outside the request handler, and instead having sled-agent report the result of instance creation back to Nexus with an internal cpapi call (see the description of Add some unit tests for sled-agent Instance creation #4489, which is now rolled into this PR, for more detail).
  • Now that instance-create calls to Nexus no longer block on sled-agent finishing Propolis zone installation, care must be taken (e.g. in integration tests) not to assume the instance can be connected to as soon as the instance-create call returns. Accordingly, the integration tests have been changed to poll until the instance is ready; a sketch of such polling follows.
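Since a successful create call no longer implies the zone is up, a test that previously used the instance right away now has to wait. A minimal sketch of that polling, with hypothetical names (`InstanceState` and `wait_for_instance_running` are illustrative, not the PR's actual test helpers):

```rust
use std::time::Duration;

// Hypothetical stand-in for the instance's externally visible state.
enum InstanceState {
    Starting,
    Running,
    Failed,
}

/// Poll until the instance reports Running, since a successful
/// instance-create call no longer implies the Propolis zone is up.
async fn wait_for_instance_running(
    get_state: impl Fn() -> InstanceState,
) -> Result<(), &'static str> {
    for _ in 0..60 {
        match get_state() {
            InstanceState::Running => return Ok(()),
            InstanceState::Failed => return Err("instance failed to start"),
            InstanceState::Starting => {
                tokio::time::sleep(Duration::from_secs(1)).await
            }
        }
    }
    Err("instance did not reach Running before the polling deadline")
}
```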

@lifning force-pushed the sled-agent-instance-creation-refactor branch 3 times, most recently from 14fd9b5 to 8f54cb7 on December 21, 2023 20:37
@lifning force-pushed the sled-agent-instance-creation-refactor branch 5 times, most recently from 56c85a7 to f7945e8 on January 13, 2024 03:05
@lifning force-pushed the sled-agent-instance-creation-refactor branch 2 times, most recently from b6f0150 to 7e7ed02 on January 20, 2024 07:52
@lifning changed the title from "sled-agent instance creation asynchronous refactor" to "sled-agent instance creation asynchronous rework" on Jan 20, 2024
@lifning force-pushed the sled-agent-instance-creation-refactor branch 2 times, most recently from c6539ae to 3827194 on January 24, 2024 05:29
@lifning force-pushed the sled-agent-instance-creation-refactor branch 2 times, most recently from b606a57 to b59dace on February 2, 2024 10:14
@lifning force-pushed the sled-agent-instance-creation-refactor branch 2 times, most recently from 042ece7 to 7bfa9b4 on February 15, 2024 07:42
@lifning force-pushed the sled-agent-instance-creation-refactor branch 2 times, most recently from dae6302 to d17c0a4 on March 5, 2024 07:16
@lifning marked this pull request as ready for review on March 5, 2024 09:03
@lifning changed the title from "sled-agent instance creation asynchronous rework" to "sled-agent: don't block during instance creation request from nexus" on Mar 5, 2024
@gjcolombo self-requested a review on March 6, 2024 03:25
@gjcolombo (Contributor) left a comment:
Thanks for putting this together! I have a few implementation-specific comments and one larger observation:

On my reading, the new code preserves Nexus's existing rule that if a Propolis zone hasn't come up within X seconds, then the instance should be marked as Failed. The differences are mostly in the value of X (120 seconds in the PR vs. 60 now) and in the mechanism by which the Failed state is applied (HTTP request timeout now vs. expiration of a sleep on a Nexus task in the PR). The biggest additional difference that I can see is that in the PR, if the instance has Failed by the time sled agent reports its status, sled agent will explicitly clean up the zone it created (though this unfortunately doesn't free up space on the sled from Nexus's perspective because of #4226).

I might just be missing something about how this works, but if I have it right, I'm worried that the new behavioral juice isn't worth the squeeze of the complexity this PR adds (in the form of an extra sled agent/Nexus round trip whose synchronization and error semantics we have to think about and in the form of extra constraints on when and how instance generation numbers must change).

What would you think of changing Nexus's behavior so that it doesn't enforce a timeout on Propolis zone startup at all? That is, on instance start:

  • Nexus moves the relevant Instance to Starting and sets its VMM ID
  • Nexus calls the relevant sled's "ensure registered" endpoint (this already returns quickly)
  • Nexus calls the sled's "ensure state" endpoint to start the instance; this should now return right away without blocking on a response from the sled agent instance task
  • Later, when the sled's Propolis zone comes up and Propolis starts, it will publish a transition to Running that will be handled by the normal means (sled agent will call cpapi_instances_put to change the VMM state and the instance's effective state will change accordingly)

I think this is more or less what the sled agent portion of the PR does today, but without the need for an additional Nexus task to enforce the start timeout or extra internal APIs to receive status updates from sled agents.

The main downside of doing this is of course that if something does go wrong in instance startup (e.g., the sled bounces), the instance will get stuck in Starting instead of going to Failed, and it's harder (read: next to impossible) to recover an instance that's stuck in Starting. I think that may be justifiable here, though, on the following reasoning: [...]

WDYT of this approach?

Review threads (now resolved) on:
  • sled-agent/src/params.rs (outdated)
  • nexus/src/app/instance.rs (outdated)
  • nexus/src/app/instance.rs (outdated)
  • nexus/src/app/instance.rs (outdated)
  • nexus/src/internal_api/http_entrypoints.rs (outdated)
  • illumos-utils/src/running_zone.rs
@lifning (Contributor, Author) commented Mar 7, 2024:

> Thanks for putting this together! I have a few implementation-specific comments and one larger observation:
>
> [...]
>
> I might just be missing something about how this works, but if I have it right, I'm worried that the new behavioral juice isn't worth the squeeze of the complexity this PR adds (in the form of an extra sled agent/Nexus round trip whose synchronization and error semantics we have to think about and in the form of extra constraints on when and how instance generation numbers must change).
>
> What would you think of changing Nexus's behavior so that it doesn't enforce a timeout on Propolis zone startup at all? That is, on instance start:
>
> • Nexus moves the relevant Instance to Starting and sets its VMM ID
> • Nexus calls the relevant sled's "ensure registered" endpoint (this already returns quickly)
> • Nexus calls the sled's "ensure state" endpoint to start the instance; this should now return right away without blocking on a response from the sled agent instance task
> • Later, when the sled's Propolis zone comes up and Propolis starts, it will publish a transition to Running that will be handled by the normal means (sled agent will call cpapi_instances_put to change the VMM state and the instance's effective state will change accordingly)
>
> I think this is more or less what the sled agent portion of the PR does today, but without the need for an additional Nexus task to enforce the start timeout or extra internal APIs to receive status updates from sled agents.
>
> [...]
>
> WDYT of this approach?

that certainly sounds like a better compromise! and later, if we have a nexus background task monitoring all the known instances for any who've been in Starting too long, we can act on it then with the existing calls. but yeah, i'm good with throwing out the return-trip of this as long as the resulting behavioral change of "we simply don't time out instance creation at all any more" is acceptable.
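Purely as illustration of the background task floated here (none of these names come from omicron; a real version would query the instance table and act through Nexus's existing instance-state calls), one sweep might look like:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical record of an instance observed in the Starting state.
struct StartingInstance {
    id: u64,
    starting_since: SystemTime,
}

/// One pass of a hypothetical background sweep: any instance that has
/// been Starting longer than `max_starting` is acted on via the
/// caller-supplied `mark_failed` (standing in for the existing calls).
fn sweep_stuck_starting(
    starting: &[StartingInstance],
    mark_failed: &mut impl FnMut(u64),
    max_starting: Duration,
) {
    for inst in starting {
        let elapsed =
            inst.starting_since.elapsed().unwrap_or(Duration::ZERO);
        if elapsed > max_starting {
            mark_failed(inst.id);
        }
    }
}
```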

@lifning force-pushed the sled-agent-instance-creation-refactor branch from d17c0a4 to 29544e2 on March 13, 2024 10:46
Comment on lines 198 to 213

```rust
match target {
    // these may involve a long-running zone creation, so avoid HTTP
    // request timeouts by decoupling the response
    // (see InstanceRunner::put_state)
    InstanceStateRequested::MigrationTarget(_)
    | InstanceStateRequested::Running => {
        // don't error on channel being closed
        tokio::spawn(rx);
        Ok(InstancePutStateResponse { updated_runtime: None })
    }
    InstanceStateRequested::Stopped
    | InstanceStateRequested::Reboot => rx.await?,
}
```
@lifning (Contributor, Author) commented Mar 14, 2024:
with the nexus-side out-of-band timeout stuff removed (and thanks to the recent restructure of InstanceManager to have its own worker task), the crux of this change is now just this block. (the rest being support for unit-testing sled-agent's Instance code at all, and changes to tests to accommodate the different behavior)

((i went ahead and "resolved" the comments that were about code that's now been reverted))

@andrewjstone (Contributor) left a comment:
@lifning This looks really good. The tests are super thorough. I appreciate the hassle you had to go through to set up all the params to pass into the Instance constructor!

I only had a few minor quibbles. @smklein should take a quick look at the ones that reference him; if he's fine leaving them, I think this is good to merge once my other comments are resolved.

```rust
InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // don't error on channel being closed
    tokio::spawn(rx);
```
Contributor commented:
That's a clever way to do this!

Nit: I would probably enhance the comment to make it a bit clearer, such as: We don't want the sending side of the channel to see an error if we drop rx without awaiting it. Since we don't care about the response here, we go ahead and spawn rx into a task which will await it for us in the background.
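Applied to the arm above, that might read:

```rust
InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // We don't want the sending side of the channel to see an error
    // if we drop rx without awaiting it. Since we don't care about
    // the response here, we go ahead and spawn rx into a task which
    // will await it for us in the background.
    tokio::spawn(rx);
    Ok(InstancePutStateResponse { updated_runtime: None })
}
```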

```diff
@@ -1193,29 +1193,55 @@ impl ZoneBuilderFactory {
 /// Created by [ZoneBuilderFactory].
 #[derive(Default)]
 pub struct ZoneBuilder<'a> {
     /// Logger to which status messages are written during zone installation.
```
Contributor commented:

Thanks for all the additional info!

```diff
@@ -119,6 +119,19 @@ impl VmmReservoirManagerHandle {
     }
     rx.await.map_err(|_| Error::ReplySenderDropped)?
 }

 #[cfg(test)]
 pub fn stub_for_test() -> Self {
```
Contributor commented:

I like it :)

```rust
};
use omicron_common::api::internal::nexus::InstanceProperties;
use sled_storage::disk::{RawDisk, SyntheticDisk};
use sled_storage::manager::FakeStorageManager;
```
Contributor commented:

@smklein is removing FakeStorageManager in #5172. I don't believe this PR should necessarily wait for that, but I want to put it on Sean's radar.


```rust
let (put_tx, put_rx) = oneshot::channel();

inst.put_state(put_tx, InstanceStateRequested::Running)
```
Contributor commented:

This appears to be the real start of the code, after all the setup. I think this test would be clearer if you indicated what was going on here. I believe this call to inst.put_state is actually sending a request to start the instance to a fake propolis server which eventually acks that the instance is running by replying on put_rx. A comment to that effect would help readers of the code.
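Concretely, the call site above might carry a comment along these lines (wording illustrative, per the suggestion):

```rust
let (put_tx, put_rx) = oneshot::channel();

// Send a request to start the instance. Under the hood this goes to a
// fake propolis server, which eventually acks that the instance is
// running by replying on put_rx.
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");
```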

```rust
use illumos_utils::opte::params::DhcpConfig;
use illumos_utils::svc::__wait_for_service::Context as MockWaitForServiceContext;
use illumos_utils::zone::MockZones;
use illumos_utils::zone::__mock_MockZones::__boot::Context as MockZonesBootContext;
```
Contributor commented:

I believe we are largely trying to get rid of mocks, as they introduce global state across tests. I'm not sure if there is a better way to do this other than some direct fake trickery, though. CC @smklein

@lifning (Contributor, Author) commented:

right, i think ideally we eventually replace all the uses of mockall in illumos_utils with fakes that we can plumb through anything that uses them - which may be a bit cumbersome, as currently a great deal of that functionality (including what's used here) consists of static methods.

though if i understand correctly, nextest runs each test case in its own process, and if so these simple stubs returning a static value shouldn't be stepping on anything substantial
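For readers unfamiliar with why these mocks are global: mockall installs expectations for static methods through a process-wide context guard, so two tests mocking the same method in one process can interfere, and per-process runners like cargo-nextest sidestep that. A self-contained sketch of the pattern, using a local stand-in trait rather than the real illumos_utils types:

```rust
use mockall::automock;

#[automock]
trait Zones {
    // Static (no &self) method, like the Zones::boot used above.
    fn boot(zone: &str) -> Result<(), String>;
}

#[test]
fn boots_zone() {
    // The context guard installs a process-global expectation: while
    // `ctx` lives, every call to MockZones::boot in this process sees
    // it, which is why parallel same-process tests can step on each
    // other.
    let ctx = MockZones::boot_context();
    ctx.expect().returning(|_| Ok(()));
    assert_eq!(MockZones::boot("oxz_propolis"), Ok(()));
}
```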

Contributor commented:

The problem I ran into, which was somewhat alluded to above, was #4481. It doesn't appear to be a problem specifically here, though.

```rust
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");
```
Contributor commented:

Can we call tokio::time::advance here, so that we don't have to wait for the full TIMEOUT_DURATION? This will also allow us to increase that duration without making the success case take longer, while still giving plenty of time so we don't get flaky runs in CI.

I think to do this, you'll have to create the timeout future, then advance the time, and then await the future.
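A sketch of that pattern (assuming tokio's `test-util` feature is enabled; `TIMEOUT_DURATION` and the pending future stand in for the test's real constant and `put_rx`):

```rust
use std::future;
use std::time::Duration;
use tokio::time::{advance, timeout};

const TIMEOUT_DURATION: Duration = Duration::from_secs(300);

// `start_paused = true` freezes tokio's clock at test start.
#[tokio::test(start_paused = true)]
async fn times_out_without_waiting() {
    // Create the timeout future first; its deadline is captured now.
    let fut = timeout(TIMEOUT_DURATION, future::pending::<()>());

    // Jump the paused clock past the deadline; no wall-clock time passes.
    advance(TIMEOUT_DURATION + Duration::from_secs(1)).await;

    // The timeout future now resolves immediately with Elapsed.
    assert!(fut.await.is_err());
}
```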

```rust
    .await
    .expect("failed to send Instance::put_state");

timeout(TIMEOUT_DURATION, put_rx)
```
Contributor commented:

Same as above, let's use tokio::time::advance.

At time of writing, instance creation roughly looks like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
        - `illumos_utils::svc::wait_for_service`
        - `self::wait_for_http_server` for propolis-server itself
      - sled-agent: `Instance::ensure_propolis_and_tasks`
        - sled-agent: spawn `Instance::monitor_state_task`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
    - sled-agent: return ok result
- nexus: `handle_instance_put_result`

Or at least, it does in the happy path. omicron#3927 saw propolis zone
creation take longer than the minute allowed for nexus's call to
sled-agent's `instance_put_state`. That might've looked something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: `Instance::propolis_ensure`
      - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
      - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
        - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
    - sled-agent: [...] return... oh, they hung up. :(

To avoid this timeout being implicit at the *Dropshot configuration*
layer (that is to say, we should still have *some* timeout),
we could consider a small refactor to make `instance_put_state` not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band `cpapi_instances_put` calls! That might look
something like:

- nexus -> sled-agent: `instance_put_state`
  - sled-agent: `InstanceManager::ensure_state`
    - sled-agent: spawn {
      - sled-agent: `Instance::propolis_ensure`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent: `Instance::setup_propolis_locked` (blocking!)
        - sled-agent: `Instance::ensure_propolis_and_tasks`
          - sled-agent: spawn `Instance::monitor_state_task`
        - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)

Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here are a few.
lif added 7 commits March 16, 2024 03:50
Alleviating request timeouts occurring when propolis zone installation takes too long
(Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927)
by making the zone installation not happen during a request handler.

Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed.
For example, test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
@lifning force-pushed the sled-agent-instance-creation-refactor branch from 67efdae to d5da2d4 on March 16, 2024 10:51
@lifning merged commit f8c8bb1 into oxidecomputer:main on Mar 19, 2024
15 checks passed
andrewjstone added a commit that referenced this pull request Mar 20, 2024
Restore instance tests added in #4691 so that they work with the latest
updates in #5172. The tests now use the new `StorageManagerTestHarness`
from #5172 instead of the since removed `FakeStorageManager`.
Additionally, the dns server code from `sled-agent/src/fakes/nexus` is
used rather than the reimplementation in #4691. This shrinks the amount
of code added with these tests.
Successfully merging this pull request may close these issues:

  • Propolis zone installation took 81 seconds and caused instance start to time out (#3927)