sled-agent: don't block during instance creation request from nexus #4691
Conversation
lifning commented Dec 14, 2023 (edited)
- alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out #3927), by making the zone installation not happen during a request handler, and instead having sled-agent report the result of instance creation back to nexus with an internal cpapi call (see the description of Add some unit tests for sled-agent Instance creation #4489, which is now rolled into this PR, for more detail)
- in this world where instance create calls to nexus no longer block on sled-agent's propolis zone installation finishing, care must be taken in e.g. integ tests not to assume you can, say, connect to the instance as soon as the instance create call returns. accordingly, integ tests have been changed to poll until the instance is ready (a rough sketch of that kind of polling follows below).
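(For illustration only: a minimal sketch of the kind of polling an integration test now needs. wait_until_ready, the one-second interval, and the closure are invented here, not the actual helpers this PR adds.)

use std::time::Duration;

/// Poll `check` until it returns true or `max_wait` elapses.
/// `check` stands in for e.g. "fetch the instance and see if it is Running yet".
async fn wait_until_ready<F, Fut>(mut check: F, max_wait: Duration) -> Result<(), String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = bool>,
{
    let deadline = tokio::time::Instant::now() + max_wait;
    while tokio::time::Instant::now() < deadline {
        if check().await {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    Err(format!("instance not ready after {max_wait:?}"))
}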
Thanks for putting this together! I have a few implementation-specific comments and one larger observation:
On my reading, the new code preserves Nexus's existing rule that if a Propolis zone hasn't come up within X seconds, then the instance should be marked as Failed. The differences are mostly in the value of X (120 seconds in the PR vs. 60 now) and in the mechanism by which the Failed state is applied (HTTP request timeout now vs. expiration of a sleep on a Nexus task in the PR). The biggest additional difference that I can see is that in the PR, if the instance has Failed by the time sled agent reports its status, sled agent will explicitly clean up the zone it created (though this unfortunately doesn't free up space on the sled from Nexus's perspective because of #4226).
I might just be missing something about how this works, but if I have it right, I'm worried that the new behavioral juice isn't worth the squeeze of the complexity this PR adds (in the form of an extra sled agent/Nexus round trip whose synchronization and error semantics we have to think about and in the form of extra constraints on when and how instance generation numbers must change).
What would you think of changing Nexus's behavior so that it doesn't enforce a timeout on Propolis zone startup at all? That is, on instance start:
- Nexus moves the relevant Instance to Starting and sets its VMM ID
- Nexus calls the relevant sled's "ensure registered" endpoint (this already returns quickly)
- Nexus calls the sled's "ensure state" endpoint to start the instance; this should now return right away without blocking on a response from the sled agent instance task
- Later, when the sled's Propolis zone comes up and Propolis starts, it will publish a transition to Running that will be handled by the normal means (sled agent will call cpapi_instances_put to change the VMM state, and the instance's effective state will change accordingly)
I think this is more or less what the sled agent portion of the PR does today, but without the need for an additional Nexus task to enforce the start timeout or extra internal APIs to receive status updates from sled agents.
The main downside of doing this is of course that if something does go wrong in instance startup (e.g., the sled bounces), the instance will get stuck in Starting instead of going to Failed, and it's harder (read: next to impossible) to recover an instance that's stuck in Starting. I think that may be justifiable here, though, on the following reasoning:
- Having fewer instances go to Failed outweighs the marginal risk of an instance that's stuck in Starting:
  - Failed is better than Starting, but Failed instances are still pretty painful to deal with; you can delete them, but that still leaks resources (per the discussion in and TODOs linked to "Revisit moving instances to Failed in handle_instance_put_result" #4226 and "need a way to trigger cleanup and next steps for vanished instances" #4872)
  - Sled/sled agent bounces are much rarer than long-running zone startup commands, so the pain we save from having fewer instances spuriously go to Failed outweighs the risk of them getting stuck in Starting
- Simplifying this PR frees up some complexity budget that we can spend on "Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state" #4004, which if successful would give users a way to fix all "stuck in Starting"/"stuck in Stopping" problems (not just those caused by zone startup timeouts)
WDYT of this approach?
that certainly sounds like a better compromise! and later, if we have a nexus background task monitoring all the known instances for any who've been in Starting too long, we can act on it then with the existing calls. but yeah, i'm good with throwing out the return-trip of this as long as the resulting behavioral change of "we simply don't time out instance creation at all any more" is acceptable.
sled-agent/src/instance_manager.rs
match target {
    // these may involve a long-running zone creation, so avoid HTTP
    // request timeouts by decoupling the response
    // (see InstanceRunner::put_state)
    InstanceStateRequested::MigrationTarget(_)
    | InstanceStateRequested::Running => {
        // don't error on channel being closed
        tokio::spawn(rx);
        Ok(InstancePutStateResponse { updated_runtime: None })
    }
    InstanceStateRequested::Stopped
    | InstanceStateRequested::Reboot => rx.await?,
}
with the nexus-side out-of-band timeout stuff removed (and thanks to the recent restructure of InstanceManager to have its own worker task), the crux of this change is now just this block. (the rest being support for unit-testing sled-agent's Instance code at all, and changes to tests to accommodate the different behavior)
((i went ahead and "resolved" the comments that were about code that's now been reverted))
@lifning This looks really good. The tests are super thorough. I appreciate the hassle you had to go through to set up all the params to pass into the Instance constructor!
I only had a few minor quibbles. @smklein should take a quick look at the ones that reference him. If he's fine leaving them, I think this is good to merge once my other comments are resolved.
InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // don't error on channel being closed
    tokio::spawn(rx);
That's a clever way to do this!
Nit: I would probably enhance the comment to make it a bit clearer, such as: "We don't want the sending side of the channel to see an error if we drop rx without awaiting it. Since we don't care about the response here, we go ahead and spawn rx into a task which will await it for us in the background."
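(For concreteness, the arm above might read something like this with the suggested comment folded in; this is only a sketch of the wording, not a committed change.)

InstanceStateRequested::MigrationTarget(_)
| InstanceStateRequested::Running => {
    // We don't want the sending side of the channel to see an error if we
    // drop rx without awaiting it. Since we don't care about the response
    // here, spawn rx into a task that will await it for us in the background.
    tokio::spawn(rx);
    Ok(InstancePutStateResponse { updated_runtime: None })
}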
@@ -1193,29 +1193,55 @@ impl ZoneBuilderFactory {
/// Created by [ZoneBuilderFactory].
#[derive(Default)]
pub struct ZoneBuilder<'a> {
    /// Logger to which status messages are written during zone installation.
Thanks for all the additional info!
@@ -119,6 +119,19 @@ impl VmmReservoirManagerHandle {
        }
        rx.await.map_err(|_| Error::ReplySenderDropped)?
    }

    #[cfg(test)]
    pub fn stub_for_test() -> Self {
I like it :)
};
use omicron_common::api::internal::nexus::InstanceProperties;
use sled_storage::disk::{RawDisk, SyntheticDisk};
use sled_storage::manager::FakeStorageManager;
let (put_tx, put_rx) = oneshot::channel();

inst.put_state(put_tx, InstanceStateRequested::Running)
This appears to be the real start of the code, after all the setup. I think this test would be clearer if you indicated what was going on here. I believe this call to inst.put_state is actually sending a request to start the instance to a fake propolis server, which eventually acks that the instance is running by replying on put_rx. A comment to that effect would help readers of the code.
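(Roughly the kind of comment being asked for, as a sketch only:)

// Ask the instance to start. The fake propolis server set up above will
// eventually acknowledge that the instance is running by replying on put_rx.
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");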
use illumos_utils::opte::params::DhcpConfig;
use illumos_utils::svc::__wait_for_service::Context as MockWaitForServiceContext;
use illumos_utils::zone::MockZones;
use illumos_utils::zone::__mock_MockZones::__boot::Context as MockZonesBootContext;
I believe we are largely trying to get rid of mocks, as they introduce global state across tests. I'm not sure if there is a better way to do this other than some direct fake trickery, though. CC @smklein
right, i think ideally we eventually replace all the uses of mockall in illumos_utils with fakes that we can plumb through anything that uses them - which may be a bit cumbersome, as currently a great deal of that functionality (including what's used here) consists of static methods.
though if i understand correctly, nextest runs each test case in its own process, and if so these simple stubs returning a static value shouldn't be stepping on anything substantial
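(A self-contained sketch of the fake-based approach being discussed, making no assumptions about the real illumos_utils API: ZoneApi and FakeZones are invented names for illustration. The point is that the fake is plumbed through as a value, so there is no global mock state shared across tests.)

/// Hypothetical trait abstracting the static zone operations that mockall
/// currently mocks.
trait ZoneApi: Send + Sync {
    fn boot(&self, zone: &str) -> Result<(), String>;
}

/// A fake implementation that each test constructs for itself.
#[derive(Default)]
struct FakeZones {
    booted: std::sync::Mutex<Vec<String>>,
}

impl ZoneApi for FakeZones {
    fn boot(&self, zone: &str) -> Result<(), String> {
        self.booted.lock().unwrap().push(zone.to_string());
        Ok(())
    }
}

fn main() {
    let zones = FakeZones::default();
    let api: &dyn ZoneApi = &zones;
    api.boot("oxz_propolis-server_test").unwrap();
    assert_eq!(zones.booted.lock().unwrap().len(), 1);
}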
The problem that I ran into, which was somewhat alluded to, was #4481. It doesn't appear to be a problem specifically here, though.
inst.put_state(put_tx, InstanceStateRequested::Running)
    .await
    .expect("failed to send Instance::put_state");
Can we call tokio::time::advance here, so that we don't have to wait for the full TIMEOUT_DURATION? This will also allow us to increase that duration without making the success case necessarily take longer, while still giving plenty of time so we don't get flaky runs in CI.
I think to do this, you'll have to create the timeout future, then advance the time, and then await the future.
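(A minimal, self-contained sketch of that pattern, assuming tokio's "test-util" feature is enabled; TIMEOUT_DURATION here is a stand-in for the constant used in the real test.)

use std::time::Duration;
use tokio::time::{advance, pause, timeout};

#[tokio::test]
async fn times_out_without_sleeping_in_real_time() {
    // Requires tokio's "test-util" feature. Freeze the runtime clock so the
    // test never waits in wall-clock time.
    pause();

    const TIMEOUT_DURATION: Duration = Duration::from_secs(300);
    let (_put_tx, put_rx) = tokio::sync::oneshot::channel::<()>();

    // Create the timeout future first, then advance the clock, then await it.
    let fut = timeout(TIMEOUT_DURATION, put_rx);
    advance(TIMEOUT_DURATION + Duration::from_millis(1)).await;
    assert!(fut.await.is_err(), "expected the receiver to hit the timeout");
}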
sled-agent/src/instance.rs
    .await
    .expect("failed to send Instance::put_state");

timeout(TIMEOUT_DURATION, put_rx)
Same as above, let's use tokio::time::advance.
At time of writing, instance creation roughly looks like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
    - `RunningZone::install` and `Zones::boot`
    - `illumos_utils::svc::wait_for_service`
    - `self::wait_for_http_server` for propolis-server itself
  - sled-agent: `Instance::ensure_propolis_and_tasks`
  - sled-agent: spawn `Instance::monitor_state_task`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
- sled-agent: return ok result
- nexus: `handle_instance_put_result`

Or at least, it does in the happy path. omicron#3927 saw propolis zone creation take longer than the minute allowed for nexus's call to sled-agent's `instance_put_state`. That might've looked something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: `Instance::propolis_ensure`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
    - `RunningZone::install` and `Zones::boot`
- nexus: i've been waiting a whole minute for this. connection timeout!
- nexus: `handle_instance_put_result`
- sled-agent: [...] return... oh, they hung up. :(

To avoid this timeout being implicit at the *Dropshot configuration* layer (that is to say, we should still have *some* timeout), we could consider a small refactor to make `instance_put_state` not a blocking call -- especially since it's already sending nexus updates on its progress via out-of-band `cpapi_instances_put` calls! That might look something like:
- nexus -> sled-agent: `instance_put_state`
- sled-agent: `InstanceManager::ensure_state`
- sled-agent: spawn {
  - sled-agent: `Instance::propolis_ensure`
    - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent: `Instance::setup_propolis_locked` (blocking!)
  - sled-agent: `Instance::ensure_propolis_and_tasks`
  - sled-agent: spawn `Instance::monitor_state_task`
  - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
  - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` nexus currently invokes after getting the response from the blocking call
- }

(With a way for nexus to cancel an instance creation by ID, and a timeout in sled-agent itself for terminating the attempt and reporting the failure back to nexus, and a shorter threshold for logging the event of an instance creation taking a long time.)

Before such a change, though, we should really have some more tests around sled-agent's instance creation code at all! So here's a few.
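(To make the decoupled shape above concrete, here is a generic, self-contained sketch of the pattern: the request handler spawns the slow work and returns immediately, and the outcome is reported out-of-band on a channel. All names below are illustrative stand-ins, not the actual sled-agent or nexus APIs.)

use std::time::Duration;
use tokio::sync::mpsc;

/// Stand-in for the slow propolis-zone installation step.
async fn install_zone() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Ok(())
}

/// The request handler: spawn the slow work and return right away, the way
/// the real code returns an HTTP response without waiting for the zone.
async fn handle_put_state(report_tx: mpsc::Sender<Result<(), String>>) {
    tokio::spawn(async move {
        let result = install_zone().await;
        // Stand-in for reporting the outcome back to nexus (e.g. a cpapi call).
        let _ = report_tx.send(result).await;
    });
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(1);
    handle_put_state(tx).await;
    // The "nexus" side later observes the outcome out-of-band.
    println!("installation result: {:?}", rx.recv().await);
}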
Alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927) by making the zone installation not happen during a request handler. Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed, e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
Restore instance tests added in #4691 so that they work with the latest updates in #5172. The tests now use the new `StorageManagerTestHarness` from #5172 instead of the since removed `FakeStorageManager`. Additionally, the dns server code from `sled-agent/src/fakes/nexus` is used rather than the reimplementation in #4691. This shrinks the amount of code added with these tests.