-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propolis zone installation took 81 seconds and caused instance start to time out #3927
Milestone
Comments
lifning
pushed a commit
to lifning/omicron
that referenced
this issue
Mar 13, 2024
Alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927) by making the zone installation not happen during a request handler. Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed, e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
lifning
pushed a commit
to lifning/omicron
that referenced
this issue
Mar 16, 2024
Alleviating request timeouts occurring when propolis zone installation takes too long (Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927) by making the zone installation not happen during a request handler. Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed, e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
lifning
added a commit
that referenced
this issue
Mar 19, 2024
…4691) Alleviating request timeouts occurring when propolis zone installation takes too long - #3927 - by making the zone installation not happen during a request handler, and instead having sled-agent report the result of instance creation back to nexus with an internal cpapi call. In this world where instance create calls to nexus now no longer block on sled-agent's propolis zone installation finishing, care must be taken in e.g. integration tests not to assume you can, say, connect to the instance as soon as the instance create call returns. Accordingly, integration tests have been changed to poll until the instance is ready, and additionally some unit tests have been added for instance creation in sled-agent's Instance and InstanceManager.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Seen in selfhost. Relevant log snippet:
Sled agent installs Propolis zones synchronously in the course of handling an instance state PUT that requires a zone to be created. Nexus's sled agent client has a 60-second timeout on all calls, so if zone installation and Propolis VM setup combine to take more than 60 seconds, instance start will fail.
This was a moderately large instance (32 vCPUs/64 GiB). However, I think this is different from the large-instance provisioning case described in #3417 and the circumstances contemplated in oxidecomputer/propolis#471, because the problem should be independent of the VM's shape or configuration--this sled agent wasn't waiting for Propolis to allocate memory or connect to Crucible because there was no Propolis yet.
The text was updated successfully, but these errors were encountered: