Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propolis zone installation took 81 seconds and caused instance start to time out #3927

Closed
gjcolombo opened this issue Aug 21, 2023 · 0 comments · Fixed by #4691
Closed

Propolis zone installation took 81 seconds and caused instance start to time out #3927

gjcolombo opened this issue Aug 21, 2023 · 0 comments · Fixed by #4691
Assignees
Milestone

Comments

@gjcolombo
Copy link
Contributor

Seen in selfhost. Relevant log snippet:

21:19:35.260Z INFO SledAgent (InstanceManager): ensuring instance is registered
    file = sled-agent/src/instance_manager.rs:174
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
    propolis_id = 0f8796d9-c4df-4c72-89a2-2fd231fc395e
21:19:35.261Z INFO SledAgent (InstanceManager): registering new instance
    file = sled-agent/src/instance_manager.rs:204
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:19:36.043Z INFO SledAgent (InstanceManager): Ensuring new instance
    file = sled-agent/src/instance.rs:725
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:19:36.162Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:275
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
    state = InstanceRuntimeState { run_state: Starting, sled_id: ca79c08b-d31c-4769-95dc-df47aa190475, propolis_id: 0f8796d9-c4df-4c72-89a2-2fd231fc395e, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:105::53]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(32), memory: ByteCount(68719476736), hostname: "hubris", gen: Generation(2), time_updated: 2023-08-21T21:19:36.043175230Z }
21:19:38.477Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_0f8796d9-c4df-4c72-89a2-2fd231fc395e
    file = illumos-utils/src/zone.rs:314
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:19:42.189Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_0f8796d9-c4df-4c72-89a2-2fd231fc395e
    file = illumos-utils/src/zone.rs:353
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:20:39.371Z INFO SledAgent (InstanceManager): Profile for oxz_propolis-server_0f8796d9-c4df-4c72-89a2-2fd231fc395e:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="system/illumos/propolis-server">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlInstance10"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:105::1"/>
            <propval type="astring" name="listen_addr" value="fd00:1122:3344:105::53"/>
            <propval type="astring" name="listen_port" value="12400"/>
            <propval type="astring" name="metric_addr" value="[fd00:1122:3344:10a::3]:12221"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
    file = sled-agent/src/profile.rs:34
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:20:39.386Z INFO SledAgent (InstanceManager): Zone booting
    file = illumos-utils/src/running_zone.rs:498
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
    zone = oxz_propolis-server_0f8796d9-c4df-4c72-89a2-2fd231fc395e
21:20:54.002Z INFO SledAgent (InstanceManager): Started propolis in zone: oxz_propolis-server_0f8796d9-c4df-4c72-89a2-2fd231fc395e
    file = sled-agent/src/instance.rs:970
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:20:54.998Z INFO SledAgent (InstanceManager): Propolis SMF service is online
    file = sled-agent/src/instance.rs:979
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:20:56.887Z INFO SledAgent (InstanceManager): Propolis HTTP server online
    file = sled-agent/src/instance.rs:996
    instance_id = d49bdb29-700a-4714-9501-d9626c1e89b2
21:20:57.150Z INFO SledAgent (InstanceManager): Sending ensure request to propolis

Sled agent installs Propolis zones synchronously in the course of handling an instance state PUT that requires a zone to be created. Nexus's sled agent client has a 60-second timeout on all calls, so if zone installation and Propolis VM setup combine to take more than 60 seconds, instance start will fail.

This was a moderately large instance (32 vCPUs/64 GiB). However, I think this is different from the large-instance provisioning case described in #3417 and the circumstances contemplated in oxidecomputer/propolis#471, because the problem should be independent of the VM's shape or configuration--this sled agent wasn't waiting for Propolis to allocate memory or connect to Crucible because there was no Propolis yet.

@askfongjojo askfongjojo added this to the 1.0.3 milestone Aug 22, 2023
@lifning lifning self-assigned this Aug 24, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 21, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 4, 2023
@morlandi7 morlandi7 linked a pull request Jan 26, 2024 that will close this issue
@morlandi7 morlandi7 modified the milestones: 6, 7 Jan 31, 2024
lifning pushed a commit to lifning/omicron that referenced this issue Mar 13, 2024
Alleviating request timeouts occurring when propolis zone installation takes too long
(Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927)
by making the zone installation not happen during a request handler.

Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed,
e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
lifning pushed a commit to lifning/omicron that referenced this issue Mar 16, 2024
Alleviating request timeouts occurring when propolis zone installation takes too long
(Propolis zone installation took 81 seconds and caused instance start to time out oxidecomputer#3927)
by making the zone installation not happen during a request handler.

Since the instance creation request no longer blocks, we need to wait before proceeding in some cases where we had assumed that a successful return from the Nexus call meant the instance existed,
e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
lifning added a commit that referenced this issue Mar 19, 2024
…4691)

Alleviating request timeouts occurring when propolis zone installation
takes too long - #3927 - by making the zone installation not happen
during a request handler, and instead having sled-agent report the
result of instance creation back to nexus with an internal cpapi call.

In this world where instance create calls to nexus now no longer block
on sled-agent's propolis zone installation finishing, care must be taken
in e.g. integration tests not to assume you can, say, connect to the
instance as soon as the instance create call returns.

Accordingly, integration tests have been changed to poll until the
instance is ready, and additionally some unit tests have been added
for instance creation in sled-agent's Instance and InstanceManager.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants