sled-agent: don't block during instance creation request from nexus #4691

Merged

Commits on Mar 16, 2024

  1. Add some unit tests for sled-agent Instance creation

    At time of writing, instance creation roughly looks like:
    
    - nexus -> sled-agent: `instance_put_state`
      - sled-agent: `InstanceManager::ensure_state`
        - sled-agent: `Instance::propolis_ensure`
          - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
          - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
            - `RunningZone::install` and `Zones::boot`
            - `illumos_utils::svc::wait_for_service`
            - `self::wait_for_http_server` for propolis-server itself
          - sled-agent: `Instance::ensure_propolis_and_tasks`
            - sled-agent: spawn `Instance::monitor_state_task`
          - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
        - sled-agent: return ok result
    - nexus: `handle_instance_put_result`
    
    Or at least, it does in the happy path. omicron#3927 saw propolis zone
    creation take longer than the one minute nexus allows for its call to
    sled-agent's `instance_put_state`. That might've looked something like:
    
    - nexus -> sled-agent: `instance_put_state`
      - sled-agent: `InstanceManager::ensure_state`
        - sled-agent: `Instance::propolis_ensure`
          - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
          - sled-agent: `Instance::setup_propolis_locked` (*blocking!*)
            - `RunningZone::install` and `Zones::boot`
    - nexus: i've been waiting a whole minute for this. connection timeout!
    - nexus: `handle_instance_put_result`
        - sled-agent: [...] return... oh, they hung up. :(
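
As a rough illustration of the failure mode above (this is not sled-agent's code), the sketch below models a caller that gives up after a fixed timeout while the handler is still doing its setup work. Everything here is hypothetical: `blocking_call`, `CallOutcome`, and the use of plain threads and channels in place of HTTP requests are assumptions made only for the example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// What the caller (standing in for nexus) observes. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum CallOutcome {
    Completed,
    TimedOut,
}

/// Hypothetical stand-in for a blocking request: the handler does all of its
/// setup before replying, and the caller waits at most `client_timeout`.
pub fn blocking_call(setup_time: Duration, client_timeout: Duration) -> CallOutcome {
    let (reply_tx, reply_rx) = mpsc::channel::<()>();

    // This thread plays the role of the request handler.
    thread::spawn(move || {
        // Simulated zone install and boot.
        thread::sleep(setup_time);
        // If the caller already hung up, this send fails and the reply is lost.
        let _ = reply_tx.send(());
    });

    // The caller blocks here, but only for as long as its own timeout allows.
    match reply_rx.recv_timeout(client_timeout) {
        Ok(()) => CallOutcome::Completed,
        Err(_) => CallOutcome::TimedOut,
    }
}
```

When `setup_time` exceeds `client_timeout`, the caller sees `TimedOut` even though the handler keeps working and eventually finishes, which is the "oh, they hung up" case.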
    
    To avoid this timeout being implicit at the *Dropshot configuration*
    layer (that is to say, we should still have *some* timeout),
    we could consider a small refactor to make `instance_put_state` not a
    blocking call -- especially since it's already sending nexus updates on
    its progress via out-of-band `cpapi_instances_put` calls! That might look
    something like:
    
    - nexus -> sled-agent: `instance_put_state`
      - sled-agent: `InstanceManager::ensure_state`
        - sled-agent: spawn {
          - sled-agent: `Instance::propolis_ensure`
            - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
            - sled-agent: `Instance::setup_propolis_locked` (blocking!)
            - sled-agent: `Instance::ensure_propolis_and_tasks`
              - sled-agent: spawn `Instance::monitor_state_task`
            - sled-agent -> nexus: `cpapi_instances_put` (if not migrating)
            - sled-agent -> nexus: a cpapi call equivalent to the `handle_instance_put_result` that nexus currently invokes after getting the response from the blocking call
        }
    
    (With a way for nexus to cancel an instance creation by ID, and a timeout
    in sled-agent itself for terminating the attempt and reporting the failure
    back to nexus, and a shorter threshold for logging the event of an instance
    creation taking a long time.)
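
The proposed shape could be sketched roughly as follows, assuming a handler that returns immediately and a worker that reports the outcome out of band. None of these names come from omicron: `ensure_state_nonblocking`, `Report`, and the channel standing in for a cpapi call to nexus are hypothetical, and cancellation by ID and the long-running-creation log threshold are left out.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical out-of-band notification, standing in for a cpapi call to nexus.
#[derive(Debug, PartialEq)]
pub enum Report {
    Running { instance_id: u32 },
    Failed { instance_id: u32, reason: String },
}

/// Returns as soon as the work is spawned; the result arrives later on `nexus`.
pub fn ensure_state_nonblocking(
    instance_id: u32,
    setup_time: Duration,
    setup_deadline: Duration,
    nexus: Sender<Report>,
) {
    thread::spawn(move || {
        let (done_tx, done_rx) = mpsc::channel::<()>();

        // Simulated slow setup (zone install, boot, waiting for the server).
        thread::spawn(move || {
            thread::sleep(setup_time);
            let _ = done_tx.send(());
        });

        // The timeout now lives on the agent side rather than in the caller's
        // connection. Note: this sketch only stops waiting; it does not
        // actually cancel the simulated setup thread.
        let report = match done_rx.recv_timeout(setup_deadline) {
            Ok(()) => Report::Running { instance_id },
            Err(_) => Report::Failed {
                instance_id,
                reason: "setup exceeded deadline".to_string(),
            },
        };

        // Report the outcome regardless of whether the original request
        // is still open.
        let _ = nexus.send(report);
    });
}
```

The point of the design is that the caller's request completes right away, so a slow setup surfaces as an explicit `Failed` report instead of a dropped connection.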
    
    Before such a change, though, we should really have some more tests around
    sled-agent's instance creation code at all! So here's a few.
    lif committed Mar 16, 2024
    c0ba701
  2. sled-agent: don't block during instance creation request from nexus

    Alleviate the request timeouts that occur when propolis zone installation takes too long
    (oxidecomputer#3927, "Propolis zone installation took 81 seconds and caused instance start to time out")
    by moving the zone installation out of the request handler.
    
    Since the instance creation request no longer blocks, some callers that had assumed a successful return from the Nexus call meant the instance existed now need to wait before proceeding,
    e.g. test_instance_serial now polls for the instance's running state before attempting to send serial console data requests.
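
The "poll before proceeding" pattern mentioned above might look something like this minimal sketch. `poll_until` is a hypothetical helper written for this example, not the helper the actual test uses, and it sleeps on a plain thread rather than an async runtime.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Calls `check` every `interval` until it yields `Some(value)` or `timeout`
/// elapses. Hypothetical helper for illustration only.
pub fn poll_until<T>(
    mut check: impl FnMut() -> Option<T>,
    interval: Duration,
    timeout: Duration,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        // Check first, so a condition that is already true returns at once.
        if let Some(value) = check() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for condition".to_string());
        }
        thread::sleep(interval);
    }
}
```

A test would pass a closure that fetches the instance's state and returns `Some(..)` only once it is running, and only then go on to send serial console requests.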
    lif committed Mar 16, 2024
    a3da136
  3. post-rebase updates

    lif committed Mar 16, 2024
    a007503
  4. fmt

    lif committed Mar 16, 2024
    b026a35
  5. clean up fake zone files after tests

    lif committed Mar 16, 2024
    71de433
  6. fixup

    lif committed Mar 16, 2024
    d55db5b
  7. cleanup fmt stable oddity

    lif committed Mar 16, 2024
    aa9da2c
  8. use tokio::time::advance

    lif committed Mar 16, 2024
    d5da2d4