When running a guest, there are certain conditions which may cause one or more of the vCPUs to exit from VM context with an event we are unable to properly handle. Notable examples of this include #333 and #300, where the guests appear to have jumped off into "space", leaving the instruction fetch/decode/emulate machinery unable to do its job.
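For concreteness, here is a minimal Rust sketch of where such an exit would surface in a vCPU run loop. The enum variants and function names are hypothetical stand-ins, not Propolis's actual vCPU or bhyve interfaces.

```rust
// Illustrative sketch only: these enum variants and function names are
// hypothetical stand-ins, not the actual Propolis/bhyve interfaces.
enum VcpuExit {
    Io { port: u16, write: bool },
    Mmio { gpa: u64, write: bool },
    InstructionEmulation { rip: u64 },
    // ...other exit kinds the emulator knows how to service...
    Unhandled { reason: u32, rip: u64 },
}

enum ExitOutcome {
    Unhandled { reason: u32, rip: u64 },
}

fn vcpu_loop(run: impl Fn() -> VcpuExit) -> ExitOutcome {
    loop {
        match run() {
            VcpuExit::Io { .. } => { /* dispatch to port-IO emulation */ }
            VcpuExit::Mmio { .. } => { /* dispatch to MMIO emulation */ }
            VcpuExit::InstructionEmulation { .. } => {
                // Fetch/decode/emulate. If %rip points off into "space"
                // (as in #333/#300), this machinery cannot do its job.
            }
            VcpuExit::Unhandled { reason, rip } => {
                // Today this path ends in abort(); the question in this
                // issue is what the default should be instead.
                return ExitOutcome::Unhandled { reason, rip };
            }
        }
    }
}
```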
The course of action we choose in such situations involves trade-offs. The current behavior of aborting the propolis process has the advantage of preserving as much state as possible from both the userspace emulation (saved in the core) and the kernel VMM and instance state (residing in the vmm device instance, as long as it is not removed). This may be beneficial for developer workflows, but for hosts running in production it is likely less than ideal.
Consumers in production likely expect a VM encountering an error condition like that to reboot, as if it had tripped over something like a triple-fault on a CPU. Rebooting the instance promptly at least allows it to return to service quickly. In such cases, we need to think about which bits of state we would want preserved from the machine and the fault conditions so they can be used for debugging later. In addition to the details about the vm-exit on the faulting vCPU(s), we could export the entire emulated device state (not counting DRAM) as if a migration were occurring. Customer policy could choose to prune that down, or even augment it with additional state from the guest (perhaps the page of memory underlying %rip at the time of exit?).
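As a rough illustration of what such a capture could look like, here is a hypothetical shape for the preserved state. None of these types exist in Propolis today, and the field set is only a guess at what would be useful.

```rust
// Hypothetical shape for the state preserved on an unhandled vm-exit; none
// of these types exist in Propolis today.
use std::collections::BTreeMap;

struct VmExitDetails {
    vcpu_id: u32,
    exit_reason: u32,
    rip: u64,
    // ...plus whatever architectural detail the exit carries...
}

struct UnhandledExitReport {
    /// Exit details for each vCPU that hit the unhandled condition.
    faulting_vcpus: Vec<VmExitDetails>,
    /// Emulated device state, exported as if a migration were occurring,
    /// keyed by device instance name (guest DRAM excluded).
    device_state: BTreeMap<String, Vec<u8>>,
    /// Optionally, the guest page backing %rip at the time of exit, subject
    /// to whatever policy governs capturing guest memory.
    rip_page: Option<[u8; 4096]>,
}
```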
With such a mechanism in place, we could still preserve the abort-on-unhandled-vmexit behavior if it is desired by developer workflows, but default to the more graceful mechanism for all other cases.
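One way to express that choice is as a policy the vCPU loop consults when it hits an unhandled exit. The sketch below is purely illustrative; the policy name, variants, and helper functions are assumptions, not existing Propolis configuration.

```rust
// All names here are hypothetical; this only sketches the shape of the knob.
struct UnhandledExitReport; // stand-in for the capture sketched above

enum UnhandledExitPolicy {
    /// Current behavior: abort propolis, preserving the core plus the kernel
    /// VMM/instance state for post-mortem debugging (developer workflows).
    Abort,
    /// Proposed default: capture a report and reset the instance as if it
    /// had triple-faulted, so it returns to service promptly.
    CaptureAndReboot,
}

fn persist_report(_report: UnhandledExitReport) { /* write somewhere durable */ }
fn reset_instance() { /* drive the existing instance-reset path */ }

fn on_unhandled_exit(policy: &UnhandledExitPolicy, report: UnhandledExitReport) {
    match policy {
        UnhandledExitPolicy::Abort => std::process::abort(),
        UnhandledExitPolicy::CaptureAndReboot => {
            persist_report(report);
            reset_instance();
        }
    }
}
```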
There's work currently in progress in the control plane to allow Nexus to automatically restart instances whose propolis-server has crashed (if configured to do so). In particular, oxidecomputer/omicron#6455 moves instances to the Failed state when their VMM has crashed, and oxidecomputer/omicron#6503 will add an RPW for restarting Failed instances if they have an "auto-restart" configuration set.
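For readers unfamiliar with that work, the control-plane side reduces to roughly the loop below. These are made-up stand-ins, not the actual Nexus/omicron types or the real RPW machinery from those PRs.

```rust
// Made-up stand-ins, not the actual Nexus/omicron types from #6455/#6503.
#[derive(PartialEq)]
enum InstanceState {
    Running,
    Failed,
}

struct Instance {
    state: InstanceState,
    auto_restart: bool,
}

/// One pass of the restart RPW: find Failed instances whose owners opted
/// into auto-restart and bring them back up.
fn restart_failed_instances(instances: &mut [Instance]) {
    for instance in instances.iter_mut() {
        if instance.state == InstanceState::Failed && instance.auto_restart {
            // In the real control plane this would provision a fresh VMM
            // (a new propolis-server) and start the instance on it.
            instance.state = InstanceState::Running;
        }
    }
}
```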
Potentially, we could leverage that here and just allow propolis-servers that encounter this kind of guest misbehavior to crash and leave behind a core dump, knowing that the control plane will restart the instance if that's what the user wanted. On the other hand, this is potentially less efficient than restarting the guest within the same propolis-server, since it requires the control plane to spin up a whole new VMM and start the instance there. But, I figured it was worth mentioning!
In the case of #755 (and similar circumstances), I don't think that crashing is at all ideal. If we have a mechanism for surfacing information for support, it's probably more sensible to collect additional information about the state of the guest (registers, etc.), since figuring that out from the propolis core dump alone will be challenging, if not impossible.
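A sketch of the kind of guest-state collection that would make a support bundle more useful than the core dump alone follows; the trait and accessor names are hypothetical, not a real bhyve or Propolis API.

```rust
// The trait and accessor names here are hypothetical, not a real bhyve or
// Propolis API; this only sketches what "collect the guest registers" means.
struct VcpuRegisters {
    rip: u64,
    rsp: u64,
    rflags: u64,
    cr0: u64,
    cr3: u64,
    cr4: u64,
    // ...remaining general-purpose, segment, and control registers...
}

trait VcpuStateSource {
    /// Hypothetical: read the architectural register file for one vCPU.
    fn read_registers(&self, vcpu_id: u32) -> VcpuRegisters;
}

/// Gather register state for the faulting vCPUs so a support bundle carries
/// more than the propolis core dump alone.
fn collect_guest_state(
    src: &dyn VcpuStateSource,
    faulting_vcpus: &[u32],
) -> Vec<(u32, VcpuRegisters)> {
    faulting_vcpus
        .iter()
        .map(|&id| (id, src.read_registers(id)))
        .collect()
}
```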