Instance start/delete sagas hang while sleds are unreachable #4259

gjcolombo · 2023-10-11T18:50:08Z

Repro environment: Seen on rack3 after a sled trapped into kmdb and became inoperable.

When an instance starts, the start saga calls Nexus::create_instance_v2p_mappings to ensure that every sled in the cluster knows how to route traffic directed to the instance's virtual IPs. This function calls sled_list to get the list of active sleds and then invokes the sled agent's set_v2p endpoint on each one. The calls to set_v2p are wrapped in a retry_until_known_result wrapper that treats Progenitor communication errors (including client timeouts) as transient errors requiring an operation to be retried (consider, for example, a request to do X that sled agent receives and begins processing but does not finish processing until Nexus has decided not to wait anymore; if this produces an error that unwinds the saga, X will not be undone, because a failure in a saga only undoes steps that previously completed successfully, not the one that produced the failure). Instance deletion does something similar to all this via delete_instance_v2p_mappings.

Rack3 has a sled that keeps panicking with symptoms of a known host OS issue. To better identify the problem, we set the sled up to drop into kmdb on panicking instead of rebooting. This rendered the sled's agent totally and permanently unresponsive. Since retry_until_known_result treats progenitor_client::Error::CommunicationErrors as transient errors, this caused all subsequent instance creation and deletion attempts to get stuck retrying the same attempt to edit V2P mappings on the sled being debugged, causing the relevant instances to get stuck in the Creating/Stopped states (soon to be the Starting/Stopped states once 4194 lands).

There are several things to unpack here (probably into their own issues):

retry_until_known_result doesn't have a way to bail out after a certain amount of time/number of attempts; even if it did, such a bailout would have to respect the undo rules for sagas described above
There's no way to recover when an instance gets stuck in a creating/starting state: the instance can't be stopped or destroyed, and even if these state transitions were allowed or there were some other way to "reset" the instance to a stopped state (see Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state #4004), there's no way to cooperatively cancel the instance's ongoing saga.
Recovering from the "bad sled" case is very challenging:
- There's currently no way to mark a sled as unhealthy or out-of-service; there's a time_deleted column in the sleds table, but the datastore's sled_list function doesn't filter on it, so create_instance_v2p_mappings won't ignore deleted sleds.
- Even if sled_list did ignore unhealthy sleds, there's still a race where create_instance_v2p_mappings decides to start talking to a sled before it's marked as unhealthy and never reconsiders that decision. (This feeds back into the first two items in this list.)

The text was updated successfully, but these errors were encountered:

morlandi7 · 2023-12-21T17:14:06Z

@internet-diglett Is there some alignment here with the other RPW work? (#4715 )

internet-diglett · 2024-01-02T19:07:33Z

@internet-diglett Is there some alignment here with the other RPW work? (#4715 )

@morlandi7 at the moment, I don't believe so unless there has been a decision to move v2p mappings to an RPW based model. @gjcolombo have there been any discussions about this?

askfongjojo · 2024-01-16T17:51:03Z

@internet-diglett - Perhaps a useful fix for now is modify the sled inclusion criteria to consider the time_deleted value. The change seems valid regardless of how we want to handle unresponsive sleds in general.

In situations like OS panic or sled-agent restart, we've seen in one customer's case that the saga was able to resume/complete once the problem sled came back up (not ideal but also not too bad). There are cases in which the sleds are out indefinitely but we'll take the necessary time to solve them through other ways.

davepacheco · 2024-01-26T19:25:49Z

Is there anything in the system that sets time_deleted on a sled today? I wouldn't have thought so. I'd suggest we use the policy field proposed in RFD 457 instead but that's basically the same idea, and I think it's a good idea, but has the same problem that I don't think it would help this problem in practice until we actually implement support for sled removal. It'd be tempting to use provision_state, but I don't think that's quite right because there might still be instances on a sled to which provisioning new instances is currently disabled.

(The rest of this is probably repeating what folks already know but it took me a while to understand the discussion above so I'm summarizing here for myself and others that might be confused.)

I think it's important to distinguish three cases:

sleds that are transiently unavailable,
sleds that are unavailable for an extended period (say, more than a few minutes) but we don't know if they're coming back, and
sleds that are permanently unavailable (which means an operator has told us that it's not coming back)

I can see how if a sled is unreachable for several minutes, we don't want all instance start/stop for instances on that sled to hang, and certainly not all start/stop for all instances. But we also don't want to give up forever. It might still have instances on it, it might come back, and it may need that v2p update, right? So I can see why we're asking about an RPW. I'm not that familiar with create_instance_v2p_mappings, but yeah, it sounds like an RPW may well be a better fit than a saga step. The RPW would do its best to update all sleds that it can. But if it can't reach some, no sweat -- it'll try again the next time the RPW is activated. And we can use the same pattern we use with other RPWs to report status (e.g., to omdb) about which sleds we've been able to keep updated to which set of v2p mappings.

davepacheco · 2024-01-26T19:28:27Z

@gjcolombo would you object to retitling this Instance start/delete sagas hang while sleds are unreachable?
(edit: confirmed no objection offline)

davepacheco · 2024-04-23T22:03:43Z

In today's update call, we discussed whether this was a blocker for R8. The conclusion is "no" because this should not be made any worse during sled expungement. The sled we plan to expunge in R8 is not running any instances and so should not need to have its v2p mappings updated as part of instance create/delete sagas. Beyond that, all instances are generally stopped before the maintenance window starts, and when they start again, the sled will be expunged and so not included in the list of sleds to update.

internet-diglett · 2024-05-30T02:04:20Z

@morlandi7 this should be resolved, but I left it open until someone verifies the work done in #5568 has actually resolved this issue on dogfood.

askfongjojo · 2024-06-28T23:13:48Z

Checked the current behavior on rack2: I put sled 23 to A2 and provisioned a bunch of instances. All of them stayed in starting state (they didn't transition to running after the sled was brought back to A0 and that's a different problem to be investigated).

According to https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/sagas/instance_start.rs#L61-L62, which in turn references #3879, it looks like fixing this requires one (hopefully small) lift.

internet-diglett · 2024-07-01T16:16:25Z

@askfongjojo I think that is an old comment that didn't get removed, as that saga node already has been updated (through a series of function calls) to use the nat rpw. Do you have the instance ids / any identifying information so I can check the logs to see what caused it to hang?

askfongjojo · 2024-07-01T19:44:15Z

Ah you are right. I retested just now and had no problem bringing up instances when one of the sled is offline. I probably ran into an issue related to some bad downstairs when I tested that last time. This time I'm testing with a brand new disk snapshot to avoid hitting the bad downstairs problem again.

askfongjojo added this to the MVP milestone Oct 12, 2023

morlandi7 modified the milestones: MVP, 6 Nov 27, 2023

askfongjojo mentioned this issue Jan 10, 2024

what's involved with removing a sled? #4787

Closed

morlandi7 modified the milestones: 6, 7 Jan 26, 2024

davepacheco changed the title ~~Instance start/delete sagas will spin forever if a sled is permanently unreachable~~ Instance start/delete sagas hang while sleds are unreachable Jan 26, 2024

davepacheco mentioned this issue Feb 9, 2024

When crucible is unreachable instance-create saga can not unwind #5022

Closed

hawkw self-assigned this Feb 20, 2024

askfongjojo mentioned this issue Mar 8, 2024

New instances in a rebooted sled are unable to reach existing instances in other sleds on their private IPs #5214

Closed

askfongjojo added the known issue To include in customer documentation and training label Mar 9, 2024

morlandi7 modified the milestones: 7, 8 Mar 12, 2024

hawkw removed their assignment Mar 28, 2024

davepacheco mentioned this issue Apr 19, 2024

RPW for OPTE v2p Mappings #5568

Merged

7 tasks

askfongjojo assigned internet-diglett May 7, 2024

askfongjojo modified the milestones: 8, 9 May 7, 2024

internet-diglett closed this as completed in 2082942 May 22, 2024

internet-diglett closed this as completed in #5568 May 22, 2024

internet-diglett reopened this May 22, 2024

askfongjojo closed this as completed Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instance start/delete sagas hang while sleds are unreachable #4259

Instance start/delete sagas hang while sleds are unreachable #4259

gjcolombo commented Oct 11, 2023 •

edited

Loading

morlandi7 commented Dec 21, 2023 •

edited

Loading

internet-diglett commented Jan 2, 2024

askfongjojo commented Jan 16, 2024

davepacheco commented Jan 26, 2024

davepacheco commented Jan 26, 2024 •

edited

Loading

davepacheco commented Apr 23, 2024

internet-diglett commented May 30, 2024

askfongjojo commented Jun 28, 2024 •

edited

Loading

internet-diglett commented Jul 1, 2024

askfongjojo commented Jul 1, 2024

Instance start/delete sagas hang while sleds are unreachable #4259

Instance start/delete sagas hang while sleds are unreachable #4259

Comments

gjcolombo commented Oct 11, 2023 • edited Loading

morlandi7 commented Dec 21, 2023 • edited Loading

internet-diglett commented Jan 2, 2024

askfongjojo commented Jan 16, 2024

davepacheco commented Jan 26, 2024

davepacheco commented Jan 26, 2024 • edited Loading

davepacheco commented Apr 23, 2024

internet-diglett commented May 30, 2024

askfongjojo commented Jun 28, 2024 • edited Loading

internet-diglett commented Jul 1, 2024

askfongjojo commented Jul 1, 2024

gjcolombo commented Oct 11, 2023 •

edited

Loading

morlandi7 commented Dec 21, 2023 •

edited

Loading

davepacheco commented Jan 26, 2024 •

edited

Loading

askfongjojo commented Jun 28, 2024 •

edited

Loading