When crucible is unreachable instance-create saga can not unwind #5022
Comments
The bug that showed similar behavior is #4331, but I think this may be a new issue.
I believe it's the same issue as #4331: the code in sagas can only proceed when there's a known result of some kind for a call to an external service, positive or negative, and both issues are a result of trying to get a known result out of unreachable services.
Nexus will retry in a loop, which is handled by the `retry_until_known_result` function. The message comes from the arm of that function that handles communication errors:

    Err(progenitor_client::Error::CommunicationError(e)) => {
        warn!(
            log,
            "saw transient communication error {}, retrying...",
            e,
        );

        Err(backoff::BackoffError::transient(
            progenitor_client::Error::CommunicationError(e),
        ))
    }

It would be incorrect to treat a timeout as a permanent problem unless Nexus knows the service is never coming back. For example, if there's a transient network problem, the saga will wait in this backoff loop until that is resolved. The problem arises when a service will never come back: the saga will then never resolve, which is what this issue is directly seeing. It makes sense that there should be a bailout from the backoff retry if it can be known that the timeouts are the result of something non-transient.
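To make the behavior described above concrete, here is a minimal synchronous sketch of a "retry until known result" loop. It is not the Nexus implementation: `CallError` is a hypothetical stand-in for `progenitor_client::Error` reduced to two cases, `retry_until_known` and its delay parameters are made up for illustration, and it uses only the standard library rather than the `backoff` crate.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `progenitor_client::Error`, reduced to two cases.
#[derive(Debug, PartialEq)]
pub enum CallError {
    /// No response was received (timeout, connection refused, ...).
    Communication(String),
    /// The service answered with an error: a known, negative result.
    Response(String),
}

/// Retry `op` until it yields a known result, positive or negative.
/// Communication errors are assumed to be transient and are retried
/// forever, with the delay doubling up to `max_delay`.
pub fn retry_until_known<T>(
    mut op: impl FnMut() -> Result<T, CallError>,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<T, CallError> {
    let mut delay = initial_delay;
    loop {
        match op() {
            // Known positive result: the saga can move forward.
            Ok(value) => return Ok(value),
            // Known negative result: the saga can start unwinding.
            Err(CallError::Response(msg)) => return Err(CallError::Response(msg)),
            // Unknown result: wait and try again. If the service never
            // comes back, this loop never exits.
            Err(CallError::Communication(msg)) => {
                eprintln!("saw transient communication error {}, retrying...", msg);
                sleep(delay);
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

In this sketch there is no exit from the `Communication` arm other than a later successful or failed response, which is the hang this issue describes.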
For permanent issues, there's #4259. If this is expected for transient issues, maybe this issue is a dup of that one?
We discussed on today's update call whether this issue is a blocker for R8. It is currently marked as such and nobody's assigned to work on it. We did not come to any conclusion. My hope was that this wasn't very relevant for R8 because I assumed (1) the instance-create saga only tries to reach Crucible on sleds where a disk was provisioned in the same saga, and (2) during our initial use of sled expungement, we won't have in-progress instance creations, and if we did, they wouldn't be creating disks on this sled. I'm not sure if these assumptions are true, and nobody else weighed in, so I'm not sure how essential this is for R8. @askfongjojo mentioned that we could also document recovery steps if this were to happen.
If there's a call to an external service, saga execution cannot move forward until the result of that call is known, in the sense that Nexus received a result. If there are transient problems, Nexus must retry until a known result is returned.

This is problematic when the destination service is gone: Nexus will retry indefinitely, halting the saga execution. Worse, in the case of sagas calling the volume delete subsaga, subsequent calls will also halt.

With the introduction of a physical disk policy, Nexus can know when to stop retrying a call: the destination service is gone, so the known result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an operation to retry plus a "gone" check, and checks each retry iteration if the destination is gone. If it is, then bail out, otherwise assume that any errors seen are transient.

Further work is required to deprecate the `retry_until_known_result` function, as retrying indefinitely is a bad pattern.

Fixes oxidecomputer#4331
Fixes oxidecomputer#5022
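The idea in the commit message (an operation to retry plus a "gone" check) can be sketched roughly as below. This is an illustration only, not the actual `ProgenitorOperationRetry` API: `retry_unless_gone`, `CallError`, and `RetryError` are hypothetical names, the code is synchronous and standard-library only, and running the gone check before each attempt is an assumption about ordering.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `progenitor_client::Error`, reduced to two cases.
#[derive(Debug, PartialEq)]
pub enum CallError {
    /// No response was received (timeout, connection refused, ...).
    Communication(String),
    /// The service answered with an error.
    Response(String),
}

/// Hypothetical error type for the retry loop below.
#[derive(Debug, PartialEq)]
pub enum RetryError {
    /// The gone check reported that the destination is never coming back.
    Gone,
    /// The destination answered with an error (a known result).
    Known(CallError),
}

/// Retry `op` until it returns a known result or `is_gone` reports that
/// the destination is gone. Assumption: the gone check runs once per
/// iteration, before the operation is attempted.
pub fn retry_unless_gone<T>(
    mut op: impl FnMut() -> Result<T, CallError>,
    mut is_gone: impl FnMut() -> bool,
    retry_delay: Duration,
) -> Result<T, RetryError> {
    loop {
        // Bail out: "gone" is itself a known (negative) result.
        if is_gone() {
            return Err(RetryError::Gone);
        }
        match op() {
            Ok(value) => return Ok(value),
            // Destination is not known to be gone, so treat this as transient.
            Err(CallError::Communication(_)) => sleep(retry_delay),
            Err(known) => return Err(RetryError::Known(known)),
        }
    }
}
```

With a check like this, a caller such as the region-delete step in an unwinding saga would receive `RetryError::Gone` instead of looping forever, and could decide how to treat that result.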
Feel free to close if this is a duplicate, as I feel like this was discussed before, but I could not find the exact issue.
When creating an instance, if the `instance-create` saga requests a region from an unavailable crucible zone, the saga will eventually fail with a timeout and attempt to clean up any regions that were created during the unwind. During this unwind it will again try to reach out to the region that timed out, as it does not know whether the timeout resulted in a region or not. If the region is fully unreachable and not coming back, the saga will hang, repeatedly trying to reach the zone.

This looks to occur in the unwind of the `disk-create` sub-saga when it attempts to delete crucible regions. The delete will loop forever until it is able to access the requested region.

To an end user this is reported as an instance stuck in the `creating` state, even though it is guaranteed at this point never to complete (i.e. it is trying to destroy itself as opposed to starting up).