When crucible is unreachable instance-create saga can not unwind #5022

Closed
augustuswm opened this issue Feb 7, 2024 · 5 comments · Fixed by #5789

@augustuswm
Contributor

Feel free to close if this is a duplicate, as I feel like this was discussed before, but I could not find the exact issue.

When creating an instance, if the instance-create saga requests a region from an unavailable Crucible zone, the saga will eventually fail with a timeout and unwind, attempting to clean up any regions that were created. During this unwind it will again try to reach the zone that timed out, as it does not know whether the timed-out request actually created a region. If the zone is fully unreachable and not coming back, the saga will hang, repeatedly trying to reach it:

21:43:09.252Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: fe770c10-0ad3-4d15-96e5-1b7541aa1591, silo_id: 3274bfe0-2cef-4983-9315-740cf1e4ad19, .. })
    actor_id = fe770c10-0ad3-4d15-96e5-1b7541aa1591
    authenticated = true
    resource = Database
    result = Ok(())
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
    saga_node = SpaceAccount
21:43:09.261Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Succeeded(Null)
    node_id = 171
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.263Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Started
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.265Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Succeeded(Null)
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.267Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Started
    node_id = 173
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.383Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 272.667089ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.385Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 223.878843ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.611Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 423.714892ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.657Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 402.61976ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:10.036Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 1.492108863s
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:10.061Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 1.45481525s
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.382Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Communication Error: error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions): operation timed out")}} })
    node_id = 173
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.385Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): update for saga cached state
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:1332
    new_state = Unwinding
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    sec_id = 95ebe94d-0e68-421d-9260-c30bd7fe4bd6
21:43:24.385Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): updating state
    file = nexus/db-queries/src/db/sec_store.rs:72
    new_state = unwinding
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.401Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = UndoStarted
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.404Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): sdc_regions_ensure_undo: Deleting crucible regions
    file = nexus/src/app/sagas/disk_create.rs:545
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.548Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 146.578465ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.549Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 54.971078ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.606Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 136.157882ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.696Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 177.300212ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.743Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 192.446679ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.875Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): region deleted
    file = nexus/src/app/sagas/common_storage.rs:275
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.938Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): region deleted
    file = nexus/src/app/sagas/common_storage.rs:275
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed external call (Communication Error: error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out), will retry in 299.584333ms
    file = nexus/src/app/sagas/mod.rs:403
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:54.818Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create

This appears to occur in the unwind of the disk-create sub-saga when it attempts to delete Crucible regions. The delete will loop forever until it is able to access the requested region.

To an end user this is reported as an instance stuck in the creating state, even though at this point it is guaranteed never to complete (i.e. it is trying to destroy itself rather than start up).

@askfongjojo

The bug that showed similar behavior is #4331 but I think this may be a new issue.

@morlandi7 morlandi7 added this to the 6 milestone Feb 8, 2024
@jmpesp
Contributor

jmpesp commented Feb 8, 2024

I believe it's the same issue as #4331 - saga code can only proceed once a call to an external service has a known result of some kind, positive or negative, and both issues stem from trying to get a known result out of an unreachable service.

@jmpesp
Contributor

jmpesp commented Feb 8, 2024

Nexus will retry in a loop, which is taken care of by the retry_until_known_result function. The message

21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create

comes from the section of that function that responds to progenitor_client::Error::CommunicationError:

                    Err(progenitor_client::Error::CommunicationError(e)) => {
                        warn!(
                            log,
                            "saw transient communication error {}, retrying...",
                            e,
                        );

                        Err(backoff::BackoffError::transient(
                            progenitor_client::Error::CommunicationError(e),
                        ))
                    }

It would be incorrect to treat a timeout as a permanent problem unless Nexus knows the service is never coming back. For example, if there's a transient network problem, the saga will simply wait in this backoff loop until it is resolved.

The problem arises when a service won't ever come back: the saga will never resolve, which is exactly what this issue is seeing.

It makes sense that there should be a way to bail out of the backoff retry if it could be known that the timeouts were the result of something non-transient.
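
A minimal, self-contained sketch of what such a bailout might look like (the names retry_until_known_or_gone, Attempt, and is_gone are hypothetical, not the actual Nexus code): the retry loop only exits when the service produces a known result, or when the "gone" check says it never will.

    use std::thread::sleep;
    use std::time::Duration;

    /// Outcome of a single attempt against an external service.
    enum Attempt<T, E> {
        /// The service answered: the result (success or failure) is known.
        Known(Result<T, E>),
        /// Communication failed: the result is unknown and must be retried.
        Unknown,
    }

    /// Retry `operation` until it yields a known result, bailing out early if
    /// `is_gone` reports that the destination service will never come back.
    fn retry_until_known_or_gone<T, E>(
        mut operation: impl FnMut() -> Attempt<T, E>,
        is_gone: impl Fn() -> bool,
    ) -> Option<Result<T, E>> {
        let mut delay = Duration::from_millis(250);
        loop {
            match operation() {
                Attempt::Known(result) => return Some(result),
                Attempt::Unknown => {
                    // If the destination is gone (e.g. the sled or disk backing
                    // the Crucible agent has been expunged), a known result
                    // will never arrive, so stop retrying.
                    if is_gone() {
                        return None;
                    }
                    sleep(delay);
                    // Exponential backoff, capped at 30 seconds.
                    delay = (delay * 2).min(Duration::from_secs(30));
                }
            }
        }
    }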

@davepacheco
Collaborator

For permanent issues, there's #4259. If this is expected for transient issues, maybe this issue is a dup of that one?

@morlandi7 morlandi7 modified the milestones: 6, 7 Feb 26, 2024
@morlandi7 morlandi7 modified the milestones: 7, 8 Mar 12, 2024
@davepacheco
Collaborator

We discussed on today's update call whether this issue is a blocker for R8. It is currently marked as such and nobody's assigned to work on it. We did not come to any conclusion. My hope was that this wasn't very relevant for R8 because I assumed (1) the instance-create saga is only trying to reach Crucible on sleds where a disk was provisioned in the same saga, and (2) during our initial use of sled expungement, we won't have in-progress instance creations; and if we did, they wouldn't be creating disks on the expunged sled. I'm not sure whether these assumptions are true, and nobody else weighed in, so I'm not sure how essential this is for R8. @askfongjojo mentioned that we could also document recovery steps if this were to happen.

@morlandi7 morlandi7 modified the milestones: 8, 9 May 13, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue May 17, 2024
If there's a call to an external service, saga execution cannot move
forward until the result of that call is known, in the sense that Nexus
received a result. If there are transient problems, Nexus must retry
until a known result is returned.

This is problematic when the destination service is gone - Nexus will
retry indefinitely, halting the saga execution. Worse, in the case of
sagas calling the volume delete subsaga, subsequent calls will also
halt.

With the introduction of a physical disk policy, Nexus can know when to
stop retrying a call - the destination service is gone, so the known
result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an
operation to retry plus a "gone" check, and checks each retry iteration
if the destination is gone. If it is, then bail out, otherwise assume
that any errors seen are transient.

Further work is required to deprecate the `retry_until_known_result`
function, as retrying indefinitely is a bad pattern.

Fixes oxidecomputer#4331
Fixes oxidecomputer#5022
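
As a rough illustration of the shape this commit message describes (an object pairing an operation with a "gone" check), something like the following; this is only a sketch, not the exact ProgenitorOperationRetry API that landed in #5789:

    // Hypothetical stand-in for the ProgenitorOperationRetry described above.
    struct OperationRetry<Op, Gone> {
        operation: Op,
        gone_check: Gone,
    }

    impl<Op, Gone, T, E> OperationRetry<Op, Gone>
    where
        Op: FnMut() -> Result<T, E>,
        Gone: Fn() -> bool,
    {
        /// Retry the operation, checking on each iteration whether the
        /// destination is gone; if it is, the known result is an error.
        fn run(mut self) -> Result<T, E> {
            loop {
                match (self.operation)() {
                    Ok(value) => return Ok(value),
                    // Destination is gone: stop retrying and report the error.
                    Err(error) if (self.gone_check)() => return Err(error),
                    // Otherwise assume the error is transient and retry (a
                    // real implementation would back off between attempts).
                    Err(_) => continue,
                }
            }
        }
    }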
@jmpesp jmpesp closed this as completed in b07382f May 30, 2024