When crucible is unreachable instance-create saga can not unwind #5022

Closed
augustuswm opened this issue Feb 7, 2024 · 5 comments · Fixed by #5789

@augustuswm
Contributor

Feel free to close if this is a duplicate, as I feel like this was discussed before, but I could not find the exact issue.

When creating an instance, if the instance-create saga requests a region from an unavailable Crucible zone, the saga will eventually fail with a timeout and unwind, attempting to clean up any regions that were created. During this unwind it will again try to reach the zone that timed out, as it does not know whether the timed-out request actually created a region. If the zone is fully unreachable and not coming back, the saga will hang, repeatedly trying to reach it:

21:43:09.252Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: fe770c10-0ad3-4d15-96e5-1b7541aa1591, silo_id: 3274bfe0-2cef-4983-9315-740cf1e4ad19, .. })
    actor_id = fe770c10-0ad3-4d15-96e5-1b7541aa1591
    authenticated = true
    resource = Database
    result = Ok(())
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
    saga_node = SpaceAccount
21:43:09.261Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Succeeded(Null)
    node_id = 171
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.263Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Started
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.265Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Succeeded(Null)
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.267Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Started
    node_id = 173
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:09.383Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 272.667089ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.385Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 223.878843ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.611Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 423.714892ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:09.657Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 402.61976ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:10.036Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 1.492108863s
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:10.061Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Region requested, not yet created. Retrying in 1.45481525s
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.382Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("Communication Error: error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions): operation timed out")}} })
    node_id = 173
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.385Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): update for saga cached state
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:1332
    new_state = Unwinding
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    sec_id = 95ebe94d-0e68-421d-9260-c30bd7fe4bd6
21:43:24.385Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): updating state
    file = nexus/db-queries/src/db/sec_store.rs:72
    new_state = unwinding
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.401Z DEBG 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): recording saga event
    event_type = UndoStarted
    node_id = 172
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
21:43:24.404Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): sdc_regions_ensure_undo: Deleting crucible regions
    file = nexus/src/app/sagas/disk_create.rs:545
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.548Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 146.578465ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.549Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 54.971078ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.606Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 136.157882ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.696Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 177.300212ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.743Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): Transient(region not deleted yet), trying again in 192.446679ms
    file = nexus/src/app/sagas/common_storage.rs:290
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.875Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): region deleted
    file = nexus/src/app/sagas/common_storage.rs:275
    region_id = 54f13fed-7e31-4580-b1a3-e4ff9ea05884
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:24.938Z INFO 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): region deleted
    file = nexus/src/app/sagas/common_storage.rs:275
    region_id = 1c6cf3ca-acc0-4d6b-a27d-e422feb6ac00
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): failed external call (Communication Error: error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out), will retry in 299.584333ms
    file = nexus/src/app/sagas/mod.rs:403
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create
21:43:54.818Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create

This appears to occur in the unwind of the disk-create sub-saga when it attempts to delete Crucible regions. The delete will loop forever until it is able to access the requested region.

To an end user this is reported as an instance stuck in the creating state, even though at this point it is guaranteed never to complete (i.e. it is trying to destroy itself rather than start up).

@askfongjojo

The bug that showed similar behavior is #4331 but I think this may be a new issue.

@morlandi7 morlandi7 added this to the 6 milestone Feb 8, 2024
@jmpesp
Contributor

jmpesp commented Feb 8, 2024

I believe it's the same issue as #4331 - saga code can only proceed once a call to an external service has a known result of some kind, positive or negative, and both issues stem from trying to get a known result out of an unreachable service.

@jmpesp
Contributor

jmpesp commented Feb 8, 2024

Nexus will retry in a loop, which is taken care of by the retry_until_known_result function. The message

21:43:39.516Z WARN 95ebe94d-0e68-421d-9260-c30bd7fe4bd6 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:111::5]:32345/crucible/0/regions/71d2cb1c-2ca1-40d0-9f53-98a9b76df51a): operation timed out, retrying...
    file = nexus/src/app/sagas/mod.rs:358
    saga_id = daf91d7c-7928-418a-ae54-e58c4a31b3ab
    saga_name = instance-create

comes from the section of that function that responds to progenitor_client::Error::CommunicationError:

                    Err(progenitor_client::Error::CommunicationError(e)) => {
                        warn!(
                            log,
                            "saw transient communication error {}, retrying...",
                            e,
                        );

                        Err(backoff::BackoffError::transient(
                            progenitor_client::Error::CommunicationError(e),
                        ))
                    }

It would be incorrect to treat a timeout as a permanent problem unless Nexus knows the service is never coming back. For example, if there's a transient network problem, the saga will simply wait in this backoff loop until it is resolved.

The problem arises when a service won't ever come back: the saga will never resolve, which is exactly what this issue is seeing.

It makes sense that there should be a way to bail out of the backoff retry if it could be known that the timeouts were the result of something non-transient.
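
A minimal, self-contained sketch of what such a bailout might look like (the names retry_until_known_or_gone, Attempt, and is_gone are hypothetical, not the actual Nexus code): the retry loop only exits when the service produces a known result, or when the "gone" check says it never will.

    use std::thread::sleep;
    use std::time::Duration;

    /// Outcome of a single attempt against an external service.
    enum Attempt<T, E> {
        /// The service answered: the result (success or failure) is known.
        Known(Result<T, E>),
        /// Communication failed: the result is unknown and must be retried.
        Unknown,
    }

    /// Retry `operation` until it yields a known result, bailing out early if
    /// `is_gone` reports that the destination service will never come back.
    fn retry_until_known_or_gone<T, E>(
        mut operation: impl FnMut() -> Attempt<T, E>,
        is_gone: impl Fn() -> bool,
    ) -> Option<Result<T, E>> {
        let mut delay = Duration::from_millis(250);
        loop {
            match operation() {
                Attempt::Known(result) => return Some(result),
                Attempt::Unknown => {
                    // If the destination is gone (e.g. the sled or disk backing
                    // the Crucible agent has been expunged), a known result
                    // will never arrive, so stop retrying.
                    if is_gone() {
                        return None;
                    }
                    sleep(delay);
                    // Exponential backoff, capped at 30 seconds.
                    delay = (delay * 2).min(Duration::from_secs(30));
                }
            }
        }
    }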

@davepacheco
Collaborator

For permanent issues, there's #4259. If this is expected for transient issues, maybe this issue is a dup of that one?

@morlandi7 morlandi7 modified the milestones: 6, 7 Feb 26, 2024
@morlandi7 morlandi7 modified the milestones: 7, 8 Mar 12, 2024
@davepacheco
Collaborator

We discussed on today's update call whether this issue is a blocker for R8. It is currently marked as such and nobody's assigned to work on it. We did not come to any conclusion. My hope was that this wasn't very relevant for R8 because I assumed (1) the instance-create saga is only trying to reach Crucible on sleds where a disk was provisioned in the same saga, and (2) during our initial use of sled expungement, we won't have in-progress instance creations; and if we did, they wouldn't be creating disks on the expunged sled. I'm not sure whether these assumptions are true, and nobody else weighed in, so I'm not sure how essential this is for R8. @askfongjojo mentioned that we could also document recovery steps if this were to happen.

@morlandi7 morlandi7 modified the milestones: 8, 9 May 13, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue May 17, 2024
If there's a call to an external service, saga execution cannot move
forward until the result of that call is known, in the sense that Nexus
received a result. If there are transient problems, Nexus must retry
until a known result is returned.

This is problematic when the destination service is gone - Nexus will
retry indefinitely, halting the saga execution. Worse, in the case of
sagas calling the volume delete subsaga, subsequent calls will also
halt.

With the introduction of a physical disk policy, Nexus can know when to
stop retrying a call - the destination service is gone, so the known
result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an
operation to retry plus a "gone" check, and checks each retry iteration
if the destination is gone. If it is, then bail out, otherwise assume
that any errors seen are transient.

Further work is required to deprecate the `retry_until_known_result`
function, as retrying indefinitely is a bad pattern.

Fixes oxidecomputer#4331
Fixes oxidecomputer#5022
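
As a rough illustration of the shape this commit message describes (an object pairing an operation with a "gone" check), something like the following; this is only a sketch, not the exact ProgenitorOperationRetry API that landed in #5789:

    // Hypothetical stand-in for the ProgenitorOperationRetry described above.
    struct OperationRetry<Op, Gone> {
        operation: Op,
        gone_check: Gone,
    }

    impl<Op, Gone, T, E> OperationRetry<Op, Gone>
    where
        Op: FnMut() -> Result<T, E>,
        Gone: Fn() -> bool,
    {
        /// Retry the operation, checking on each iteration whether the
        /// destination is gone; if it is, the known result is an error.
        fn run(mut self) -> Result<T, E> {
            loop {
                match (self.operation)() {
                    Ok(value) => return Ok(value),
                    // Destination is gone: stop retrying and report the error.
                    Err(error) if (self.gone_check)() => return Err(error),
                    // Otherwise assume the error is transient and retry (a
                    // real implementation would back off between attempts).
                    Err(_) => continue,
                }
            }
        }
    }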
@jmpesp jmpesp closed this as completed in b07382f May 30, 2024