
Do not retry indefinitely if service is gone #5789

Merged: 12 commits into oxidecomputer:main from bail_out_of_retry_loop, May 30, 2024

Conversation

@jmpesp (Contributor) commented May 17, 2024

If there's a call to an external service, saga execution cannot move forward until the result of that call is known, in the sense that Nexus received a result. If there are transient problems, Nexus must retry until a known result is returned.

This is problematic when the destination service is gone - Nexus will retry indefinitely, halting the saga execution. Worse, when the halted saga is one that calls the volume delete subsaga, subsequent calls to volume delete will also halt.

With the introduction of a physical disk policy, Nexus can know when to stop retrying a call - the destination service is gone, so the known result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an operation to retry plus a "gone" check, and on each retry iteration checks whether the destination is gone. If it is, bail out; otherwise, assume that any errors seen are transient.

Further work is required to deprecate the `retry_until_known_result` function, as retrying indefinitely is a bad pattern.
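
To make the shape concrete, here is a minimal sketch of the retry-with-gone-check pattern, written against the `backoff` helpers that appear elsewhere in this diff. The error enum and function signature below are illustrative assumptions for the sketch, not the actual `ProgenitorOperationRetry` API added by this commit.

```rust
use std::future::Future;

use omicron_common::backoff::{self, BackoffError};
use slog::{warn, Logger};

/// Illustrative error type for this sketch; the real
/// `ProgenitorOperationRetryError` in this PR may have different variants.
#[derive(Debug)]
pub enum RetrySketchError<E> {
    /// The destination service is gone; the known result is an error.
    Gone,
    /// The "gone" check itself failed.
    GoneCheckFailed(E),
    /// The underlying operation failed.
    Operation(E),
}

/// Retry `operation`, but run `gone_check` on every iteration and bail out
/// permanently if the destination service is gone. All other operation
/// errors are assumed to be transient and are retried.
pub async fn retry_unless_gone<T, E, Op, OpFut, Gone, GoneFut>(
    log: &Logger,
    operation: Op,
    gone_check: Gone,
) -> Result<T, RetrySketchError<E>>
where
    E: std::fmt::Debug,
    Op: Fn() -> OpFut,
    OpFut: Future<Output = Result<T, E>>,
    Gone: Fn() -> GoneFut,
    GoneFut: Future<Output = Result<bool, E>>,
{
    backoff::retry_notify(
        backoff::retry_policy_internal_service_aggressive(),
        || async {
            // Check whether the destination is gone before each attempt.
            match gone_check().await {
                Ok(true) => {
                    return Err(BackoffError::Permanent(
                        RetrySketchError::Gone,
                    ));
                }
                Ok(false) => {}
                Err(e) => {
                    return Err(BackoffError::Permanent(
                        RetrySketchError::GoneCheckFailed(e),
                    ));
                }
            }

            // Otherwise, attempt the operation and treat any error as
            // transient; the gone check above is what eventually turns the
            // retries into a permanent failure.
            operation().await.map_err(|e| {
                BackoffError::transient(RetrySketchError::Operation(e))
            })
        },
        |error, delay| {
            warn!(log, "transient error, retrying";
                "error" => ?error, "delay" => ?delay);
        },
    )
    .await
}
```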

Fixes #4331
Fixes #5022

@jmpesp jmpesp requested a review from andrewjstone May 17, 2024 19:34
@hawkw (Member) left a comment

some smallish style suggestions to hopefully make the nested matches a little more comprehensible --- feel free to ignore me if you don't like them :)

common/src/progenitor_operation_retry.rs (review thread resolved)
Comment on lines +81 to +87
```rust
Ok(dest_is_gone) => {
    if dest_is_gone {
        return Err(BackoffError::Permanent(
            ProgenitorOperationRetryError::Gone
        ));
    }
}
```
@hawkw (Member)

style nit, take it or leave it: this could be a bit more concise as:

Suggested change from:

```rust
Ok(dest_is_gone) => {
    if dest_is_gone {
        return Err(BackoffError::Permanent(
            ProgenitorOperationRetryError::Gone
        ));
    }
}
```

to:

```rust
Ok(true) => return Err(BackoffError::Permanent(
    ProgenitorOperationRetryError::Gone
)),
Ok(false) => {},
```

@jmpesp (Contributor, Author)

In this case, I prefer the more explicit style, but thanks :)

nexus/src/app/crucible.rs (multiple review threads resolved)
@jmpesp (Contributor, Author) commented May 21, 2024

> some smallish style suggestions to hopefully make the nested matches a little more comprehensible --- feel free to ignore me if you don't like them :)

Quite the opposite, thanks for them! I find myself writing nested match code like this lately, so these suggestions help :)

@andrewjstone (Contributor) left a comment

James, overall I like the solution. It maps to what we discussed w.r.t. expungement. However, I left a few suggestions for architectural fixes that are probably worth at least a look.

common/src/lib.rs (review thread resolved)
nexus/db-queries/src/db/datastore/dataset.rs (review threads resolved)
nexus/src/app/crucible.rs (review threads resolved)
```rust
});

// It won't finish until the dataset is expunged.
tokio::time::sleep(Duration::from_secs(3)).await;
```
@andrewjstone (Contributor)

I know you need some heuristic to ensure something is not done, but I really find sleeps like this in tests to be problematic. Not only do they sometimes cause flakiness, but they also make tests take longer. 3 seconds here, 3 seconds there, and pretty soon you're talking real time.

My recommendation instead of a sleep is usually to put one side of a channel in the task and have the test task communicate with it to determine if a given state has been reached. That seems somewhat difficult in this case, as you are essentially checking to see if the call is hung. I'm not sure how to fix this, but I still don't like it!

@jmpesp (Contributor, Author)

I changed this test to just wait on the task in 0a45871; that works too.

@andrewjstone (Contributor)

This looks like it changes the semantics of the test, though. You now have a much more likely chance that the spawned task hasn't even started to run yet, making the `assert!(!jh.is_finished())` somewhat meaningless. I still think that, on balance, getting rid of the sleep is the right call, so I'm fine with this. But maybe make a note about why there is no sleep and what the goal of the test is.

@jmpesp (Contributor, Author)

Nice catch, added a oneshot to wait until the task starts in 6917f42
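
For readers following along, a minimal self-contained sketch of that pattern, assuming tokio's `oneshot` and `JoinHandle::is_finished`; this is not the actual test code from 6917f42:

```rust
use std::time::Duration;

use tokio::sync::oneshot;

#[tokio::test]
async fn spawned_task_signals_start_before_hanging_call() {
    let (started_tx, started_rx) = oneshot::channel();

    let jh = tokio::spawn(async move {
        // Signal the test that the task is actually running before making
        // the call that is expected to block until the dataset is expunged.
        started_tx.send(()).unwrap();

        // Stand-in for the Nexus call under test, which would retry until a
        // known result is returned.
        tokio::time::sleep(Duration::from_secs(60)).await;
    });

    // Wait for the signal instead of sleeping for an arbitrary interval;
    // only after this is `!jh.is_finished()` a meaningful assertion.
    started_rx.await.unwrap();
    assert!(!jh.is_finished());

    jh.abort();
}
```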

nexus/src/app/crucible.rs (review thread resolved)
```rust
backoff::retry_notify(
    backoff::retry_policy_internal_service_aggressive(),
    || async {
        let region = match self.maybe_get_crucible_region(
```
@andrewjstone (Contributor)

Maybe make a note that `self.maybe_get_crucible_region` will check for permanent errors, and so we don't need to add a check to this loop.

@jmpesp (Contributor, Author)

Sorry, I don't quite grok this comment - this section matches against `Err(e)`?

@andrewjstone (Contributor)

Could be outdated. In either case, I wasn't even clear enough for myself to remember what I was talking about. Feel free to ignore it.


```rust
// Return Ok if the dataset's agent is gone, no
// delete call is required.
Err(Error::Gone) => return Ok(()),
```
@andrewjstone (Contributor)

I finally noticed here that we don't log when something is gone. Should we add that logging to the `ProgenitorOperationRetry` code, or leave it to the users? In the latter case, we should add a log here, as well as in all other places `Gone` is returned.

@jmpesp (Contributor, Author)

Good point - I don't think `ProgenitorOperationRetry` is a good place for this, because it doesn't have any visibility into what is gone, just that the `gone_check` function returned true. So I've added it one layer up: see 0baa9c8
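
A rough sketch of what logging one layer up can look like; the error type, function name, and log message here are hypothetical and not the actual change in 0baa9c8:

```rust
use slog::{info, Logger};
use uuid::Uuid;

/// Illustrative error type for this sketch only.
#[allow(dead_code)]
#[derive(Debug)]
enum DeleteError {
    /// The destination service backing this dataset has been expunged.
    Gone,
    Other(String),
}

/// The caller knows *which* dataset is gone, so this is where the log line
/// lives; the retry helper itself only sees that the gone check returned true.
fn handle_region_delete_result(
    log: &Logger,
    dataset_id: Uuid,
    result: Result<(), DeleteError>,
) -> Result<(), DeleteError> {
    match result {
        // No delete call is required if the dataset's agent is gone.
        Err(DeleteError::Gone) => {
            info!(log, "dataset is gone, skipping delete call";
                "dataset_id" => %dataset_id);
            Ok(())
        }
        other => other,
    }
}
```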

@andrewjstone (Contributor)

Great! Thanks.

@andrewjstone (Contributor) left a comment

Thanks for all the hard work and cleanup @jmpesp!

@jmpesp merged commit b07382f into oxidecomputer:main on May 30, 2024
14 checks passed
@jmpesp deleted the bail_out_of_retry_loop branch on May 30, 2024, 22:05