
Do not retry indefinitely if service is gone #5789

Merged: 12 commits into oxidecomputer:main from bail_out_of_retry_loop, May 30, 2024

Conversation

@jmpesp (Contributor) commented May 17, 2024

If there's a call to an external service, saga execution cannot move forward until the result of that call is known, in the sense that Nexus received a result. If there are transient problems, Nexus must retry until a known result is returned.

This is problematic when the destination service is gone - Nexus will retry indefinitely, halting the saga execution. Worse, when the halted saga is one that calls the volume delete subsaga, subsequent calls to volume delete will also halt.

With the introduction of a physical disk policy, Nexus can know when to stop retrying a call - the destination service is gone, so the known result is an error.

This commit adds a `ProgenitorOperationRetry` object that takes an operation to retry plus a "gone" check, and on each retry iteration checks whether the destination is gone. If it is, bail out; otherwise, assume that any errors seen are transient.

Further work is required to deprecate the `retry_until_known_result` function, as retrying indefinitely is a bad pattern.
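
To make the shape concrete, here is a minimal sketch of the retry-with-gone-check pattern, written against the `backoff` helpers that appear elsewhere in this diff. The error enum and function signature below are illustrative assumptions for the sketch, not the actual `ProgenitorOperationRetry` API added by this commit.

```rust
use std::future::Future;

use omicron_common::backoff::{self, BackoffError};
use slog::{warn, Logger};

/// Illustrative error type for this sketch; the real
/// `ProgenitorOperationRetryError` in this PR may have different variants.
#[derive(Debug)]
pub enum RetrySketchError<E> {
    /// The destination service is gone; the known result is an error.
    Gone,
    /// The "gone" check itself failed.
    GoneCheckFailed(E),
    /// The underlying operation failed.
    Operation(E),
}

/// Retry `operation`, but run `gone_check` on every iteration and bail out
/// permanently if the destination service is gone. All other operation
/// errors are assumed to be transient and are retried.
pub async fn retry_unless_gone<T, E, Op, OpFut, Gone, GoneFut>(
    log: &Logger,
    operation: Op,
    gone_check: Gone,
) -> Result<T, RetrySketchError<E>>
where
    E: std::fmt::Debug,
    Op: Fn() -> OpFut,
    OpFut: Future<Output = Result<T, E>>,
    Gone: Fn() -> GoneFut,
    GoneFut: Future<Output = Result<bool, E>>,
{
    backoff::retry_notify(
        backoff::retry_policy_internal_service_aggressive(),
        || async {
            // Check whether the destination is gone before each attempt.
            match gone_check().await {
                Ok(true) => {
                    return Err(BackoffError::Permanent(
                        RetrySketchError::Gone,
                    ));
                }
                Ok(false) => {}
                Err(e) => {
                    return Err(BackoffError::Permanent(
                        RetrySketchError::GoneCheckFailed(e),
                    ));
                }
            }

            // Otherwise, attempt the operation and treat any error as
            // transient; the gone check above is what eventually turns the
            // retries into a permanent failure.
            operation().await.map_err(|e| {
                BackoffError::transient(RetrySketchError::Operation(e))
            })
        },
        |error, delay| {
            warn!(log, "transient error, retrying";
                "error" => ?error, "delay" => ?delay);
        },
    )
    .await
}
```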

Fixes #4331
Fixes #5022

@jmpesp jmpesp requested a review from andrewjstone May 17, 2024 19:34
@hawkw (Member) left a comment

some smallish style suggestions to hopefully make the nested matches a little more comprehensible --- feel free to ignore me if you don't like them :)

common/src/progenitor_operation_retry.rs (review thread resolved)
Comment on lines +81 to +87
```rust
Ok(dest_is_gone) => {
    if dest_is_gone {
        return Err(BackoffError::Permanent(
            ProgenitorOperationRetryError::Gone
        ));
    }
}
```
@hawkw (Member)

style nit, take it or leave it: this could be a bit more concise as:

Suggested change from:

```rust
Ok(dest_is_gone) => {
    if dest_is_gone {
        return Err(BackoffError::Permanent(
            ProgenitorOperationRetryError::Gone
        ));
    }
}
```

to:

```rust
Ok(true) => return Err(BackoffError::Permanent(
    ProgenitorOperationRetryError::Gone
)),
Ok(false) => {},
```

@jmpesp (Contributor, Author)

In this case, I prefer the more explicit style, but thanks :)

nexus/src/app/crucible.rs (multiple review threads resolved)
@jmpesp (Contributor, Author) commented May 21, 2024

> some smallish style suggestions to hopefully make the nested matches a little more comprehensible --- feel free to ignore me if you don't like them :)

Quite the opposite, thanks for them! I find myself writing nested match code like this lately, so these suggestions help :)

@andrewjstone (Contributor) left a comment

James, overall I like the solution. It maps to what we discussed w.r.t. expungement. However, I left a few suggestions for architectural fixes that are probably worth at least a look.

common/src/lib.rs (review thread resolved)
nexus/db-queries/src/db/datastore/dataset.rs (review threads resolved)
nexus/src/app/crucible.rs (review threads resolved)
```rust
});

// It won't finish until the dataset is expunged.
tokio::time::sleep(Duration::from_secs(3)).await;
```
@andrewjstone (Contributor)

I know you need some heuristic to ensure something is not done, but I really find sleeps like this in tests to be problematic. Not only do they sometimes cause flakiness, but they also make tests take longer. 3 seconds here, 3 seconds there, and pretty soon you're talking real time.

My recommendation instead of a sleep is usually to put one side of a channel in the task and have the test task communicate with it to determine if a given state has been reached. That seems somewhat difficult in this case, as you are essentially checking to see if the call is hung. I'm not sure how to fix this, but I still don't like it!

@jmpesp (Contributor, Author)

I changed this test to just wait on the task in 0a45871; that works too.

@andrewjstone (Contributor)

This looks like it changes the semantics of the test, though. You now have a much more likely chance that the spawned task hasn't even started to run yet, making the `assert!(!jh.is_finished())` somewhat meaningless. I still think that, on balance, getting rid of the sleep is the right call, so I'm fine with this. But maybe make a note about why there is no sleep and what the goal of the test is.

@jmpesp (Contributor, Author)

Nice catch, added a oneshot to wait until the task starts in 6917f42
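
For readers following along, a minimal self-contained sketch of that pattern, assuming tokio's `oneshot` and `JoinHandle::is_finished`; this is not the actual test code from 6917f42:

```rust
use std::time::Duration;

use tokio::sync::oneshot;

#[tokio::test]
async fn spawned_task_signals_start_before_hanging_call() {
    let (started_tx, started_rx) = oneshot::channel();

    let jh = tokio::spawn(async move {
        // Signal the test that the task is actually running before making
        // the call that is expected to block until the dataset is expunged.
        started_tx.send(()).unwrap();

        // Stand-in for the Nexus call under test, which would retry until a
        // known result is returned.
        tokio::time::sleep(Duration::from_secs(60)).await;
    });

    // Wait for the signal instead of sleeping for an arbitrary interval;
    // only after this is `!jh.is_finished()` a meaningful assertion.
    started_rx.await.unwrap();
    assert!(!jh.is_finished());

    jh.abort();
}
```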

nexus/src/app/crucible.rs (review thread resolved)
```rust
backoff::retry_notify(
    backoff::retry_policy_internal_service_aggressive(),
    || async {
        let region = match self.maybe_get_crucible_region(
```
@andrewjstone (Contributor)

Maybe make a note that `self.maybe_get_crucible_region` will check for permanent errors, and so we don't need to add a check to this loop.

@jmpesp (Contributor, Author)

Sorry, I don't quite grok this comment - this section matches against `Err(e)`?

@andrewjstone (Contributor)

Could be outdated. In either case, I wasn't even clear enough for myself to remember what I was talking about. Feel free to ignore it.


```rust
// Return Ok if the dataset's agent is gone, no
// delete call is required.
Err(Error::Gone) => return Ok(()),
```
@andrewjstone (Contributor)

I finally noticed here that we don't log when something is gone. Should we add that logging to the `ProgenitorOperationRetry` code, or leave it to the users? In the latter case, we should add a log here, as well as in all other places `Gone` is returned.

@jmpesp (Contributor, Author)

Good point - I don't think `ProgenitorOperationRetry` is a good place for this, because it doesn't have any visibility into what is gone, just that the `gone_check` function returned true. So I've added it one layer up: see 0baa9c8
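
A rough sketch of what logging one layer up can look like; the error type, function name, and log message here are hypothetical and not the actual change in 0baa9c8:

```rust
use slog::{info, Logger};
use uuid::Uuid;

/// Illustrative error type for this sketch only.
#[allow(dead_code)]
#[derive(Debug)]
enum DeleteError {
    /// The destination service backing this dataset has been expunged.
    Gone,
    Other(String),
}

/// The caller knows *which* dataset is gone, so this is where the log line
/// lives; the retry helper itself only sees that the gone check returned true.
fn handle_region_delete_result(
    log: &Logger,
    dataset_id: Uuid,
    result: Result<(), DeleteError>,
) -> Result<(), DeleteError> {
    match result {
        // No delete call is required if the dataset's agent is gone.
        Err(DeleteError::Gone) => {
            info!(log, "dataset is gone, skipping delete call";
                "dataset_id" => %dataset_id);
            Ok(())
        }
        other => other,
    }
}
```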

@andrewjstone (Contributor)

Great! Thanks.

@andrewjstone (Contributor) left a comment

Thanks for all the hard work and cleanup @jmpesp!

@jmpesp merged commit b07382f into oxidecomputer:main on May 30, 2024
14 checks passed
@jmpesp deleted the bail_out_of_retry_loop branch on May 30, 2024, 22:05