Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retries #49

Merged
merged 29 commits into from
Dec 4, 2023
Merged

Retries #49

merged 29 commits into from
Dec 4, 2023

Conversation

rucek
Copy link
Contributor

@rucek rucek commented Nov 16, 2023

Retries

Inspired by https://github.com/softwaremill/retry

Rationale

The goal was to have a unified API (a single retry function) that would handle different ways of defining the operation (direct result, Try and Either), using various policies of delaying subsequent attempts (no delay, fixed delay, exponential backoff with an optional jitter), with a possibility to customize the definition of a successful result, and to fail fast on certain errors.

API

The basic syntax for retries is:

retry[T](operation)(policy)

Operation definition

The operation can be provided as one of:

  • a direct by-name parameter, i.e. f: => T
  • a by-name Try[T], i.e. f: => Try[T]
  • a by-name Either[E, T], i.e. f: => Either[E, T]

Policies

A retry policy consists of two parts:

  • a Schedule, which indicates how many times and with what delay should we retry the operation after an initial failure,
  • a ResultPolicy, which indicates whether:
    • a non-erroneous outcome of the operation should be considered a success (if not, the operation would be retried),
    • an erroneous outcome of the operation should be retried or fail fast.

The available schedules are defined in the Schedule object. Each schedule has a finite and an infinite variant.

Finite schedules

Finite schedules have a common maxRetries: Int parameter, which determines how many times the operation would be retried after an initial failure. This means that the operation could be executed at most maxRetries + 1 times.

Infinite schedules

Each finite schedule has an infinite variant, whose settings are similar to those of the respective finite schedule, but without the maxRetries setting. Using the infinite variant can lead to a possibly infinite number of retries (unless the operation starts to succeed again at some point). The infinite schedules are created by calling .forever on the companion object of the respective finite schedule (see examples below).

Schedule types

The supported schedules (specifically - their finite variants) are:

  • Immediate(maxRetries: Int) - retries up to maxRetries times without any delay between subsequent attempts.

  • Delay(maxRetries: Int, delay: FiniteDuration) - retries up to maxRetries times , sleeping for delay between subsequent attempts.

  • Backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter) - retries up to maxRetries times , sleeping for initialDelay before the first retry, increasing the sleep between subsequent attempts exponentially (with base 2) up to an optional maxDelay (default: 1 minute).

    Optionally, a random factor (jitter) can be used when calculating the delay before the next attempt. The purpose of jitter is to avoid clustering of subsequent retries, i.e. to reduce the number of clients calling a service exactly at the same time. See the AWS Architecture Blog article on backoff and jitter for a more in-depth explanation.

    The following jitter strategies are available (defined in the Jitter enum):

    • None - the default one, when no randomness is added, i.e. a pure exponential backoff is used,
    • Full - picks a random value between 0 and the exponential backoff calculated for the current attempt,
    • Equal - similar to Full, but prevents very short delays by always using a half of the original backoff and adding a random value between 0 and the other half,
    • Decorrelated - uses the delay from the previous attempt (lastDelay) and picks a random value between the initalAttempt and 3 * lastDelay.

Result policies

A result policy allows to customize how the results of the operation are treated. It consists of two predicates:

  • isSuccess: T => Boolean (default: true) - determines whether a non-erroneous result of the operation should be considered a success. When it evaluates to true - no further attempts would be made, otherwise - we'd keep retrying.

    With finite schedules (i.e. those with maxRetries defined), if isSuccess keeps returning false when maxRetries are reached, the result is returned as-is, even though it's considered "unsuccessful",

  • isWorthRetrying: E => Boolean (default: true) - determines whether another attempt would be made if the operation results in an error E. When it evaluates to true - we'd keep retrying, otherwise - we'd fail fast with the error.

The ResultPolicy[E, T] is generic both over the error (E) and result (T) type. Note, however, that for the direct and Try variants of the operation, the error type E is fixed to Throwable, while for the Either variant, E can ba an arbitrary type.

API shorthands

When you don't need to customize the result policy (i.e. use the default one), you can use one of the following shorthands to define a retry policy with a given schedule (note that the parameters are the same as when manually creating the respective Schedule):

  • RetryPolicy.immediate(maxRetries: Int),
  • RetryPolicy.immediateForever,
  • RetryPolicy.delay(maxRetries: Int, delay: FiniteDuration),
  • RetryPolicy.delayForever(delay: FiniteDuration),
  • RetryPolicy.backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter),
  • RetryPolicy.backoffForever(initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter).

If you want to customize a part of the result policy, you can use the following shorthands:

  • ResultPolicy.default[E, T] - uses the default settings,
  • ResultPolicy.successfulWhen[E, T](isSuccess: T => Boolean) - uses the default isWorthRetrying and the provided isSuccess,
  • ResultPolicy.retryWhen[E, T](isWorthRetrying: E => Boolean) - uses the default isSuccess and the provided isWorthRetrying,
  • ResultPolicy.neverRetry[E, T] - uses the default isSuccess and fails fast on any error.

Examples

import ox.retry
import scala.concurrent.duration.*

def directOperation: Int = ???
def eitherOperation: Either[String, Int] = ???
def tryOperation: Try[Int] = ???

// various operation definitions - same syntax
retry(directOperation)(RetryPolicy.immediate(3))
retry(eitherOperation)(RetryPolicy.immediate(3))
retry(tryOperation)(RetryPolicy.immediate(3))

// various policies with custom schedules and default ResultPolicy
retry(directOperation)(RetryPolicy.delay(3, 100.millis))
retry(directOperation)(RetryPolicy.backoff(3, 100.millis)) // defaults: maxDelay = 1.minute, jitter = Jitter.None
retry(directOperation)(RetryPolicy.backoff(3, 100.millis, 5.minutes, Jitter.Equal))

// infinite retries with a default ResultPolicy
retry(directOperation)(RetryPolicy.delayForever(100.millis))
retry(directOperation)(RetryPolicy.backoffForever(100.millis, 5.minutes, Jitter.Full))

// result policies
// custom success
retry(directOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.successfulWhen(_ > 0)))
// fail fast on certain errors
retry(directOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.retryWhen(_.getMessage != "fatal error")))
retry(eitherOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.retryWhen(_ != "fatal error"))(3))

See the tests in ox.retry.* for more.

Implementation choices

@tailrec vs while

To make the infinite policies stack-safe, the actual implementation of retry is tail-recursive. This resulted in some code duplication in the implementation, but it still seems more readable and nicer than the alternative variant with a plain-old while loop with a couple of vars for state management.

Possible next steps

  • multi-policies, i.e. different policies for different outcomes (e.g. retry immediately on some errors but backoff on others)
  • composable policies (i.e. fall back to another policy when the original one fails)
  • side-effecting callback to run on each attempt (for logging, metrics etc.)
  • repeat(operation)(Schedule) plus a stop condition

@adamw
Copy link
Member

adamw commented Nov 20, 2023

Maybe you could mention in the PR's description, which features mentioned in #48 are covered, and provide some high-level examples of how the API should be used? The tests provide this somewhat, but it would also be good to know some rationale behind the design :)

core/src/main/scala/ox/retry/retry.scala Outdated Show resolved Hide resolved
core/src/main/scala/ox/retry/retry.scala Outdated Show resolved Hide resolved
core/src/test/scala/ox/ElapsedTime.scala Outdated Show resolved Hide resolved
core/src/test/scala/ox/retry/JitterTest.scala Outdated Show resolved Hide resolved
@rucek rucek requested review from kciesielski and adamw November 28, 2023 07:05
@adamw
Copy link
Member

adamw commented Nov 28, 2023

Great description - can be almost 1-1 be copied as docs to the readme - thanks :)

I haven't read the code yet, but some initial questions:

  1. how are infinite policies created?
  2. what's the intuition behind Direct - is it any different than Delay(0)? Maybe Immediate would be an alternative?
  3. different policies for different outcomes (e.g. retry on some errors but fail fast on others) - isn't that what worthRetrying does?
  4. composable policies (i.e. fall back to another policy when the original one fails) - I guess you can nest retrys? E.g. the inner one could only handle errors where there's a specific exception (isWorthRetrying = _.isInstaceOf[MySpecialException]), and the outer one can then provide the generic mechanism? Though the counter wouldn't be global then, but maybe that's a feature ;)
  5. shouldn't isSuccess, isWorthRetrying be part of the RetryPolicy? These look ... policy-like ;) As in "the retry policy is to retry DBException up to 5 times with a pause of 3 seconds". Otherwise, maybe RetryPolicy is in fact RepetitionPolicy, which could then be used as an argument to a RetryPolicy?
  6. would a repeat(op)(RetryPolicy) make sense as another future improvement?

@rucek
Copy link
Contributor Author

rucek commented Nov 29, 2023

Great description - can be almost 1-1 be copied as docs to the readme - thanks :)

Thanks, the idea was exactly to avoid doing the same work twice :)

1. how are infinite policies created?

By calling .forever on the companion object of the respective finite policy. I was sure this was mentioned in the description, but apparently it wasn't - I updated the docs and examples.

Internally, they are implemented as separate, private case classes marked with the Infinite trait - contrary to the finite ones, which inherit from Finite.

2. what's the intuition behind Direct - is it any different than Delay(0)? Maybe Immediate would be an alternative?

They are conceptually the same, Direct is a shorthand which internally hardcodes a delay of length 0. Other than that Delay(0) works exactly like Direct, since delays of lengh 0 are ignored (to avoid caling Thread.sleep(0)).

Indeed Immediate seems a better name for such a shorthand.

3. different policies for different outcomes (e.g. retry on some errors but fail fast on others) - isn't that what worthRetrying does?

You're right, although it's the example I gave that's incorrect. The idea would be to use completely different policies based on the outcome, e.g. Delay for some type of errors and Backoff for others, rather than just using the fail-fast feature provided by isWorthRetrying.

4. composable policies (i.e. fall back to another policy when the original one fails) - I guess you can nest retrys? E.g. the inner one could only handle errors where there's a specific exception (isWorthRetrying = _.isInstaceOf[MySpecialException]), and the outer one can then provide the generic mechanism? Though the counter wouldn't be global then, but maybe that's a feature ;)

Right, it seems it's achievable with nesting, assuming that a global counter is desired. We can consider whether nesting is enough user-friendly for such use case (or, ideally, get some feedback from real users ;)), and whether the retry limit should be global (or maybe, with a better API, we could let the user choose a global vs. per-policy limit?)

5. shouldn't isSuccess, isWorthRetrying be part of the RetryPolicy? These look ... policy-like ;) As in "the retry policy is to retry DBException up to 5 times with a pause of 3 seconds". Otherwise, maybe RetryPolicy is in fact RepetitionPolicy, which could then be used as an argument to a RetryPolicy?

I think you're right in that the current "retry policies" are actually "retry schedules", since they only cover the timing aspect of retries, while isSuccess/isWorthRetrying are based on the result of the logic under retry. Taking your idea forward, perhaps a RetryPolicy should consist of a RetrySchedule (or just a Schedule so that it's also suitable for repeat) and a ResultPolicy - let me try this approach.

6. would a repeat(op)(RetryPolicy) make sense as another future improvement?

Yes, why not. It seems that a repeat would need the timing aspect (the Schedule above) and a stop condition (which could perhaps be a variant of the above ResultPolicy, since it seems a bit similar to isSuccess, i.e. it would be a T => Boolean)

@adamw
Copy link
Member

adamw commented Nov 29, 2023

By calling .forever on the companion object of the respective finite policy. I was sure this was mentioned in the description, but apparently it wasn't - I updated the docs and examples.

So then we have e.g. Direct(5).forever? This looks a bit weird, as I have to specify the number of repetitions, and then discard it.

You're right, although it's the example I gave that's incorrect. The idea would be to use completely different policies based on the outcome, e.g. Delay for some type of errors and Backoff for others, rather than just using the fail-fast feature provided by isWorthRetrying.

Ah yes. I guess we could have sth like a MultiPolicy which would have different error -> policy mappings. But that's future work :)

Right, it seems it's achievable with nesting, assuming that a global counter is desired. We can consider whether nesting is enough user-friendly for such use case (or, ideally, get some feedback from real users ;)), and whether the retry limit should be global (or maybe, with a better API, we could let the user choose a global vs. per-policy limit?)

I doubt a global counter is what we'd want, but I think that for many scenarios simple nesting would work (with separate counters). But yes, let's wait for user feedback.

Yes, why not. It seems that a repeat would need the timing aspect (the Schedule above) and a stop condition (which could perhaps be a variant of the above ResultPolicy, since it seems a bit similar to isSuccess, i.e. it would be a T => Boolean)

Another point for further work :)

@rucek
Copy link
Contributor Author

rucek commented Nov 29, 2023

So then we have e.g. Direct(5).forever? This looks a bit weird, as I have to specify the number of repetitions, and then discard it.

No, you call .forever on the companion object, not on the instance of the policy. forever has all the settings of the respective finite policy except for maxRetries. In your case this would be Direct.forever (since the only setting of Direct is maxRetries), other examples are:

Delay.forever(100.millis)
Backoff.forever(100.millis, 5.minutes, Jitter.Full)

@adamw
Copy link
Member

adamw commented Nov 29, 2023

No, you call .forever on the companion object, not on the instance of the policy. forever has all the settings of the respective finite policy except for maxRetries. In your case this would be Direct.forever (since the only setting of Direct is maxRetries), other examples are:

Ah ok, good then :)

@rucek
Copy link
Contributor Author

rucek commented Nov 30, 2023

@adamw a couple of updates after our discussion:

  • renamed Direct to Immediate
  • RetryPolicy now consists of a Schedule and a ResultPolicy, so the syntax for retries is now always
    retry(op)(policy)
    This also reduces the number of required retry signatures to 3 (one for each way of definig the operation) - so, overall, the code is much less messy now and the API is simpler and unified :)
  • added API shorthands when only the schedule is customized, or when a subset of the ResultPolicy is customized - see "API shorthands" and examples in the updated PR description
  • added the description (without the rationale, implementation choices and next steps) to doc/retries.md (linked from the main README as well)

@rucek
Copy link
Contributor Author

rucek commented Nov 30, 2023

And, thanks to the simplified API, the DummyImplicits are no longer needed

@adamw
Copy link
Member

adamw commented Nov 30, 2023

Again, just read the docs - looks good :) 🚀

Implementation choices -> maybe ADR?

@rucek rucek marked this pull request as ready for review November 30, 2023 16:15
Copy link
Member

@kciesielski kciesielski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Although the composability of RetryPolicy may be tricky to achieve in the future, I don't think we need to account for it now.

else right

val remainingAttempts = policy.schedule match
case policy: Schedule.Finite => Some(policy.maxRetries)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick case schedule or case finiteSchedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks, got lost in translation refactoring ;)

@rucek rucek merged commit 01e00d2 into master Dec 4, 2023
4 checks passed
@rucek rucek deleted the retries branch December 4, 2023 12:11
@adamw
Copy link
Member

adamw commented Dec 4, 2023

Awesome! Now only a release & blog :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants