Retries #49

rucek · 2023-11-16T15:31:26Z

Retries

Inspired by https://github.com/softwaremill/retry

Rationale

The goal was to have a unified API (a single retry function) that would handle different ways of defining the operation (direct result, Try and Either), using various policies of delaying subsequent attempts (no delay, fixed delay, exponential backoff with an optional jitter), with a possibility to customize the definition of a successful result, and to fail fast on certain errors.

API

The basic syntax for retries is:

retry[T](operation)(policy)

Operation definition

The operation can be provided as one of:

a direct by-name parameter, i.e. f: => T
a by-name Try[T], i.e. f: => Try[T]
a by-name Either[E, T], i.e. f: => Either[E, T]
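
As a rough illustration of how the three operation shapes can feed one retry core, the sketch below normalizes each of them into a single attempt returning an Either. The helper names (attemptDirect, attemptTry, attemptEither) are hypothetical and are not part of this PR; the actual retry function may handle the variants differently.

```scala
import scala.util.Try

// Hypothetical helpers (not the PR's API): each one evaluates the by-name
// operation once and reports the outcome as an Either.

// Direct variant: a thrown (non-fatal) exception becomes a Left.
def attemptDirect[T](f: => T): Either[Throwable, T] = Try(f).toEither

// Try variant: Failure becomes Left, Success becomes Right.
def attemptTry[T](f: => Try[T]): Either[Throwable, T] = f.toEither

// Either variant: already in the target shape, with an arbitrary error type E.
def attemptEither[E, T](f: => Either[E, T]): Either[E, T] = f
```

Note that the error type is Throwable for the first two variants and arbitrary only for the Either variant.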

Policies

A retry policy consists of two parts:

a Schedule, which indicates how many times and with what delay we should retry the operation after an initial failure,
a ResultPolicy, which indicates whether:
- a non-erroneous outcome of the operation should be considered a success (if not, the operation would be retried),
- an erroneous outcome of the operation should be retried or fail fast.

The available schedules are defined in the Schedule object. Each schedule has a finite and an infinite variant.

Finite schedules

Finite schedules have a common maxRetries: Int parameter, which determines how many times the operation would be retried after an initial failure. This means that the operation could be executed at most maxRetries + 1 times.

Infinite schedules

Each finite schedule has an infinite variant, whose settings are similar to those of the respective finite schedule, but without the maxRetries setting. Using the infinite variant can lead to a possibly infinite number of retries (unless the operation starts to succeed again at some point). The infinite schedules are created by calling .forever on the companion object of the respective finite schedule (see examples below).

Schedule types

The supported schedules (specifically - their finite variants) are:

Immediate(maxRetries: Int) - retries up to maxRetries times without any delay between subsequent attempts.
Delay(maxRetries: Int, delay: FiniteDuration) - retries up to maxRetries times, sleeping for delay between subsequent attempts.
Backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter) - retries up to maxRetries times, sleeping for initialDelay before the first retry, increasing the sleep between subsequent attempts exponentially (with base 2) up to an optional maxDelay (default: 1 minute).
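
The delay progression of Backoff (without jitter) can be sketched as below. This is a minimal illustration based only on the description above: backoffDelay is a hypothetical helper, and the assumption that the retry index starts at 0 (so the first retry sleeps for exactly initialDelay) is mine.

```scala
import scala.concurrent.duration.*

// Hypothetical helper (not the PR's code): the delay before the retry with the
// given zero-based index, i.e. initialDelay * 2^retry, capped at maxDelay.
def backoffDelay(
    retry: Int,
    initialDelay: FiniteDuration,
    maxDelay: FiniteDuration = 1.minute
): FiniteDuration =
  // computed as a Double so that large exponents cannot overflow a Long
  val uncapped = initialDelay.toMillis * math.pow(2, retry.toDouble)
  if uncapped >= maxDelay.toMillis then maxDelay else uncapped.toLong.millis

// delays before retries 0 to 4, starting at 100 ms and capped at 1 second
val delays = (0 to 4).map(backoffDelay(_, 100.millis, 1.second))
```

With these settings the delay doubles on each retry until it hits the cap, after which every retry waits for maxDelay.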

Optionally, a random factor (jitter) can be used when calculating the delay before the next attempt. The purpose of jitter is to avoid clustering of subsequent retries, i.e. to reduce the number of clients calling a service exactly at the same time. See the AWS Architecture Blog article on backoff and jitter for a more in-depth explanation.

The following jitter strategies are available (defined in the Jitter enum):
- None - the default one, when no randomness is added, i.e. a pure exponential backoff is used,
- Full - picks a random value between 0 and the exponential backoff calculated for the current attempt,
- Equal - similar to Full, but prevents very short delays by always using a half of the original backoff and adding a random value between 0 and the other half,
- Decorrelated - uses the delay from the previous attempt (lastDelay) and picks a random value between initialDelay and 3 * lastDelay.
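
The four strategies can be sketched as follows. This is an illustration of the descriptions above rather than the PR's implementation: JitterSketch and jitteredDelay are hypothetical names, delays are plain milliseconds, and the lower bound of Decorrelated is assumed to be the initial delay. Capping the result at maxDelay is left out.

```scala
import scala.util.Random

// Hypothetical stand-in for the Jitter enum described above.
enum JitterSketch:
  case None, Full, Equal, Decorrelated

// All delays are in milliseconds. `backoff` is the pure exponential delay for
// the current attempt, `lastDelay` is the delay used before the previous attempt.
def jitteredDelay(
    jitter: JitterSketch,
    backoff: Long,
    initialDelay: Long,
    lastDelay: Long,
    random: Random
): Long =
  jitter match
    // no randomness: pure exponential backoff
    case JitterSketch.None => backoff
    // random value in [0, backoff]
    case JitterSketch.Full => random.between(0L, backoff + 1)
    // half of the backoff plus a random value in [0, half]
    case JitterSketch.Equal =>
      val half = backoff / 2
      half + random.between(0L, half + 1)
    // random value in [initialDelay, 3 * lastDelay]
    case JitterSketch.Decorrelated =>
      val upper = math.max(initialDelay, 3 * lastDelay)
      random.between(initialDelay, upper + 1)
```

Passing the Random instance explicitly keeps the sketch deterministic under a fixed seed, which is convenient for testing.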

Result policies

A result policy allows customizing how the results of the operation are treated. It consists of two predicates:

isSuccess: T => Boolean (default: true) - determines whether a non-erroneous result of the operation should be considered a success. When it evaluates to true - no further attempts would be made, otherwise - we'd keep retrying. With finite schedules (i.e. those with maxRetries defined), if isSuccess keeps returning false when maxRetries is reached, the result is returned as-is, even though it's considered "unsuccessful".
isWorthRetrying: E => Boolean (default: true) - determines whether another attempt would be made if the operation results in an error E. When it evaluates to true - we'd keep retrying, otherwise - we'd fail fast with the error.

The ResultPolicy[E, T] is generic both over the error (E) and result (T) type. Note, however, that for the direct and Try variants of the operation, the error type E is fixed to Throwable, while for the Either variant, E can be an arbitrary type.
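
How the two predicates combine into a single "retry or stop" decision can be sketched like this. ResultPolicySketch and shouldRetry are hypothetical names used for illustration only; the sketch ignores the schedule, so it does not cover the case where maxRetries is exhausted.

```scala
// Hypothetical mirror of the ResultPolicy described above, with both
// predicates defaulting to "always true".
final case class ResultPolicySketch[E, T](
    isSuccess: T => Boolean = (_: T) => true,
    isWorthRetrying: E => Boolean = (_: E) => true
)

// Decides whether another attempt should be made for a single outcome.
def shouldRetry[E, T](policy: ResultPolicySketch[E, T], outcome: Either[E, T]): Boolean =
  outcome match
    // an error is retried only if it is worth retrying, otherwise we fail fast
    case Left(error) => policy.isWorthRetrying(error)
    // a value is retried only if it is not considered a success
    case Right(value) => !policy.isSuccess(value)

val policy = ResultPolicySketch[String, Int](
  isSuccess = _ > 0,
  isWorthRetrying = _ != "fatal error"
)
```

With this policy, Right(0) and Left("timeout") would lead to another attempt, while Right(1) and Left("fatal error") would stop the retries.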

API shorthands

When you don't need to customize the result policy (i.e. use the default one), you can use one of the following shorthands to define a retry policy with a given schedule (note that the parameters are the same as when manually creating the respective Schedule):

RetryPolicy.immediate(maxRetries: Int),
RetryPolicy.immediateForever,
RetryPolicy.delay(maxRetries: Int, delay: FiniteDuration),
RetryPolicy.delayForever(delay: FiniteDuration),
RetryPolicy.backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter),
RetryPolicy.backoffForever(initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter).

If you want to customize a part of the result policy, you can use the following shorthands:

ResultPolicy.default[E, T] - uses the default settings,
ResultPolicy.successfulWhen[E, T](isSuccess: T => Boolean) - uses the default isWorthRetrying and the provided isSuccess,
ResultPolicy.retryWhen[E, T](isWorthRetrying: E => Boolean) - uses the default isSuccess and the provided isWorthRetrying,
ResultPolicy.neverRetry[E, T] - uses the default isSuccess and fails fast on any error.

Examples

import ox.retry
import scala.concurrent.duration.*
import scala.util.Try

def directOperation: Int = ???
def eitherOperation: Either[String, Int] = ???
def tryOperation: Try[Int] = ???

// various operation definitions - same syntax
retry(directOperation)(RetryPolicy.immediate(3))
retry(eitherOperation)(RetryPolicy.immediate(3))
retry(tryOperation)(RetryPolicy.immediate(3))

// various policies with custom schedules and default ResultPolicy
retry(directOperation)(RetryPolicy.delay(3, 100.millis))
retry(directOperation)(RetryPolicy.backoff(3, 100.millis)) // defaults: maxDelay = 1.minute, jitter = Jitter.None
retry(directOperation)(RetryPolicy.backoff(3, 100.millis, 5.minutes, Jitter.Equal))

// infinite retries with a default ResultPolicy
retry(directOperation)(RetryPolicy.delayForever(100.millis))
retry(directOperation)(RetryPolicy.backoffForever(100.millis, 5.minutes, Jitter.Full))

// result policies
// custom success
retry(directOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.successfulWhen(_ > 0)))
// fail fast on certain errors
retry(directOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.retryWhen(_.getMessage != "fatal error")))
retry(eitherOperation)(RetryPolicy(Schedule.Immediate(3), ResultPolicy.retryWhen(_ != "fatal error")))

See the tests in ox.retry.* for more.

Implementation choices

`@tailrec` vs `while`

To make the infinite policies stack-safe, the actual implementation of retry is tail-recursive. This resulted in some code duplication in the implementation, but it still seems more readable and nicer than the alternative variant with a plain-old while loop with a couple of vars for state management.
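
To make the trade-off concrete, here is a simplified side-by-side sketch of the two styles. It is not the code from this PR: retryRec, retryWhile and failingTimes are hypothetical, the error type is fixed to String, delays are omitted, and None stands for an infinite schedule.

```scala
import scala.annotation.tailrec

// Tail-recursive variant: the remaining attempts are threaded through the
// recursive call, so no mutable state is needed and the stack does not grow.
def retryRec[T](maxRetries: Option[Int])(op: () => Either[String, T]): Either[String, T] =
  @tailrec
  def loop(remaining: Option[Int]): Either[String, T] =
    op() match
      case Left(_) if remaining.forall(_ > 0) => loop(remaining.map(_ - 1))
      case result                             => result
  loop(maxRetries)

// Equivalent while-based variant: same behavior, but with two vars for state.
def retryWhile[T](maxRetries: Option[Int])(op: () => Either[String, T]): Either[String, T] =
  var remaining = maxRetries
  var result = op()
  while result.isLeft && remaining.forall(_ > 0) do
    remaining = remaining.map(_ - 1)
    result = op()
  result

// Test helper: an operation that fails n times and then returns the call count.
def failingTimes(n: Int): () => Either[String, Int] =
  var calls = 0
  () => { calls += 1; if calls <= n then Left(s"failure $calls") else Right(calls) }
```

Both variants run in constant stack space, so the choice is mostly about readability, as noted above.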

Possible next steps

multi-policies, i.e. different policies for different outcomes (e.g. retry immediately on some errors but backoff on others)
composable policies (i.e. fall back to another policy when the original one fails)
side-effecting callback to run on each attempt (for logging, metrics etc.)
repeat(operation)(Schedule) plus a stop condition

adamw · 2023-11-20T13:22:52Z

Maybe you could mention in the PR's description, which features mentioned in #48 are covered, and provide some high-level examples of how the API should be used? The tests provide this somewhat, but it would also be good to know some rationale behind the design :)


adamw · 2023-11-28T15:24:32Z

Great description - it can almost be copied 1-1 as docs to the readme - thanks :)

I haven't read the code yet, but some initial questions:

how are infinite policies created?
what's the intuition behind Direct - is it any different than Delay(0)? Maybe Immediate would be an alternative?
different policies for different outcomes (e.g. retry on some errors but fail fast on others) - isn't that what worthRetrying does?
composable policies (i.e. fall back to another policy when the original one fails) - I guess you can nest retries? E.g. the inner one could only handle errors where there's a specific exception (isWorthRetrying = _.isInstanceOf[MySpecialException]), and the outer one can then provide the generic mechanism? Though the counter wouldn't be global then, but maybe that's a feature ;)
shouldn't isSuccess, isWorthRetrying be part of the RetryPolicy? These look ... policy-like ;) As in "the retry policy is to retry DBException up to 5 times with a pause of 3 seconds". Otherwise, maybe RetryPolicy is in fact RepetitionPolicy, which could then be used as an argument to a RetryPolicy?
would a repeat(op)(RetryPolicy) make sense as another future improvement?

rucek · 2023-11-29T12:17:58Z

Great description - it can almost be copied 1-1 as docs to the readme - thanks :)

Thanks, the idea was exactly to avoid doing the same work twice :)

1. how are infinite policies created?

By calling .forever on the companion object of the respective finite policy. I was sure this was mentioned in the description, but apparently it wasn't - I updated the docs and examples.

Internally, they are implemented as separate, private case classes marked with the Infinite trait - contrary to the finite ones, which inherit from Finite.
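
A stripped-down sketch of that layout, limited to Delay, could look as follows. The names ScheduleSketch, DelayForever and remainingAttempts are hypothetical and only illustrate the idea; the real Schedule definitions in the PR may differ in detail.

```scala
import scala.concurrent.duration.*

sealed trait ScheduleSketch

object ScheduleSketch:
  // finite schedules expose maxRetries, infinite ones have no such setting
  sealed trait Finite extends ScheduleSketch:
    def maxRetries: Int
  sealed trait Infinite extends ScheduleSketch

  final case class Delay(maxRetries: Int, delay: FiniteDuration) extends Finite

  object Delay:
    // called on the companion object, so no maxRetries has to be provided
    def forever(delay: FiniteDuration): Infinite = DelayForever(delay)

  // the infinite variant is a separate, private case class
  private final case class DelayForever(delay: FiniteDuration) extends Infinite

// None means "retry without a limit".
def remainingAttempts(schedule: ScheduleSketch): Option[Int] =
  schedule match
    case finite: ScheduleSketch.Finite => Some(finite.maxRetries)
    case _: ScheduleSketch.Infinite    => None
```

With this shape, ScheduleSketch.Delay(3, 100.millis) yields Some(3) remaining attempts, while ScheduleSketch.Delay.forever(100.millis) yields None.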

2. what's the intuition behind Direct - is it any different than Delay(0)? Maybe Immediate would be an alternative?

They are conceptually the same, Direct is a shorthand which internally hardcodes a delay of length 0. Other than that Delay(0) works exactly like Direct, since delays of length 0 are ignored (to avoid calling Thread.sleep(0)).

Indeed Immediate seems a better name for such a shorthand.

3. different policies for different outcomes (e.g. retry on some errors but fail fast on others) - isn't that what worthRetrying does?

You're right, although it's the example I gave that's incorrect. The idea would be to use completely different policies based on the outcome, e.g. Delay for some type of errors and Backoff for others, rather than just using the fail-fast feature provided by isWorthRetrying.

4. composable policies (i.e. fall back to another policy when the original one fails) - I guess you can nest retries? E.g. the inner one could only handle errors where there's a specific exception (isWorthRetrying = _.isInstanceOf[MySpecialException]), and the outer one can then provide the generic mechanism? Though the counter wouldn't be global then, but maybe that's a feature ;)

Right, it seems it's achievable with nesting, assuming that a global counter is desired. We can consider whether nesting is user-friendly enough for such a use case (or, ideally, get some feedback from real users ;)), and whether the retry limit should be global (or maybe, with a better API, we could let the user choose a global vs. per-policy limit?)

5. shouldn't isSuccess, isWorthRetrying be part of the RetryPolicy? These look ... policy-like ;) As in "the retry policy is to retry DBException up to 5 times with a pause of 3 seconds". Otherwise, maybe RetryPolicy is in fact RepetitionPolicy, which could then be used as an argument to a RetryPolicy?

I think you're right in that the current "retry policies" are actually "retry schedules", since they only cover the timing aspect of retries, while isSuccess/isWorthRetrying are based on the result of the logic under retry. Taking your idea forward, perhaps a RetryPolicy should consist of a RetrySchedule (or just a Schedule so that it's also suitable for repeat) and a ResultPolicy - let me try this approach.

6. would a repeat(op)(RetryPolicy) make sense as another future improvement?

Yes, why not. It seems that a repeat would need the timing aspect (the Schedule above) and a stop condition (which could perhaps be a variant of the above ResultPolicy, since it seems a bit similar to isSuccess, i.e. it would be a T => Boolean)

adamw · 2023-11-29T13:59:49Z

By calling .forever on the companion object of the respective finite policy. I was sure this was mentioned in the description, but apparently it wasn't - I updated the docs and examples.

So then we have e.g. Direct(5).forever? This looks a bit weird, as I have to specify the number of repetitions, and then discard it.

You're right, although it's the example I gave that's incorrect. The idea would be to use completely different policies based on the outcome, e.g. Delay for some type of errors and Backoff for others, rather than just using the fail-fast feature provided by isWorthRetrying.

Ah yes. I guess we could have sth like a MultiPolicy which would have different error -> policy mappings. But that's future work :)

Right, it seems it's achievable with nesting, assuming that a global counter is desired. We can consider whether nesting is user-friendly enough for such a use case (or, ideally, get some feedback from real users ;)), and whether the retry limit should be global (or maybe, with a better API, we could let the user choose a global vs. per-policy limit?)

I doubt a global counter is what we'd want, but I think that for many scenarios simple nesting would work (with separate counters). But yes, let's wait for user feedback.
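
For reference, nesting with separate counters might look roughly like the sketch below. It uses a hypothetical, self-contained retrySketch function rather than the retry API from this PR, with String errors standing in for exception types: the inner call only retries the "special" error, and the outer call retries anything.

```scala
import scala.annotation.tailrec

// Hypothetical minimal retry (not the PR's implementation): immediate retries,
// up to maxRetries times, only for errors accepted by isWorthRetrying.
def retrySketch[E, T](maxRetries: Int, isWorthRetrying: E => Boolean)(op: => Either[E, T]): Either[E, T] =
  @tailrec
  def loop(remaining: Int): Either[E, T] =
    op match
      case Left(error) if remaining > 0 && isWorthRetrying(error) => loop(remaining - 1)
      case result                                                 => result
  loop(maxRetries)

var calls = 0

// Fails with "special" on most calls, with "generic" on the 3rd call,
// and succeeds on the 6th call.
def flaky(): Either[String, Int] =
  calls += 1
  if calls % 3 != 0 then Left("special")
  else if calls < 6 then Left("generic")
  else Right(calls)

// The outer retry handles any error, the inner one only the "special" error.
// Each evaluation of the inner retry starts with a fresh counter.
val nested = retrySketch[String, Int](maxRetries = 2, isWorthRetrying = _ => true) {
  retrySketch[String, Int](maxRetries = 5, isWorthRetrying = _ == "special")(flaky())
}
```

Here the inner retry gives up on the "generic" error, the outer one retries the whole inner block, and the second inner run succeeds on the sixth call overall.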

Yes, why not. It seems that a repeat would need the timing aspect (the Schedule above) and a stop condition (which could perhaps be a variant of the above ResultPolicy, since it seems a bit similar to isSuccess, i.e. it would be a T => Boolean)

Another point for further work :)

rucek · 2023-11-29T14:08:07Z

So then we have e.g. Direct(5).forever? This looks a bit weird, as I have to specify the number of repetitions, and then discard it.

No, you call .forever on the companion object, not on the instance of the policy. forever has all the settings of the respective finite policy except for maxRetries. In your case this would be Direct.forever (since the only setting of Direct is maxRetries), other examples are:

Delay.forever(100.millis)
Backoff.forever(100.millis, 5.minutes, Jitter.Full)

adamw · 2023-11-29T14:17:43Z

No, you call .forever on the companion object, not on the instance of the policy. forever has all the settings of the respective finite policy except for maxRetries. In your case this would be Direct.forever (since the only setting of Direct is maxRetries), other examples are:

Ah ok, good then :)

rucek · 2023-11-30T12:35:39Z

@adamw a couple of updates after our discussion:

renamed Direct to Immediate
RetryPolicy now consists of a Schedule and a ResultPolicy, so the syntax for retries is now always
```
retry(op)(policy)
```
This also reduces the number of required retry signatures to 3 (one for each way of defining the operation) - so, overall, the code is much less messy now and the API is simpler and unified :)
added API shorthands when only the schedule is customized, or when a subset of the ResultPolicy is customized - see "API shorthands" and examples in the updated PR description
added the description (without the rationale, implementation choices and next steps) to doc/retries.md (linked from the main README as well)

rucek · 2023-11-30T13:58:18Z

And, thanks to the simplified API, the DummyImplicits are no longer needed

adamw · 2023-11-30T14:11:44Z

Again, just read the docs - looks good :) 🚀

Implementation choices -> maybe ADR?

kciesielski

LGTM! Although the composability of RetryPolicy may be tricky to achieve in the future, I don't think we need to account for it now.

kciesielski · 2023-12-04T08:35:15Z

core/src/main/scala/ox/retry/retry.scala

+        else right
+
+  val remainingAttempts = policy.schedule match
+    case policy: Schedule.Finite => Some(policy.maxRetries)


nitpick: case schedule or case finiteSchedule.

Good catch, thanks, got lost in ~~translation~~ refactoring ;)

adamw · 2023-12-04T15:42:26Z

Awesome! Now only a release & blog :)

rucek added 5 commits November 16, 2023 15:36

Extract ElapsedTime so that it can be reused

1fc3c1d

Remove old retries

3cc1eba

Basic retry for function/Try/Either

cb5792e

Delays and simple backoff

0469fda

Initial isWorthRetrying for Either

d61e292

rucek added 2 commits November 21, 2023 16:13

Add upper bound for delay when using backoff

4c267d5

Add jitter

5b430e7

kciesielski reviewed Nov 22, 2023

rucek added 4 commits November 23, 2023 09:03

Remove unnecessary toList

52bd207

Fail on unexpected match

51fe027

Use System.nanoTime instead of System.currentTimeMillis for measuring elapsed time

c364df0

Fix formatting

57f0b43

lmlynik reviewed Nov 25, 2023

rucek added 5 commits November 27, 2023 13:23

Add retries with unlimited attempts, make the retry logic tail-recursive

e7c5d7d

Use enum for jitter

41fedb0

Add isWorthRetrying for functions and Try

17cde01

Move RetryPolicy to a separate file

98b9d17

Reduce maximum backoff delay to 1 minute

2991dac

rucek requested review from kciesielski and adamw November 28, 2023 07:05

rucek added 4 commits November 30, 2023 10:31

Refactor RetryPolicy to include a schedule and a result policy

24925a5

Rename Direct to Immediate

57716f9

Update test name

67566e1

Add custom conditions to fail-fast tests

77091ad

Add docs for retries

d11454f

rucek added 3 commits November 30, 2023 13:51

Add retries to main README

0caa72b

Update docs on ResultPolicy error type

01926c6

Remove DummyImplicit's as they are no longer needed

c148380

rucek added 2 commits November 30, 2023 15:25

Fix typo in retries readme

ebc6cdf

Scaladocs and cleanup

d671f21

rucek marked this pull request as ready for review November 30, 2023 16:15

Add ADR for retries

b2cf1ee

kciesielski approved these changes Dec 4, 2023

rucek added 2 commits December 4, 2023 09:52

Fix naming

62e40ef

Add syntax sugar for retries

5510a98

rucek merged commit 01e00d2 into master Dec 4, 2023
4 checks passed

rucek deleted the retries branch December 4, 2023 12:11
