Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect state possible after retrying ServiceDiscoverer events #3006

Merged
merged 5 commits into from
Jul 18, 2024

Conversation

idelpivnitskiy
Copy link
Member

Motivation:

Clients have a configurable serviceDiscovererRetryStrategy to guarantee a steady stream of events to the LoadBalancer that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from ServiceDiscoverer. Also, for PartitionedHttpClient it's necessary to guarantee that GroupedPublisher never fails.

Retry is effectively a re-subscribe. According to ServiceDiscoverer contract (clarified in #3002), each Subscriber receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries. As a result, unavailable addresses can remain inside the LoadBalancer forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry ServiceDiscoverer errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world".

We previously had this protection but it was mistakenly removed in #1949 as part of a broader refactoring around ServiceDiscoverer <-> LoadBalancer contract.

Modifications:

  • Add RetryingServiceDiscoverer that handles retries and keeps the state between retries.
  • Use it in DefaultSingleAddressHttpClientBuilder and DefaultPartitionedHttpClientBuilder.
  • Use CastedServiceDiscoverer to allow modifications for ServiceDiscovererEvent after we started to use a wildcard type in Expand allowed types to accommodate custom service discoverer events #2379.
  • Pass consistent targetResource identifier to both RetryingServiceDiscoverer and LoadBalancerFactory to allow state correlation when inspecting heap dump.

Result:

Client keeps pushing deltas to LoadBalancer after retrying ServiceDiscoverer errors, keeping its state consistent with ServiceDiscoverer.

Motivation:

Clients have a configurable `serviceDiscovererRetryStrategy` to
guarantee a steady stream of events to the `LoadBalancer` that never
fails. It's necessary at the client level to avoid hanging requests
indefinitely and let requests observe failures from ServiceDiscoverer.
Also, for `PartitionedHttpClient` it's necessary to guarantee that
`GroupedPublisher` never fails.

Retry is effectively a re-subscribe. According to `ServiceDiscoverer`
contract (clarified in apple#3002), each `Subscriber` receives a "state of
the world" as the first collection of events. The problem is that the
state may change significantly between retries, as a result unavailable
addresses can remain inside the `LoadBalancer` forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry `ServiceDiscoverer` errors, we should keep pushing deltas
downstream or purge events that are not present in the new "state of the
world".

We previously had this protection but it was mistakenly removed in apple#1949
as part of a broader refactoring around `ServiceDiscoverer` <->
`LoadBalancer` contract.

Modifications:

- Add `RetryingServiceDiscoverer` that handles retries and keeps the
state between retries.
- Use it in `DefaultSingleAddressHttpClientBuilder` and
`DefaultPartitionedHttpClientBuilder`.
- Use `CastedServiceDiscoverer` to allow modifications for
`ServiceDiscovererEvent` after we started to use a wildcard type in
apple#2379.
- Pass consistent `targetResource` identifier to both
`RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state
correlation when inspecting heap dump.

Result:

Client keeps pushing deltas to `LoadBalancer` after retrying
`ServiceDiscoverer` errors, keeping its state consistent with
`ServiceDiscoverer`.
@idelpivnitskiy idelpivnitskiy self-assigned this Jul 12, 2024
@idelpivnitskiy
Copy link
Member Author

Marking it as "draft" because I still need to implement tests for RetryingServiceDiscoverer, but want to get earlier feedback for the main part before I work on tests.

new ServiceDiscovererEventsCache<>(targetResource, makeUnavailable);
return delegate().discover(address)
.map(eventsCache::consumeAndFilter)
.beforeOnError(eventsCache::errorSeen)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative approach could be to use scanWithMapper or liftSync instead of (defer + map + beforeOnError). I picked this one for simplicity bcz internals of scanWithMapper look a bit complicated for this task and liftSync is too low level API.

if (UNAVAILABLE.equals(event.status())) {
currentState.remove(event.address());
} else {
currentState.put(event.address(), event);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that even though this will retain only the last event for the address, it should not matter because the retainedState will be used only to propagate UNAVAILABLE state that will simply remove it from LB. Any other associated meta-data (if any) doesn't matter for UNAVAILABLE state.

Copy link
Contributor

@bryce-anderson bryce-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only got partially through this and will continue in another session.

@idelpivnitskiy idelpivnitskiy marked this pull request as ready for review July 13, 2024 01:49
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

class RetryingServiceDiscovererTest {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryce-anderson bryce-anderson self-requested a review July 15, 2024 15:45
Comment on lines +827 to +828
// Because of the change in https://github.com/apple/servicetalk/pull/2379, we should constrain the type back to
// ServiceDiscovererEvent without "? extends" to allow RetryingServiceDiscoverer to mark events as UNAVAILABLE.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can back out these changes? I think it's technically a breaking change but afaict we could never practically make use of them because there isn't a type parameter on the ClientBuilder that lets us bridge the type between the ServiceDiscoverer and the LB.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about bridging the type between SD and LB, it's about a possibility to pass a custom SD with custom event subtype to HttpClients. Without this, it won't be possible to use those implementations, they will have to change to use ServiceDiscovererEvent instead.

import static java.time.Duration.ofSeconds;
import static java.util.Collections.emptyMap;

final class RetryingServiceDiscoverer<U, R, E extends ServiceDiscovererEvent<R>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this more broadly accessible? and if-so, consider adding shareContextOnSub inside the defer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I don't have a use-case in mind where we need to share this "filter".
for now, I will keep it without shareContextOnSubscribe() because subscribe can be driven from any request and I don't think the background discovery context needs to be shared with that.

Copy link
Contributor

@tkountis tkountis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

It should re-subscribe to the same publisher instead of calling
`discover` one more time
@idelpivnitskiy idelpivnitskiy merged commit b58d0b5 into apple:main Jul 18, 2024
11 checks passed
@idelpivnitskiy idelpivnitskiy deleted the RetryingServiceDiscoverer branch July 18, 2024 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants