Incorrect state possible after retrying `ServiceDiscoverer` events #3006

idelpivnitskiy · 2024-07-12T17:43:56Z

Motivation:

Clients have a configurable serviceDiscovererRetryStrategy to guarantee a steady stream of events to the LoadBalancer that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from ServiceDiscoverer. Also, for PartitionedHttpClient it's necessary to guarantee that GroupedPublisher never fails.

Retry is effectively a re-subscribe. According to the ServiceDiscoverer contract (clarified in #3002), each Subscriber receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries. As a result, unavailable addresses can remain inside the LoadBalancer forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry ServiceDiscoverer errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world".
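
The "purge" option can be sketched in plain Java. This is only an illustration of the idea under simplified assumptions, not the code in this PR: `RetryStateSketch`, `Event`, `Status`, and `reconcile` are hypothetical names, addresses are plain strings, and the sketch forwards the new "state of the world" as-is without any de-duplication.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are NOT ServiceTalk types.
final class RetryStateSketch {
    enum Status { AVAILABLE, UNAVAILABLE }

    static final class Event {
        final String address;
        final Status status;

        Event(String address, Status status) {
            this.address = address;
            this.status = status;
        }

        @Override
        public String toString() {
            return address + ":" + status;
        }
    }

    /**
     * Takes the addresses retained from before the error and the first batch received after
     * re-subscribing. Returns that batch followed by an UNAVAILABLE event for every retained
     * address that the new "state of the world" no longer mentions.
     */
    static List<Event> reconcile(Map<String, Event> retained, List<Event> stateOfTheWorld) {
        Map<String, Event> missing = new LinkedHashMap<>(retained);
        List<Event> out = new ArrayList<>(stateOfTheWorld.size());
        for (Event event : stateOfTheWorld) {
            missing.remove(event.address); // still present upstream, nothing to purge
            out.add(event);
        }
        for (Event stale : missing.values()) {
            out.add(new Event(stale.address, Status.UNAVAILABLE)); // purge from downstream
        }
        return out;
    }

    public static void main(String[] args) {
        // Before the error the downstream saw [a, b]; after the retry upstream reports only [b].
        Map<String, Event> retained = new LinkedHashMap<>();
        retained.put("a", new Event("a", Status.AVAILABLE));
        retained.put("b", new Event("b", Status.AVAILABLE));
        List<Event> afterRetry = Collections.singletonList(new Event("b", Status.AVAILABLE));
        System.out.println(reconcile(retained, afterRetry)); // prints [b:AVAILABLE, a:UNAVAILABLE]
    }
}
```

With the example timeline above, the LB would receive "b" plus an explicit UNAVAILABLE event for "a" at T3, so it no longer holds the revoked address.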

We previously had this protection but it was mistakenly removed in #1949 as part of a broader refactoring around ServiceDiscoverer <-> LoadBalancer contract.

Modifications:

Add RetryingServiceDiscoverer that handles retries and keeps the state between retries.
Use it in DefaultSingleAddressHttpClientBuilder and DefaultPartitionedHttpClientBuilder.
Use CastedServiceDiscoverer to allow modifications for ServiceDiscovererEvent after we started to use a wildcard type in #2379 (Expand allowed types to accommodate custom service discoverer events).
Pass consistent targetResource identifier to both RetryingServiceDiscoverer and LoadBalancerFactory to allow state correlation when inspecting heap dump.

Result:

Client keeps pushing deltas to LoadBalancer after retrying ServiceDiscoverer errors, keeping its state consistent with ServiceDiscoverer.

idelpivnitskiy · 2024-07-12T17:45:04Z

Marking it as "draft" because I still need to implement tests for RetryingServiceDiscoverer, but want to get earlier feedback for the main part before I work on tests.

idelpivnitskiy · 2024-07-12T22:09:43Z

servicetalk-http-netty/src/main/java/io/servicetalk/http/netty/RetryingServiceDiscoverer.java

+                    new ServiceDiscovererEventsCache<>(targetResource, makeUnavailable);
+            return delegate().discover(address)
+                    .map(eventsCache::consumeAndFilter)
+                    .beforeOnError(eventsCache::errorSeen)


An alternative approach would be to use scanWithMapper or liftSync instead of (defer + map + beforeOnError). I picked this one for simplicity because the internals of scanWithMapper look a bit complicated for this task and liftSync is too low-level an API.

idelpivnitskiy · 2024-07-12T22:12:25Z

servicetalk-http-netty/src/main/java/io/servicetalk/http/netty/RetryingServiceDiscoverer.java

+                    if (UNAVAILABLE.equals(event.status())) {
+                        currentState.remove(event.address());
+                    } else {
+                        currentState.put(event.address(), event);


Note that even though this will retain only the last event for the address, it should not matter because the retainedState will be used only to propagate UNAVAILABLE state that will simply remove it from LB. Any other associated meta-data (if any) doesn't matter for UNAVAILABLE state.
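
A small standalone sketch of that retention rule, for readers who want to see it in isolation. It is not the PR's implementation: `RetainedStateSketch`, its `Event` class, and the `metadata` field are hypothetical, and statuses are modeled as plain strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; these are NOT ServiceTalk types.
final class RetainedStateSketch {
    static final class Event {
        final String address;
        final String status;   // "AVAILABLE" or "UNAVAILABLE" in this sketch
        final String metadata; // stands in for any extra data a custom event may carry

        Event(String address, String status, String metadata) {
            this.address = address;
            this.status = status;
            this.metadata = metadata;
        }
    }

    private final Map<String, Event> currentState = new HashMap<>();

    // Mirrors the rule from the diff: UNAVAILABLE removes the address, anything else overwrites it.
    void onEvent(Event event) {
        if ("UNAVAILABLE".equals(event.status)) {
            currentState.remove(event.address);
        } else {
            currentState.put(event.address, event);
        }
    }

    // Returns a copy so callers cannot mutate the tracked state.
    Map<String, Event> snapshot() {
        return new HashMap<>(currentState);
    }

    public static void main(String[] args) {
        RetainedStateSketch state = new RetainedStateSketch();
        state.onEvent(new Event("a", "AVAILABLE", "v1"));
        state.onEvent(new Event("a", "AVAILABLE", "v2")); // replaces the "v1" event for "a"
        state.onEvent(new Event("b", "AVAILABLE", "v1"));
        state.onEvent(new Event("b", "UNAVAILABLE", "v1")); // "b" is no longer tracked
        System.out.println(state.snapshot().keySet()); // prints [a]
    }
}
```

Only the address of a retained entry is needed later to build an UNAVAILABLE event, which is why keeping just the last event per address is enough.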

bryce-anderson

I only got partially through this and will continue in another session.

idelpivnitskiy · 2024-07-13T01:50:08Z

...cetalk-http-netty/src/test/java/io/servicetalk/http/netty/RetryingServiceDiscovererTest.java

+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+class RetryingServiceDiscovererTest {


Recovered all tests from https://github.com/apple/servicetalk/pull/1949/files#diff-17f80381993b8c2b74a8b78674c065228c8ef16a9320fe23a11abffdfe629eb9 and enhanced those to cover more edge cases

bryce-anderson · 2024-07-15T20:34:35Z

...ttp-netty/src/main/java/io/servicetalk/http/netty/DefaultSingleAddressHttpClientBuilder.java

+    // Because of the change in https://github.com/apple/servicetalk/pull/2379, we should constrain the type back to
+    // ServiceDiscovererEvent without "? extends" to allow RetryingServiceDiscoverer to mark events as UNAVAILABLE.


Is there a way we can back out these changes? I think it's technically a breaking change but afaict we could never practically make use of them because there isn't a type parameter on the ClientBuilder that lets us bridge the type between the ServiceDiscoverer and the LB.

It's not about bridging the type between SD and LB; it's about the possibility of passing a custom SD with a custom event subtype to HttpClients. Without this, those implementations could not be used as they are; they would have to change to use ServiceDiscovererEvent instead.
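
For context on why the wildcard matters here, a generic Java sketch follows. It only demonstrates the language rule and makes no claim about how CastedServiceDiscoverer is written: `WildcardSketch`, `Event`, `BasicEvent`, `CustomEvent`, and `withUnavailable` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model; these are NOT ServiceTalk types.
final class WildcardSketch {
    interface Event {
        String address();
        boolean available();
    }

    static final class BasicEvent implements Event {
        private final String address;
        private final boolean available;

        BasicEvent(String address, boolean available) {
            this.address = address;
            this.available = available;
        }

        @Override public String address() { return address; }
        @Override public boolean available() { return available; }
    }

    // Stands in for a user-defined event subtype that carries extra data.
    static final class CustomEvent implements Event {
        private final String address;
        final int weight;

        CustomEvent(String address, int weight) {
            this.address = address;
            this.weight = weight;
        }

        @Override public String address() { return address; }
        @Override public boolean available() { return true; }
    }

    static Collection<Event> withUnavailable(Collection<? extends Event> batch, String staleAddress) {
        // batch.add(new BasicEvent(staleAddress, false));
        // The line above does not compile: the element type is an unknown subtype of Event,
        // so the compiler cannot prove a BasicEvent is allowed in the collection.
        List<Event> widened = new ArrayList<>(batch); // copy into the base element type
        widened.add(new BasicEvent(staleAddress, false)); // now a synthesized event can be added
        return widened;
    }

    public static void main(String[] args) {
        Collection<CustomEvent> batch = Collections.singletonList(new CustomEvent("b", 10));
        Collection<Event> result = withUnavailable(batch, "a");
        System.out.println(result.size()); // prints 2
    }
}
```

The point of the sketch is that a wrapper which has to synthesize its own events needs to work against the base event type, while callers can still supply collections of a custom subtype.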

tkountis · 2024-07-18T16:46:27Z

servicetalk-http-netty/src/main/java/io/servicetalk/http/netty/RetryingServiceDiscoverer.java

+import static java.time.Duration.ofSeconds;
+import static java.util.Collections.emptyMap;
+
+final class RetryingServiceDiscoverer<U, R, E extends ServiceDiscovererEvent<R>>


Should this be more broadly accessible? If so, consider adding shareContextOnSubscribe inside the defer.

No, I don't have a use case in mind where we need to share this "filter".
For now, I will keep it without shareContextOnSubscribe() because subscribe can be driven from any request and I don't think the background discovery context needs to be shared with that.

tkountis

lgtm

idelpivnitskiy self-assigned this Jul 12, 2024

idelpivnitskiy requested review from bryce-anderson and tkountis July 12, 2024 18:00

bryce-anderson reviewed Jul 13, 2024

idelpivnitskiy added 2 commits July 12, 2024 18:48

RetryingServiceDiscoverer: ctor arg fix and comment clarification

0452768

Add RetryingServiceDiscovererTest

883ce99

idelpivnitskiy marked this pull request as ready for review July 13, 2024 01:49

bryce-anderson self-requested a review July 15, 2024 15:45

add warning if receive UNAVAILABLE events after re-subscribe

a9c3eb6

bryce-anderson approved these changes Jul 15, 2024

tkountis reviewed Jul 18, 2024

tkountis approved these changes Jul 18, 2024

Fix noUnavailableEventsAfterCancel() test

99ebe67

It should re-subscribe to the same publisher instead of calling `discover` one more time

idelpivnitskiy merged commit b58d0b5 into apple:main Jul 18, 2024
11 checks passed

idelpivnitskiy deleted the RetryingServiceDiscoverer branch July 18, 2024 21:32
