Incorrect state possible after retrying ServiceDiscoverer events #3006
Motivation:

Clients have a configurable `serviceDiscovererRetryStrategy` to guarantee a steady stream of events to the `LoadBalancer` that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from `ServiceDiscoverer`. Also, for `PartitionedHttpClient` it's necessary to guarantee that `GroupedPublisher` never fails.

Retry is effectively a re-subscribe. According to the `ServiceDiscoverer` contract (clarified in apple#3002), each `Subscriber` receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries. As a result, unavailable addresses can remain inside the `LoadBalancer` forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry `ServiceDiscoverer` errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world".

We previously had this protection but it was mistakenly removed in apple#1949 as part of a broader refactoring around the `ServiceDiscoverer` <-> `LoadBalancer` contract.

Modifications:

- Add `RetryingServiceDiscoverer` that handles retries and keeps the state between retries.
- Use it in `DefaultSingleAddressHttpClientBuilder` and `DefaultPartitionedHttpClientBuilder`.
- Use `CastedServiceDiscoverer` to allow modifications for `ServiceDiscovererEvent` after we started to use a wildcard type in apple#2379.
- Pass a consistent `targetResource` identifier to both `RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state correlation when inspecting a heap dump.

Result:

Client keeps pushing deltas to `LoadBalancer` after retrying `ServiceDiscoverer` errors, keeping its state consistent with `ServiceDiscoverer`.
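The T1..T3 timeline from the description can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: `ToyLoadBalancer`, `staleAfterRetry`, and `simulate` are hypothetical names, addresses are plain strings, and nothing here is the ServiceTalk API or the implementation in this PR. It shows what "purge events that are not present in the new state of the world" amounts to: a set difference between the state retained before the error and the first batch received after the retry.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of the T1..T3 timeline; none of these names are ServiceTalk APIs.
final class RetryTimelineSketch {

    // Toy load balancer: it keeps every address it was given until told to remove it.
    static final class ToyLoadBalancer {
        final Set<String> hosts = new TreeSet<>();

        void available(Collection<String> addresses) {
            hosts.addAll(addresses);
        }

        void unavailable(Collection<String> addresses) {
            hosts.removeAll(addresses);
        }
    }

    // Addresses known before the error that are absent from the new "state of the world".
    static Set<String> staleAfterRetry(Set<String> beforeError, Collection<String> newStateOfTheWorld) {
        Set<String> stale = new TreeSet<>(beforeError);
        stale.removeAll(newStateOfTheWorld);
        return stale;
    }

    // Replays the timeline and returns what the toy load balancer holds at the end.
    static Set<String> simulate(boolean purgeStale) {
        ToyLoadBalancer lb = new ToyLoadBalancer();
        List<String> first = Arrays.asList("a", "b");    // T1: SD delivers [a,b]
        lb.available(first);                             // T1: LB receives [a,b]
        Set<String> beforeError = new TreeSet<>(first);  // T1: SD delivers error, state is retained
        List<String> afterRetry = Arrays.asList("b");    // T2-T3: "a" got revoked, retry delivers [b]
        if (purgeStale) {
            lb.unavailable(staleAfterRetry(beforeError, afterRetry));
        }
        lb.available(afterRetry);                        // T3: LB receives [b]
        return lb.hosts;
    }

    public static void main(String[] args) {
        System.out.println("without purge: " + simulate(false));
        System.out.println("with purge:    " + simulate(true));
    }
}
```

Without the purge step the toy load balancer ends up holding both `a` and `b`; with it, only `b` remains, which matches the new state of the world.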
Marking it as "draft" because I still need to implement tests for |
        new ServiceDiscovererEventsCache<>(targetResource, makeUnavailable);
return delegate().discover(address)
        .map(eventsCache::consumeAndFilter)
        .beforeOnError(eventsCache::errorSeen)
An alternative approach could be to use `scanWithMapper` or `liftSync` instead of (`defer` + `map` + `beforeOnError`). I picked this one for simplicity because the internals of `scanWithMapper` look a bit complicated for this task and `liftSync` is too low-level an API.
if (UNAVAILABLE.equals(event.status())) {
    currentState.remove(event.address());
} else {
    currentState.put(event.address(), event);
Note that even though this will retain only the last event for the `address`, it should not matter because the `retainedState` will be used only to propagate the `UNAVAILABLE` state, which simply removes the address from the LB. Any other associated metadata (if any) doesn't matter for the `UNAVAILABLE` state.
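To make the "last event wins" point concrete, here is a minimal sketch. It assumes simplified, hypothetical types (`Event`, `Status`, a `metadata` string, and the `unavailableFor` helper are illustrative, not the real `ServiceDiscovererEvent` API) and reuses only the update rule from the diff snippet under review. It shows that overwriting earlier events for the same address loses nothing that a removal needs, because the removal event is built from the address alone.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplified types; the real ServiceDiscovererEvent API is not reproduced here.
final class RetainedStateSketch {
    enum Status { AVAILABLE, UNAVAILABLE }

    static final class Event {
        final String address;
        final Status status;
        final String metadata; // stands in for any extra data an event subtype may carry

        Event(String address, Status status, String metadata) {
            this.address = address;
            this.status = status;
            this.metadata = metadata;
        }
    }

    final Map<String, Event> currentState = new LinkedHashMap<>();

    // Same update rule as in the reviewed snippet: remove on UNAVAILABLE, otherwise overwrite.
    void consume(Event event) {
        if (Status.UNAVAILABLE.equals(event.status)) {
            currentState.remove(event.address);
        } else {
            currentState.put(event.address, event);
        }
    }

    // Builds removal events for retained addresses missing from the new state of the world.
    // Only the address is carried over; metadata is intentionally dropped.
    List<Event> unavailableFor(Collection<String> stillPresent) {
        List<Event> removals = new ArrayList<>();
        for (Event retained : currentState.values()) {
            if (!stillPresent.contains(retained.address)) {
                removals.add(new Event(retained.address, Status.UNAVAILABLE, null));
            }
        }
        return removals;
    }

    public static void main(String[] args) {
        RetainedStateSketch state = new RetainedStateSketch();
        state.consume(new Event("a", Status.AVAILABLE, "weight=1"));
        state.consume(new Event("a", Status.AVAILABLE, "weight=5")); // overwrites the first event
        state.consume(new Event("b", Status.AVAILABLE, "weight=1"));
        for (Event e : state.unavailableFor(Collections.singletonList("b"))) {
            System.out.println(e.address + " -> " + e.status);
        }
    }
}
```

In this model the map holds one entry per address regardless of how many updates arrived, and the single removal produced for `a` is the same whichever of its events was retained.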
I only got partially through this and will continue in another session.
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

class RetryingServiceDiscovererTest {
Recovered all tests from https://github.com/apple/servicetalk/pull/1949/files#diff-17f80381993b8c2b74a8b78674c065228c8ef16a9320fe23a11abffdfe629eb9 and enhanced those to cover more edge cases
// Because of the change in https://github.com/apple/servicetalk/pull/2379, we should constrain the type back to
// ServiceDiscovererEvent without "? extends" to allow RetryingServiceDiscoverer to mark events as UNAVAILABLE.
Is there a way we can back out these changes? I think it's technically a breaking change but afaict we could never practically make use of them because there isn't a type parameter on the ClientBuilder that lets us bridge the type between the ServiceDiscoverer and the LB.
It's not about bridging the type between SD and LB; it's about the possibility of passing a custom SD with a custom event subtype to `HttpClients`. Without this, it won't be possible to use those implementations; they would have to change to use `ServiceDiscovererEvent` instead.
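The trade-off discussed in this thread can be illustrated with a small, self-contained sketch. All types below (`Event`, `BasicEvent`, `ZonedEvent`, `Discoverer`, `widen`) are hypothetical stand-ins, not ServiceTalk types. The sketch also swaps in a different technique from the PR: it widens the element type by copying into a new list, whereas the PR describes a `CastedServiceDiscoverer`, whose implementation is not reproduced here. The point is only that a wildcard lets callers pass a discoverer with a custom event subtype, while code that needs to synthesize its own events must first constrain the type back to the base event.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal types illustrating the wildcard constraint; not the ServiceTalk API.
final class WildcardSketch {
    interface Event {
        String address();
        boolean available();
    }

    static final class BasicEvent implements Event {
        private final String address;
        private final boolean available;

        BasicEvent(String address, boolean available) {
            this.address = address;
            this.available = available;
        }

        public String address() { return address; }
        public boolean available() { return available; }
    }

    // A custom subtype that a user-provided discoverer might emit.
    static final class ZonedEvent implements Event {
        private final String address;
        final String zone;

        ZonedEvent(String address, String zone) {
            this.address = address;
            this.zone = zone;
        }

        public String address() { return address; }
        public boolean available() { return true; }
    }

    interface Discoverer<E extends Event> {
        List<E> discover();
    }

    // Widens the element type by copying, so callers can add plain Event instances afterwards.
    static Discoverer<Event> widen(Discoverer<? extends Event> delegate) {
        return () -> new ArrayList<Event>(delegate.discover());
    }

    // Accepts any event subtype thanks to the wildcard, then appends a synthesized removal.
    static List<Event> discoverAndRevoke(Discoverer<? extends Event> custom, String revokedAddress) {
        List<Event> events = widen(custom).discover();
        // Adding to the List<? extends Event> returned by custom.discover() would not compile.
        events.add(new BasicEvent(revokedAddress, false));
        return events;
    }
}
```

A caller can pass a `Discoverer<ZonedEvent>` to `discoverAndRevoke` unchanged, which is the flexibility the wildcard preserves, while the returned list can still carry a synthesized removal event of the base type.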
import static java.time.Duration.ofSeconds;
import static java.util.Collections.emptyMap;

final class RetryingServiceDiscoverer<U, R, E extends ServiceDiscovererEvent<R>>
Should this be more broadly accessible? If so, consider adding `shareContextOnSub` inside the defer.
No, I don't have a use-case in mind where we need to share this "filter". For now, I will keep it without `shareContextOnSubscribe()` because subscribe can be driven from any request and I don't think the background discovery context needs to be shared with that.
lgtm
It should re-subscribe to the same publisher instead of calling `discover` one more time