Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

align with grpc base/balancer to trigger reconnect in Idle state (when we move to gRPC 1.41+) #335

Merged
merged 5 commits into from
Nov 21, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 12 additions & 1 deletion broker/protocol/dispatcher.go
Original file line number Diff line number Diff line change
Expand Up @@ -117,12 +117,19 @@ func (d *dispatcher) UpdateSubConnState(sc balancer.SubConn, state balancer.SubC
panic("unexpected SubConn")
}

if state.ConnectivityState == connectivity.Connecting && d.connState[sc] == connectivity.TransientFailure {
if (state.ConnectivityState == connectivity.Connecting || state.ConnectivityState == connectivity.Idle) &&
d.connState[sc] == connectivity.TransientFailure {
// gRPC will quickly transition failed connections back into a Connecting
// state. In many cases, such as a remote-initiated close from a
// shutting-down server, the SubConn may never return. Until we see a
// successful re-connect, continue to consider the SubConn as broken
// (and trigger invalidations of cached Routes which use it).

if state.ConnectivityState == connectivity.Idle {
sc.Connect()
}
d.mu.Unlock()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to return here, where we don't return with the Connect() call below?
As far as I can tell, the purpose is to not call UpdateState below. Is that true, and why would it a problem to do so?

(I'm trying to figure out if this code block could be eliminated, and instead have the check below be the only place where Connect() is called).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the core problem (as I understand it) is that Idle sub-connections won't actually re-connect unless sc.Connect() is explicitly called, right?

Was this a change in gRPC behavior? I'm confused how this isn't a problem we've experienced ourselves. For that matter, help me understand: how does a previously-active SubConn come to be Idle again (my very old recollection is they bounce between failure and Connecting)?

In any case, since the check below will cause Connect to be called, is there a particular reason that we don't want to update connState[sc] if it's state.Idle now?

Put another way, would the root issue would be resolve with just the addition below of:

	if state.ConnectivityState == connectivity.Idle {
		sc.Connect()
	}

Almost certainly I'm missing something 🤷 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I am going to re-look into the changes that occurred in gRPC over time, as it has been updated in our master a couple of times, to see if it plays any part. Agree that it is strange it has not popped up in your deployment.

A lot of my investigation was the evolution (via blame) in https://github.com/grpc/grpc-go/blob/master/balancer/base/balancer.go#L180 as we appeared to be lined up pretty closely with the overall pattern of UpdateSubConnState() but they have been tweaking it over time.

I will get back with hopefully good answers to your questions and your simplification may be valid.

Copy link
Contributor Author

@ddowker ddowker Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking but i think this gRPC PR grpc/grpc-go@03268c8#diff-f9772420643575b997f617c6d4c1934aaa26f057042c7fa71a521ef5bb2af253 is the one where gRPC have introduced the transition to the Idle state and adjusted their own example loadbalancer to work with that.

In their example balancer/base/balancer.go they were returning in the top check for Idle state and bypassing the call to UpdateState() so i followed that. I can dig deeper on that to see why that may or may not make a difference.

In looking deeper i think this change above may be in a later version of gRPC than the gazette repo is using (as gazette go.mod shows v1.40.0). Our application mono-repo uses gRPC in a lot of services and our go.mod is at v.1.52.0. It may explain why you do not see this issue and we do (in our consumers). Let me confirm the exact version that the commit above shows up.

Copy link
Contributor Author

@ddowker ddowker Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the gRPC change above was for the 1.41.0 release which the gazette repo does not yet use. They spell out the behavior changes here: https://github.com/grpc/grpc-go/releases/tag/v1.41.0 eventhough the commit i link to is not explicitly called out it follows the balancer #4613 PR they mention.

So my PR may be premature for wider release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will talk to Michael when he comes back next week to see if we should close this PR to master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgraettinger i guess these(or similar) changes will be required when gRPC is upgraded to 1.41+. For now i changed the title to reflect this. I will leave open for now but will close if you think that is best.

} else {
d.connState[sc] = state.ConnectivityState
}
Expand All @@ -132,6 +139,10 @@ func (d *dispatcher) UpdateSubConnState(sc balancer.SubConn, state balancer.SubC
delete(d.connID, sc)
delete(d.connState, sc)
}

if state.ConnectivityState == connectivity.Idle {
sc.Connect()
}
d.mu.Unlock()

// Notify gRPC that block requests may now be able to proceed.
Expand Down
2 changes: 1 addition & 1 deletion go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ require (
github.com/mattn/go-sqlite3 v2.0.3+incompatible
github.com/olekukonko/tablewriter v0.0.5
github.com/pkg/errors v0.9.1
github.com/prometheus/client_golang v1.11.0
github.com/prometheus/client_golang v1.11.1
github.com/prometheus/client_model v0.2.0
github.com/sirupsen/logrus v1.8.1
github.com/soheilhy/cmux v0.1.5
Expand Down
3 changes: 2 additions & 1 deletion go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -342,8 +342,9 @@ github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZN
github.com/prometheus/client_golang v0.9.1/go.mod h1:7SWBe2y4D6OKWSNQJUaRYU/AaXPKyh/dDVn+NZz0KFw=
github.com/prometheus/client_golang v1.0.0/go.mod h1:db9x61etRT2tGnBNRi70OPL5FsnadC4Ky3P0J6CfImo=
github.com/prometheus/client_golang v1.7.1/go.mod h1:PY5Wy2awLA44sXw4AOSfFBetzPP4j5+D6mVACh+pe2M=
github.com/prometheus/client_golang v1.11.0 h1:HNkLOAEQMIDv/K+04rukrLx6ch7msSRwf3/SASFAGtQ=
github.com/prometheus/client_golang v1.11.0/go.mod h1:Z6t4BnS23TR94PD6BsDNk8yVqroYurpAkEiz0P2BEV0=
github.com/prometheus/client_golang v1.11.1 h1:+4eQaD7vAZ6DsfsxB15hbE0odUjGI5ARs9yskGu1v4s=
github.com/prometheus/client_golang v1.11.1/go.mod h1:Z6t4BnS23TR94PD6BsDNk8yVqroYurpAkEiz0P2BEV0=
github.com/prometheus/client_model v0.0.0-20180712105110-5c3871d89910/go.mod h1:MbSGuTsp3dbXC40dX6PRTWyKYBIrTGTE9sqQNg2J8bo=
github.com/prometheus/client_model v0.0.0-20190129233127-fd36f4220a90/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
Expand Down