*: support dynamic tso service #8517

Closed
wants to merge 32 commits into from

Member

@rleungx rleungx commented Aug 12, 2024

What problem does this PR solve?

Issue Number: ref #8477

What is changed and how does it work?

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 12, 2024
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 15, 2024
@@ -660,7 +660,7 @@ func (c *client) Close() {
}
}

- func (c *client) setServiceMode(newMode pdpb.ServiceMode) {
+ func (c *client) setServiceMode(newMode pdpb.ServiceMode, skipSameMode bool) {
Member

Prefer using a more straightforward word.

Suggested change
- func (c *client) setServiceMode(newMode pdpb.ServiceMode, skipSameMode bool) {
+ func (c *client) setServiceMode(newMode pdpb.ServiceMode, force bool) {

Member Author

It's not the same as force.

Member

why not skipSameMode always?

}
errMsg := err.Error()
return strings.Contains(errMsg, "not found tso address") ||
strings.Contains(errMsg, "maximum number of retries exceeded")
Member

This error can also occur when a leader cannot be elected. In that case, would this be a misjudgment?

Member Author

We don't check this error on the client side. Do you know the reason?

@@ -406,6 +406,8 @@ func TestTSOFollowerProxyWithTSOService(t *testing.T) {
backendEndpoints := pdLeaderServer.GetAddr()
tsoCluster, err := tests.NewTestTSOCluster(ctx, 2, backendEndpoints)
re.NoError(err)
// let service discovery know the TSO service
time.Sleep(500 * time.Millisecond)
Member

Can it be replaced with an Eventually?

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 76.43678% with 41 lines in your changes missing coverage. Please review.

Project coverage is 71.80%. Comparing base (26ced22) to head (edf3908).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8517      +/-   ##
==========================================
- Coverage   77.54%   71.80%   -5.74%     
==========================================
  Files         474      517      +43     
  Lines       62359    67531    +5172     
==========================================
+ Hits        48358    48493     +135     
- Misses      10441    15482    +5041     
+ Partials     3560     3556       -4     
Flag Coverage Δ
unittests 71.80% <76.43%> (-5.74%) ⬇️

@rleungx rleungx force-pushed the dynamic-switch branch 2 times, most recently from a36ad5d to f2c0c14 Compare August 23, 2024 07:38
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 26, 2024
@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2024
@@ -38,6 +38,16 @@ func IsLeaderChange(err error) bool {
strings.Contains(errMsg, NotPrimaryErr)
}

// IsServiceModeChange will determine whether there is a service mode change.
Member

Suggested change
- // IsServiceModeChange will determine whether there is a service mode change.
+ // IsServiceModeChange determines whether there is a service mode change.

if err != nil {
if needRetry := handleStreamError(err); needRetry {
continue
if s.forwardToTSOService() {
Member

We can reduce some of the indentation

Suggested change
- if s.forwardToTSOService() {
+ if !s.forwardToTSOService() {
+     return s.tsoAllocatorManager.HandleRequest(ctx, tso.GlobalDCLocation, 1)
+ }
+ request := xxxx
+ .....

@@ -569,6 +590,72 @@ func (s *GrpcServer) Tso(stream pdpb.PD_TsoServer) error {
continue
}

if s.forwardToTSOService() {
Member

Maybe wrap the three situations (local TSO / MS TSO / normal) into three functions. Just a suggestion.
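
A minimal sketch of what that split could look like. Every name here (`tsoPath`, `dispatchTSO`, and the three handlers) is hypothetical, and the meaning assigned to each situation is an assumption; the real handlers would take the stream and request rather than a count.

```go
package main

import "fmt"

// tsoPath names the three situations from the review comment.
// The type and all functions below are hypothetical, not PD's API.
type tsoPath int

const (
	pathNormal       tsoPath = iota // assumed: PD serves the timestamp itself
	pathLocal                       // assumed: local TSO
	pathMicroService                // assumed: forward to the independent TSO service
)

func handleNormalTSO(count uint32) string { return fmt.Sprintf("normal: %d", count) }
func handleLocalTSO(count uint32) string  { return fmt.Sprintf("local: %d", count) }
func forwardMSTSO(count uint32) string    { return fmt.Sprintf("ms: %d", count) }

// dispatchTSO keeps the caller flat: one switch, one function per situation.
func dispatchTSO(p tsoPath, count uint32) string {
	switch p {
	case pathMicroService:
		return forwardMSTSO(count)
	case pathLocal:
		return handleLocalTSO(count)
	default:
		return handleNormalTSO(count)
	}
}

func main() {
	fmt.Println(dispatchTSO(pathMicroService, 1))
	fmt.Println(dispatchTSO(pathLocal, 2))
	fmt.Println(dispatchTSO(pathNormal, 3))
}
```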

Comment on lines 410 to 419
if !c.IsServiceIndependent(constant.TSOServiceName) {
// leader tso service exit, tso independent service provide tso
c.tsoAllocator.ResetAllocatorGroup(tso.GlobalDCLocation, true)
}
if !c.IsServiceIndependent(constant.TSOServiceName) {
log.Info("TSO server starts to provide timestamp")
}
c.SetServiceIndependent(constant.TSOServiceName)
Member

Suggested change
- if !c.IsServiceIndependent(constant.TSOServiceName) {
-     // leader tso service exit, tso independent service provide tso
-     c.tsoAllocator.ResetAllocatorGroup(tso.GlobalDCLocation, true)
- }
- if !c.IsServiceIndependent(constant.TSOServiceName) {
-     log.Info("TSO server starts to provide timestamp")
- }
- c.SetServiceIndependent(constant.TSOServiceName)
+ if !c.IsServiceIndependent(constant.TSOServiceName) {
+     // leader tso service exit, tso independent service provide tso
+     c.tsoAllocator.ResetAllocatorGroup(tso.GlobalDCLocation, true)
+     log.Info("TSO server starts to provide timestamp")
+ }
+ c.SetServiceIndependent(constant.TSOServiceName)

@@ -390,24 +397,84 @@ func (c *RaftCluster) checkServices() {
}
}

// checkTSOService checks the TSO service.
func (c *RaftCluster) checkTSOService() {
Member

Can you add more comments for this function, both above it and inside it? There seem to be too many situations handled in this function.


ctx, cancel := context.WithCancel(c.ctx)
defer cancel()
ticker := time.NewTicker(serviceModeUpdateInterval)
failpoint.Inject("fastUpdateServiceMode", func() {
ticker.Stop()
ticker = time.NewTicker(10 * time.Millisecond)
Member

ticker.Reset()?

client/client.go Outdated
@@ -713,6 +712,7 @@ func (c *client) resetTSOClientLocked(mode pdpb.ServiceMode) {
log.Warn("[pd] intend to switch to unknown service mode, just return")
return
}
// Replace the old TSO client.
Member

duplicate

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 2, 2024
Contributor

ti-chi-bot bot commented Sep 6, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 29, 2024
Signed-off-by: Ryan Leung <[email protected]>

- func (suite *APIServerForward) checkAvailableTSO(re *require.Assertions) {
-     mcs.WaitForTSOServiceAvailable(suite.ctx, re, suite.pdClient)
+ func (suite *APIServerForward) checkAvailableTSO(re *require.Assertions, needWait bool) {
Member

@okJiang okJiang Oct 8, 2024

Prefer splitting this into two functions:

  1. checkAvailableTSO
  2. waitAvailableTSO

Signed-off-by: Ryan Leung <[email protected]>
return false
}
errMsg := err.Error()
return strings.Contains(errMsg, "not found tso address") ||
Member

Can we use an ErrorCode or an error type instead of strings.Contains here?

Member Author

use a constant as IsLeaderChange

if c.IsServiceIndependent(constant.TSOServiceName) {
log.Info("PD server starts to provide timestamp")
}
c.UnsetServiceIndependent(constant.TSOServiceName)
Member

Why unset the flag regardless of whether it is the TSO service or the PD server?

Member

Maybe we can add more comments for independentServices sync.Map
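
As an illustration of the kind of comment being asked for, here is a self-contained sketch of a `sync.Map` used as a set of flags. The `serviceFlags` type and the documented semantics are assumptions based on the method names in the diff, not the actual PD code.

```go
package main

import (
	"fmt"
	"sync"
)

// serviceFlags is a hypothetical stand-in for the struct that owns the map.
type serviceFlags struct {
	// independentServices records which services are currently provided by
	// a standalone server. Key: service name (string); value: struct{}.
	// Assumed semantics: a missing key means PD itself provides the service.
	independentServices sync.Map
}

// IsServiceIndependent reports whether name is marked as independent.
func (f *serviceFlags) IsServiceIndependent(name string) bool {
	_, ok := f.independentServices.Load(name)
	return ok
}

// SetServiceIndependent marks name as provided by a standalone server.
func (f *serviceFlags) SetServiceIndependent(name string) {
	f.independentServices.Store(name, struct{}{})
}

// UnsetServiceIndependent clears the mark; deleting an absent key is a no-op.
func (f *serviceFlags) UnsetServiceIndependent(name string) {
	f.independentServices.Delete(name)
}

func main() {
	var f serviceFlags
	f.SetServiceIndependent("tso")
	fmt.Println(f.IsServiceIndependent("tso")) // true
	f.UnsetServiceIndependent("tso")
	f.UnsetServiceIndependent("tso") // second call changes nothing
	fmt.Println(f.IsServiceIndependent("tso")) // false
}
```

Under these assumed semantics, calling the unset method unconditionally is harmless, since deleting a missing key does nothing.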

@@ -569,6 +590,72 @@ func (s *GrpcServer) Tso(stream pdpb.PD_TsoServer) error {
continue
}

if s.forwardToTSOService() {
if request.GetCount() == 0 {
Member

Can we reuse forwardTSO here?

Member Author

Let me think about how to reduce the duplicated code.

@@ -855,12 +856,16 @@ func (c *DRAutoSyncReplicationConfig) adjust(meta *configutil.ConfigMetaData) {
// MicroServiceConfig is the configuration for micro service.
type MicroServiceConfig struct {
EnableSchedulingFallback bool `toml:"enable-scheduling-fallback" json:"enable-scheduling-fallback,string"`
EnableTSOFallback bool `toml:"enable-tso-fallback" json:"enable-tso-fallback,string"`
Member

Is this necessary? The code for this feature is so large that I don't think this switch can control it as we expect.

Member Author

Yes, if we don't want PD to provide the service after tso server is stopped.

ch := make(chan struct{})
ch1 := make(chan struct{})
wg.Add(1)
go func(ctx context.Context, wg *sync.WaitGroup, ch, ch1 chan struct{}) {
Member

It seems we don't need to pass these params.

@@ -578,3 +585,212 @@ func (suite *CommonTestSuite) TestBootstrapDefaultKeyspaceGroup() {
suite.pdLeader.ResignLeader()
suite.pdLeader = suite.cluster.GetServer(suite.cluster.WaitLeader())
}

func TestTSOServiceSwitch1(t *testing.T) {
Member

Please add a description for these three UTs, or make the UT names more meaningful.

Comment on lines 645 to 646
ch1 <- struct{}{}
<-ch
Member

Prefer wrapping this in a function.

Suggested change
- ch1 <- struct{}{}
- <-ch
+ waitOneTS := func() {
+     ch1 <- struct{}{}
+     <-ch
+ }
+ ...
+ waitOneTS()
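
Put together with the earlier remark about not passing parameters to the goroutine, the handshake could be sketched as below. This is a standalone illustration, not the test from the PR: `runHandshake` is a hypothetical wrapper, and the counter stands in for fetching a timestamp.

```go
package main

import (
	"fmt"
	"sync"
)

// runHandshake drives a worker goroutine for the given number of rounds and
// returns the values it produced. The worker closes over wg, ch, and ch1
// instead of receiving them as parameters.
func runHandshake(rounds int) []int {
	var wg sync.WaitGroup
	ch := make(chan struct{})  // worker -> caller: one request finished
	ch1 := make(chan struct{}) // caller -> worker: issue one request
	var got []int

	wg.Add(1)
	go func() {
		defer wg.Done()
		ts := 0
		for range ch1 {
			ts++ // stands in for fetching one timestamp
			got = append(got, ts)
			ch <- struct{}{}
		}
	}()

	// waitOneTS triggers exactly one request and blocks until it completes.
	waitOneTS := func() {
		ch1 <- struct{}{}
		<-ch
	}

	for i := 0; i < rounds; i++ {
		waitOneTS()
	}
	close(ch1) // lets the worker's range loop end
	wg.Wait()
	return got
}

func main() {
	fmt.Println(runHandshake(3)) // [1 2 3]
}
```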

re.NoError(err)
tsoCluster.WaitForDefaultPrimaryServing(re)

// Wait for TSO server to start and PD to detect it
Member

Is there a way to detect this instead of using Sleep?

Member Author

I don't have a better idea.

tsoCluster.Destroy()

// Wait for the configuration change to take effect
time.Sleep(300 * time.Millisecond)
Member

ditto

re.NoError(failpoint.Disable("github.com/tikv/pd/client/fastUpdateServiceMode"))
}

func TestTSOServiceSwitch3(t *testing.T) {
Member

How does this differ from the two above?

Signed-off-by: Ryan Leung <[email protected]>
Contributor

ti-chi-bot bot commented Oct 15, 2024

@rleungx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-integration-realcluster-test 064ca7b link false /test pull-integration-realcluster-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2024
Contributor

ti-chi-bot bot commented Oct 18, 2024

PR needs rebase.


@rleungx
Member Author

rleungx commented Nov 11, 2024

Closing this because the related PRs have been merged.

@rleungx rleungx closed this Nov 11, 2024
@rleungx rleungx deleted the dynamic-switch branch November 11, 2024 06:23
Labels
dco-signoff: yes Indicates the PR's author has signed the dco. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants