
etcdserver: fix panic when checking IsLearner of removed member #18606

Merged: 1 commit into etcd-io:main on Oct 4, 2024

Conversation

@jscissr (Contributor) commented Sep 19, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics. There was an attempt to fix this by first checking IsMemberExist(), but this is not a correct fix because the member could be removed between the two calls. Instead of panicking when the member was removed, IsLearner() should return false. A node which is not a member is also not a learner.

There was a similar concurrency bug when accessing the IsLearner property of a member: it panics with a nil pointer dereference if the member is removed between the IsMemberExist() and Member() calls.
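
For illustration, a minimal self-contained model of this check-then-act race and the single-lookup fix (the names mirror the real RaftCluster APIs, but the types and bodies below are simplified assumptions, not the actual etcd code):

package example

import "sync"

// Simplified stand-ins for the etcd types involved.
type Member struct{ IsLearner bool }

type RaftCluster struct {
	sync.Mutex
	localID uint64
	members map[uint64]*Member
}

// IsMemberExist reports whether the member with the given id is in the cluster.
func (c *RaftCluster) IsMemberExist(id uint64) bool {
	c.Lock()
	defer c.Unlock()
	_, ok := c.members[id]
	return ok
}

// Member returns the member with the given id, or nil if it was removed.
func (c *RaftCluster) Member(id uint64) *Member {
	c.Lock()
	defer c.Unlock()
	return c.members[id]
}

// Racy check-then-act: the member can be removed between the two calls,
// so Member(id) may return nil and the field access panics.
func isLearnerRacy(c *RaftCluster, id uint64) bool {
	return c.IsMemberExist(id) && c.Member(id).IsLearner
}

// Single lookup: a removed member is simply treated as "not a learner".
func isLearnerSafe(c *RaftCluster, id uint64) bool {
	m := c.Member(id)
	return m != nil && m.IsLearner
}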

I did not add a unit test because it's basically impossible to test for such concurrency bugs.

@k8s-ci-robot

Hi @jscissr. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter commented Sep 19, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.74%. Comparing base (5704c61) to head (85dfd2d).
Report is 8 commits behind head on main.

Current head 85dfd2d differs from pull request most recent head 3374e27

Please upload reports for the commit 3374e27 to get more accurate results.

Files with missing lines | Patch % | Lines
client/v3/client.go | 0.00% | 1 Missing ⚠️
server/etcdserver/server.go | 75.00% | 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines | Coverage Δ
client/v3/retry_interceptor.go | 65.61% <100.00%> (ø)
server/etcdserver/api/membership/cluster.go | 88.53% <100.00%> (ø)
client/v3/client.go | 84.89% <0.00%> (ø)
server/etcdserver/server.go | 81.28% <75.00%> (+0.01%) ⬆️

... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18606      +/-   ##
==========================================
- Coverage   68.77%   68.74%   -0.04%     
==========================================
  Files         420      420              
  Lines       35535    35536       +1     
==========================================
- Hits        24439    24428      -11     
- Misses       9666     9678      +12     
  Partials     1430     1430              

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5704c61...3374e27. Read the comment docs.

@ahrtr (Member) commented Sep 22, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics.

Did you see a case where the panic actually happened, or can you create an e2e or integration test case to reproduce it?

@jscissr (Contributor, Author) commented Sep 23, 2024

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

Yes, I can. Here are integration tests that demonstrate the panic; I need to add an artificial delay in IsMemberExist to reliably reproduce it.

diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
 
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)

Here are the stack traces when running these tests:

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentRemoveMember' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145df4d]

goroutine 218 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).mayRemoveMember(0xc0003d3808, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1596 +0xed
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).RemoveMember(0xc0003d3808, {0x1cf91e0, 0xc001dbe540}, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1428 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*ClusterServer).MemberRemove(0xc000012ae0, {0x1cf91e0, 0xc001dbe540}, 0xc001dbe570)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/member.go:71 +0xad


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentMoveLeader' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1459bed]

goroutine 207 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).MoveLeader(0xc0003fb508, {0x1cf91e0, 0xc001e33bc0}, 0x984e59084a1f2c8b, 0xb3a11ebbc585e63a)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1227 +0xcd
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).MoveLeader(0xc00041f790, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:282 +0x116
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).MoveLeader(0xc000600ef0, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:347 +0xb1


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentUnary' '-p=2')
panic: failed to find local ID in cluster members

goroutine 365 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2, 0xc000fa95f0, {0xc000229a00?, 0x2?, 0x2?})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x9f
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000fa95f0, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ad
go.uber.org/zap.(*Logger).Panic(0xc000084a00, {0x1aa86bb, 0x2a}, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x68
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).IsLocalMemberLearner(0xc0003a04d0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/membership/cluster.go:786 +0x35b
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).IsLearner(0xc000295508)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:2471 +0x45
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newUnaryInterceptor.func4({0x1cf91e0, 0xc001ed27b0}, {0x1a3ec60, 0xc000002e10}, 0xc001ea4320?, 0xc001eab540)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/interceptor.go:52 +0x9a

@ahrtr (Member) commented Sep 24, 2024

Thanks for the integration test cases.

I agree that it may happen theoretically. Did you ever see the panic in production or in your test environment with the official etcd releases? I believe not, so overall this is a minor issue to me.

The change to server/etcdserver/server.go is straightforward and safe. Please also add the test cases TestConcurrentMoveLeader and TestConcurrentRemoveMember to this PR. You can use a gofail failpoint to trigger the sleep; see the example below:

require.NoError(t, gofail.Enable(fpName, `sleep("3s")`))
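
For context, a typical way to wire such a failpoint into a test looks roughly like this (a sketch assuming the usual gofail runtime API; the failpoint name follows the one proposed later in this thread, the package and test names are illustrative, and the test body is only outlined):

package integration_test

import (
	"testing"

	"github.com/stretchr/testify/require"
	gofail "go.etcd.io/gofail/runtime"
)

func TestConcurrentRemoveMemberSketch(t *testing.T) {
	// Make the instrumented code pause for 3s at the failpoint, widening the
	// race window so the concurrent MemberRemove reliably hits it.
	// (The binary must be built with gofail failpoints enabled.)
	require.NoError(t, gofail.Enable("sleepAfterIsMemberExist", `sleep("3s")`))
	t.Cleanup(func() {
		require.NoError(t, gofail.Disable("sleepAfterIsMemberExist"))
	})

	// ... start the cluster and issue the concurrent MemberRemove calls here ...
}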

For the change to server/etcdserver/api/v3rpc/interceptor.go, I prefer to keep it unchanged:

  • It changes the behaviour/protocol between etcdserver and client.
  • It would also make more sense to return an error like "member not found" instead of ErrGRPCNotSupportedForLearner if the member is removed in between.
  • When the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

@jscissr (Contributor, Author) commented Sep 25, 2024

I found the bug by reading the code, and have indeed not observed it happen without the added delay.

I have added the TestConcurrentMoveLeader and TestConcurrentRemoveMember tests to the PR, with failpoint and reliability improvements. That said, I'm not sure these tests add much value to the codebase; the only way they could catch a bug is if someone adds back calls to IsMemberExist in these functions.

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes; they only reduce the set of possible behaviors by removing the possibility of panicking. I changed IsLocalMemberLearner to return false when the local node is not a member, which means that s.IsMemberExist(s.MemberID()) && s.IsLearner() is equivalent to s.IsLearner().
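
In other words, the proposed cluster.go change (which was later dropped from this PR at the reviewer's request) would amount to something like the following sketch, reusing the simplified RaftCluster stand-in from the earlier sketch:

// Sketch of the proposed behaviour: a removed local member is simply "not a learner".
func (c *RaftCluster) IsLocalMemberLearner() bool {
	c.Lock()
	defer c.Unlock()
	localMember, ok := c.members[c.localID]
	if !ok {
		// The real code currently panics here with
		// "failed to find local ID in cluster members".
		return false
	}
	return localMember.IsLearner
}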

Comment on lines 786 to 790
c.lg.Panic(
"failed to find local ID in cluster members",
zap.String("cluster-id", c.cid.String()),
zap.String("local-member-id", c.localID.String()),
)
A Member reviewer commented:
Suggest not to change this. I don't think it will happen in production or in a test environment, and if it does happen, it means something critical has occurred.

@ahrtr (Member) commented Sep 26, 2024

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes, they only reduce the set of possible behaviors by removing the possibility of panicking.

Please read #18606 (comment), and also the comments below:

  • Previously etcdserver panicked in such a case, but now it returns a rpctypes.ErrGRPCNotSupportedForLearner error. So it changes the behaviour and may mislead users.
  • Also, an error like "member not found" would be more appropriate in such a case.

Also as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

So I suggest not to change

  • server/etcdserver/api/v3rpc/interceptor.go
  • server/etcdserver/api/membership/cluster.go

Comment on lines 815 to 817
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)

A Member reviewer commented:

You can add a failpoint right above line 821 ("return ok"):

// gofail: var sleepAfterIsMemberExist struct{}

and inject sleep("1s") into it during the test.
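
Put together, the instrumented lookup would look roughly like the sketch below (again using the simplified RaftCluster stand-in from the earlier sketches; releasing the lock before the failpoint is an assumption here so that the injected sleep does not block the concurrent removal, and the exact merged code may differ):

// IsMemberExist returns if the member with the given id exists in cluster.
func (c *RaftCluster) IsMemberExist(id uint64) bool {
	c.Lock()
	_, ok := c.members[id]
	c.Unlock()

	// gofail: var sleepAfterIsMemberExist struct{}
	// With the failpoint enabled as sleep("1s"), the sleep happens here, after
	// the lock is released, widening the window in which a concurrent
	// MemberRemove can delete the member before the caller acts on `ok`.
	return ok
}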

Commit message (the PR's single commit):

There was a concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls.

Signed-off-by: Jan Schär <[email protected]>
@jscissr (Contributor, Author) commented Sep 30, 2024

I removed the changes you disliked and adjusted the failpoint.

@ahrtr (Member) commented Sep 30, 2024

/ok-to-test

@ahrtr (Member) left a review:

LGTM

Thanks

@@ -518,3 +518,51 @@ func TestSpeedyTerminate(t *testing.T) {
case <-donec:
}
}

// TestConcurrentRemoveMember demonstrated a panic in mayRemoveMember with
// concurrent calls to MemberRemove. To reliably reproduce the panic, a delay
A Member reviewer commented:

There should never be a concurrent call to MemberRemove. Cluster state changes should all be done via the apply loop, which guarantees that there is only one concurrent change to cluster state.

The reason for locking in the cluster struct (and any other internal structs) is to prevent reader+writer races. There should never be a writer+writer race.

A Member replied:

This PR has nothing to do with the apply loop/workflow. It's fixing a panic in the API layer. Refer to #18606 (comment)

A Member replied:

Oh, I didn't notice that this is an integration test. I'm just a little confused about why the MemberRemove call from the client is executed concurrently. Should it be linearized?

A Member replied:

I'm just a little confused about why the MemberRemove call from the client is executed concurrently.

Right, usually we don't expect users to do this. But etcdserver shouldn't panic because of a user's inappropriate behaviour, whatever it is. Note that the panic happens at the API layer, not in the apply loop/workflow.

@serathius (Member) commented Oct 4, 2024:

Sorry, I didn't mean "why?" but "how?". I agree that this is a problem, but I'm trying to understand how a concurrency bug in the cluster struct surfaces to the API.

@serathius (Member) left a review:

Independent of the discussion, this change is good to merge.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, jscissr, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahrtr merged commit d012386 into etcd-io:main on Oct 4, 2024. 39 checks passed.