
etcdserver: fix panic when checking IsLearner of removed member #18606

Merged: 1 commit into etcd-io:main on Oct 4, 2024

Conversation

@jscissr (Contributor) commented Sep 19, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics. There was an attempt to fix this by first checking IsMemberExist(), but this is not a correct fix because the member could be removed between the two calls. Instead of panicking when the member was removed, IsLearner() should return false. A node which is not a member is also not a learner.

There was a similar concurrency bug when accessing the IsLearner property of a member: it panics with a nil pointer dereference if the member is removed between the IsMemberExist() and Member() calls.
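
For illustration, a minimal self-contained model of this check-then-act race and the single-lookup fix (the names mirror the real RaftCluster APIs, but the types and bodies below are simplified assumptions, not the actual etcd code):

package example

import "sync"

// Simplified stand-ins for the etcd types involved.
type Member struct{ IsLearner bool }

type RaftCluster struct {
	sync.Mutex
	localID uint64
	members map[uint64]*Member
}

// IsMemberExist reports whether the member with the given id is in the cluster.
func (c *RaftCluster) IsMemberExist(id uint64) bool {
	c.Lock()
	defer c.Unlock()
	_, ok := c.members[id]
	return ok
}

// Member returns the member with the given id, or nil if it was removed.
func (c *RaftCluster) Member(id uint64) *Member {
	c.Lock()
	defer c.Unlock()
	return c.members[id]
}

// Racy check-then-act: the member can be removed between the two calls,
// so Member(id) may return nil and the field access panics.
func isLearnerRacy(c *RaftCluster, id uint64) bool {
	return c.IsMemberExist(id) && c.Member(id).IsLearner
}

// Single lookup: a removed member is simply treated as "not a learner".
func isLearnerSafe(c *RaftCluster, id uint64) bool {
	m := c.Member(id)
	return m != nil && m.IsLearner
}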

I did not add a unit test because it's basically impossible to test for such concurrency bugs.

@k8s-ci-robot

Hi @jscissr. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter commented Sep 19, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.74%. Comparing base (5704c61) to head (85dfd2d).
Report is 8 commits behind head on main.

Current head 85dfd2d differs from pull request most recent head 3374e27

Please upload reports for the commit 3374e27 to get more accurate results.

Files with missing lines | Patch % | Lines
client/v3/client.go | 0.00% | 1 Missing ⚠️
server/etcdserver/server.go | 75.00% | 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines | Coverage Δ
client/v3/retry_interceptor.go | 65.61% <100.00%> (ø)
server/etcdserver/api/membership/cluster.go | 88.53% <100.00%> (ø)
client/v3/client.go | 84.89% <0.00%> (ø)
server/etcdserver/server.go | 81.28% <75.00%> (+0.01%) ⬆️

... and 21 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18606      +/-   ##
==========================================
- Coverage   68.77%   68.74%   -0.04%     
==========================================
  Files         420      420              
  Lines       35535    35536       +1     
==========================================
- Hits        24439    24428      -11     
- Misses       9666     9678      +12     
  Partials     1430     1430              

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5704c61...3374e27. Read the comment docs.

@ahrtr (Member) commented Sep 22, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics.

Did you see a case where the panic actually happened, or can you create an e2e or integration test case to reproduce it?

@jscissr (Contributor, Author) commented Sep 23, 2024

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

Yes, I can. Here are integration tests that demonstrate the panic; I need to add an artificial delay in IsMemberExist to reliably reproduce it.

diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
 
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)

Here are the stack traces when running these tests:

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentRemoveMember' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145df4d]

goroutine 218 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).mayRemoveMember(0xc0003d3808, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1596 +0xed
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).RemoveMember(0xc0003d3808, {0x1cf91e0, 0xc001dbe540}, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1428 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*ClusterServer).MemberRemove(0xc000012ae0, {0x1cf91e0, 0xc001dbe540}, 0xc001dbe570)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/member.go:71 +0xad


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentMoveLeader' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1459bed]

goroutine 207 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).MoveLeader(0xc0003fb508, {0x1cf91e0, 0xc001e33bc0}, 0x984e59084a1f2c8b, 0xb3a11ebbc585e63a)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1227 +0xcd
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).MoveLeader(0xc00041f790, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:282 +0x116
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).MoveLeader(0xc000600ef0, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:347 +0xb1


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentUnary' '-p=2')
panic: failed to find local ID in cluster members

goroutine 365 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2, 0xc000fa95f0, {0xc000229a00?, 0x2?, 0x2?})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x9f
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000fa95f0, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ad
go.uber.org/zap.(*Logger).Panic(0xc000084a00, {0x1aa86bb, 0x2a}, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x68
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).IsLocalMemberLearner(0xc0003a04d0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/membership/cluster.go:786 +0x35b
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).IsLearner(0xc000295508)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:2471 +0x45
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newUnaryInterceptor.func4({0x1cf91e0, 0xc001ed27b0}, {0x1a3ec60, 0xc000002e10}, 0xc001ea4320?, 0xc001eab540)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/interceptor.go:52 +0x9a

@ahrtr (Member) commented Sep 24, 2024

Thanks for the integration test cases.

I agree that it may happen theoretically. Did you ever see the panic in production or in your test environment with the official etcd releases? I believe not, so overall this is a minor issue to me.

The change to server/etcdserver/server.go is straightforward and safe. Please also add the test cases TestConcurrentMoveLeader and TestConcurrentRemoveMember to this PR. You can use a gofail failpoint to trigger the sleep; see the example below:

require.NoError(t, gofail.Enable(fpName, `sleep("3s")`))
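
For context, a typical way to wire such a failpoint into a test looks roughly like this (a sketch assuming the usual gofail runtime API; the failpoint name follows the one proposed later in this thread, the package and test names are illustrative, and the test body is only outlined):

package integration_test

import (
	"testing"

	"github.com/stretchr/testify/require"
	gofail "go.etcd.io/gofail/runtime"
)

func TestConcurrentRemoveMemberSketch(t *testing.T) {
	// Make the instrumented code pause for 3s at the failpoint, widening the
	// race window so the concurrent MemberRemove reliably hits it.
	// (The binary must be built with gofail failpoints enabled.)
	require.NoError(t, gofail.Enable("sleepAfterIsMemberExist", `sleep("3s")`))
	t.Cleanup(func() {
		require.NoError(t, gofail.Disable("sleepAfterIsMemberExist"))
	})

	// ... start the cluster and issue the concurrent MemberRemove calls here ...
}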

For the change to server/etcdserver/api/v3rpc/interceptor.go, I prefer to keep it unchanged:

  • It changes the behaviour/protocol between etcdserver and client.
  • It would also make more sense to return an error like "member not found" instead of ErrGRPCNotSupportedForLearner if the member is removed in between.
  • When the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

@jscissr (Contributor, Author) commented Sep 25, 2024

I found the bug by reading the code, and have indeed not observed it happen without the added delay.

I have added the TestConcurrentMoveLeader and TestConcurrentRemoveMember tests to the PR, with failpoint and reliability improvements. That said, I'm not sure these tests add much value to the codebase; the only way they could catch a bug is if someone adds back calls to IsMemberExist in these functions.

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes; they only reduce the set of possible behaviors by removing the possibility of panicking. I changed IsLocalMemberLearner to return false when the local node is not a member, which means that s.IsMemberExist(s.MemberID()) && s.IsLearner() is equivalent to s.IsLearner().
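
In other words, the proposed cluster.go change (which was later dropped from this PR at the reviewer's request) would amount to something like the following sketch, reusing the simplified RaftCluster stand-in from the earlier sketch:

// Sketch of the proposed behaviour: a removed local member is simply "not a learner".
func (c *RaftCluster) IsLocalMemberLearner() bool {
	c.Lock()
	defer c.Unlock()
	localMember, ok := c.members[c.localID]
	if !ok {
		// The real code currently panics here with
		// "failed to find local ID in cluster members".
		return false
	}
	return localMember.IsLearner
}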

Comment on lines 786 to 790
c.lg.Panic(
"failed to find local ID in cluster members",
zap.String("cluster-id", c.cid.String()),
zap.String("local-member-id", c.localID.String()),
)
A Member reviewer commented:
Suggest not to change this. I don't think it will happen in production or in a test environment, and if it does happen, it means something critical has occurred.

@ahrtr (Member) commented Sep 26, 2024

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes, they only reduce the set of possible behaviors by removing the possibility of panicking.

Please read #18606 (comment), and also the comments below:

  • Previously etcdserver panicked in such a case, but now it returns a rpctypes.ErrGRPCNotSupportedForLearner error. So it changes the behaviour and may mislead users.
  • Also, an error like "member not found" would be more appropriate in such a case.

Also as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

So I suggest not to change

  • server/etcdserver/api/v3rpc/interceptor.go
  • server/etcdserver/api/membership/cluster.go

Comment on lines 815 to 817
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)

A Member reviewer commented:

You can add a failpoint right above line 821 ("return ok"):

// gofail: var sleepAfterIsMemberExist struct{}

and inject sleep("1s") into it during the test.
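
Put together, the instrumented lookup would look roughly like the sketch below (again using the simplified RaftCluster stand-in from the earlier sketches; releasing the lock before the failpoint is an assumption here so that the injected sleep does not block the concurrent removal, and the exact merged code may differ):

// IsMemberExist returns if the member with the given id exists in cluster.
func (c *RaftCluster) IsMemberExist(id uint64) bool {
	c.Lock()
	_, ok := c.members[id]
	c.Unlock()

	// gofail: var sleepAfterIsMemberExist struct{}
	// With the failpoint enabled as sleep("1s"), the sleep happens here, after
	// the lock is released, widening the window in which a concurrent
	// MemberRemove can delete the member before the caller acts on `ok`.
	return ok
}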

Commit message (the PR's single commit):

There was a concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls.

Signed-off-by: Jan Schär <[email protected]>
@jscissr (Contributor, Author) commented Sep 30, 2024

I removed the changes you disliked and adjusted the failpoint.

@ahrtr (Member) commented Sep 30, 2024

/ok-to-test

@ahrtr (Member) left a review:

LGTM

Thanks

@@ -518,3 +518,51 @@ func TestSpeedyTerminate(t *testing.T) {
case <-donec:
}
}

// TestConcurrentRemoveMember demonstrated a panic in mayRemoveMember with
// concurrent calls to MemberRemove. To reliably reproduce the panic, a delay
A Member reviewer commented:

There should never be a concurrent call to MemberRemove. Cluster state changes should all be done via the apply loop, which guarantees that there is only one concurrent change to cluster state.

The reason for locking in the cluster struct (and any other internal structs) is to prevent reader+writer races. There should never be a writer+writer race.

A Member replied:

This PR has nothing to do with the apply loop/workflow. It's fixing a panic in the API layer. Refer to #18606 (comment)

A Member replied:

Oh, I didn't notice that this is an integration test. I'm just a little confused about why the MemberRemove call from the client is executed concurrently. Should it be linearized?

A Member replied:

I'm just a little confused about why the MemberRemove call from the client is executed concurrently.

Right, usually we don't expect users to do this. But etcdserver shouldn't panic because of a user's inappropriate behaviour, whatever it is. Note that the panic happens at the API layer, not in the apply loop/workflow.

@serathius (Member) commented Oct 4, 2024:

Sorry, I didn't mean "why?" but "how?". I agree that this is a problem, but I'm trying to understand how a concurrency bug in the cluster struct surfaces to the API.

@serathius (Member) left a review:

Independent of the discussion, this change is good to merge.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, jscissr, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahrtr merged commit d012386 into etcd-io:main on Oct 4, 2024. 39 checks passed.