etcdserver: fix panic when checking IsLearner of removed member #18606
Conversation
Hi @jscissr. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report — Attention: patch coverage. 21 files with indirect coverage changes.

```
@@           Coverage Diff            @@
##            main   #18606     +/-  ##
=========================================
- Coverage  68.77%   68.74%   -0.04%
=========================================
  Files        420      420
  Lines      35535    35536       +1
=========================================
- Hits       24439    24428      -11
- Misses      9666     9678      +12
  Partials    1430     1430
```
Did you see a case where the panic happened, or can you create an e2e or integration test case to make it happen?
Yes I can; here are integration tests which demonstrate the panic. I need to add an artificial delay in `IsMemberExist` to reliably show the panic.

```diff
diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
 
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)
```

Here are the stack traces when running these tests:
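Stripped of etcd specifics, the check-then-act window these tests widen can be reduced to a small self-contained sketch. The `cluster` and `member` types below are hypothetical stand-ins, not etcd's real code: a caller checks existence first and dereferences second, while a removal lands in between, so the second lookup returns nil and any field access on it would panic.

```go
package main

import (
	"fmt"
	"sync"
)

type member struct{ isLearner bool }

// cluster is a hypothetical stand-in for etcd's RaftCluster:
// a mutex-guarded map of members.
type cluster struct {
	mu      sync.Mutex
	members map[uint64]*member
}

func (c *cluster) isMemberExist(id uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.members[id]
	return ok
}

func (c *cluster) member(id uint64) *member {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.members[id] // nil if the member was removed
}

func (c *cluster) remove(id uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.members, id)
}

func main() {
	c := &cluster{members: map[uint64]*member{1: {isLearner: true}}}

	// Buggy check-then-act: each call is individually locked, but the
	// member can disappear between the existence check and the lookup.
	if c.isMemberExist(1) {
		c.remove(1) // stands in for a concurrent MemberRemove
		if m := c.member(1); m == nil {
			fmt.Println("member(1) is nil after concurrent removal; m.isLearner would panic")
		}
	}
}
```

Here the removal is done inline to make the interleaving deterministic; the integration tests above widen the same window with a one-second sleep instead.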
Thanks for the integration test cases. I agree that it may happen in theory. Did you ever see the panic in production, or in your test environment with the official etcd releases? I believe not, so overall these are minor issues to me. The change to `tests/integration/v3_lease_test.go` line 1086 in 59cfd7a
For the change to
Force-pushed from d61be85 to 605abca.
I found the bug by reading the code, and have indeed not observed it happen without the added delay. I have added the
For
```go
c.lg.Panic(
	"failed to find local ID in cluster members",
	zap.String("cluster-id", c.cid.String()),
	zap.String("local-member-id", c.localID.String()),
)
```
Suggest not to change this. I don't think it will happen in a production or test environment; if it does happen, it means something critical has occurred.
Please read #18606 (comment), and also the comments below.
Also, as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious. So I suggest not to change
```go
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)
```
You can add a failpoint right above line 821 (`return ok`):

```go
// gofail: var sleepAfterIsMemberExist struct{}
```

and inject `sleep("1s")` into it during the test.
There was a concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls. Signed-off-by: Jan Schär <[email protected]>
Force-pushed from 605abca to 3374e27.
I removed the changes which you disliked, and adjusted the failpoint.
/ok-to-test
LGTM
Thanks
```diff
@@ -518,3 +518,51 @@ func TestSpeedyTerminate(t *testing.T) {
 	case <-donec:
 	}
 }
+
+// TestConcurrentRemoveMember demonstrated a panic in mayRemoveMember with
+// concurrent calls to MemberRemove. To reliably reproduce the panic, a delay
```
There should never be a concurrent call to `MemberRemove`. Cluster state changes should all be done via the apply loop, which guarantees that there is only one concurrent change to cluster state.
The reason for locking in the cluster struct (and any other internal structs) is to prevent reader+writer races. There should never be a writer+writer race.
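The design described here, a single apply goroutine acting as the sole writer while the lock only guards concurrent readers against that one writer, can be sketched in a few lines. The names below (`store`, `apply`, `exists`) are hypothetical, not etcd's actual apply-loop code:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical cluster-state holder: all writes go through a
// single apply goroutine, so the mutex only has to prevent reader+writer
// races; writer+writer races cannot occur by construction.
type store struct {
	mu      sync.RWMutex
	members map[uint64]bool
}

// apply is the only writer: proposed removals are applied one at a time.
func (s *store) apply(ops <-chan uint64, done chan<- struct{}) {
	for id := range ops {
		s.mu.Lock()
		delete(s.members, id)
		s.mu.Unlock()
	}
	close(done)
}

// exists may be called from many goroutines concurrently with apply.
func (s *store) exists(id uint64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.members[id]
}

func main() {
	s := &store{members: map[uint64]bool{1: true, 2: true}}
	ops := make(chan uint64)
	done := make(chan struct{})
	go s.apply(ops, done)

	ops <- 1 // both removals are serialized by the apply goroutine,
	ops <- 2 // even if they were proposed concurrently
	close(ops)
	<-done
	fmt.Println(s.exists(1), s.exists(2)) // prints "false false"
}
```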
This PR has nothing to do with the apply loop/workflow. It's fixing a panic in the API layer. Refer to #18606 (comment)
Oh, I didn't notice that this is an integration test. I'm just a little confused why the MemberRemove call by the client is executed concurrently. Should it be linearized?
> I'm just a little confused why the MemberRemove call by the client is executed concurrently.

Right, usually we don't expect users to do this. But etcdserver shouldn't panic due to any inappropriate user behaviour. Note that the panic happens at the API layer, not in the apply loop/workflow.
Sorry, I didn't mean "why?", but "how?". I agree that this is a problem, but I'm trying to understand how a concurrency bug in the cluster struct surfaces to the API.
Independent of the discussion, this change is good to merge.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahrtr, jscissr, serathius.
Previously, calling s.IsLearner() when the local node was no longer a member would panic. There was an attempt to fix this by first checking IsMemberExist(), but that is not a correct fix, because the member can be removed between the two calls. Instead of panicking when the member has been removed, IsLearner() should return false: a node which is not a member is also not a learner.
There was a similar concurrency bug when accessing the IsLearner property of a member, which would panic with a nil-pointer access error if the member is removed between the IsMemberExist() and Member() calls.
I did not add a unit test because it's basically impossible to test for such concurrency bugs.
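The shape of the fix described above, replacing the two-step IsMemberExist()/Member() sequence with a single accessor that takes the lock once and treats a missing member as "not a learner", can be sketched as follows. These are simplified hypothetical types, not the PR's literal diff:

```go
package main

import (
	"fmt"
	"sync"
)

type member struct{ isLearner bool }

type cluster struct {
	mu      sync.Mutex
	members map[uint64]*member
}

// isLearner performs the lookup and the field read under a single lock
// acquisition, so a concurrent removal can no longer slip in between.
// A node that is not a member is also not a learner, so a missing
// member yields false instead of a nil-pointer panic.
func (c *cluster) isLearner(id uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	m, ok := c.members[id]
	return ok && m.isLearner
}

func main() {
	c := &cluster{members: map[uint64]*member{1: {isLearner: true}}}
	fmt.Println(c.isLearner(1)) // true
	delete(c.members, 1)        // simulate removal (safe here: main is the only goroutine)
	fmt.Println(c.isLearner(1)) // false, not a panic
}
```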