Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: fix panic when checking IsLearner of removed member #18606

Merged
merged 1 commit into from
Oct 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion server/etcdserver/api/membership/cluster.go
Original file line number Diff line number Diff line change
Expand Up @@ -817,8 +817,10 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
// IsMemberExist returns if the member with the given id exists in cluster.
func (c *RaftCluster) IsMemberExist(id types.ID) bool {
c.Lock()
defer c.Unlock()
_, ok := c.members[id]
c.Unlock()

// gofail: var afterIsMemberExist struct{}
return ok
}

Expand Down
7 changes: 4 additions & 3 deletions server/etcdserver/server.go
Original file line number Diff line number Diff line change
Expand Up @@ -1224,7 +1224,8 @@ func (s *EtcdServer) isLeader() bool {

// MoveLeader transfers the leader to the given transferee.
func (s *EtcdServer) MoveLeader(ctx context.Context, lead, transferee uint64) error {
if !s.cluster.IsMemberExist(types.ID(transferee)) || s.cluster.Member(types.ID(transferee)).IsLearner {
member := s.cluster.Member(types.ID(transferee))
if member == nil || member.IsLearner {
return errors.ErrBadLeaderTransferee
}

Expand Down Expand Up @@ -1593,9 +1594,9 @@ func (s *EtcdServer) mayRemoveMember(id types.ID) error {
}

lg := s.Logger()
isLearner := s.cluster.IsMemberExist(id) && s.cluster.Member(id).IsLearner
member := s.cluster.Member(id)
// no need to check quorum when removing non-voting member
if isLearner {
if member != nil && member.IsLearner {
return nil
}

Expand Down
48 changes: 48 additions & 0 deletions tests/integration/cluster_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -518,3 +518,51 @@ func TestSpeedyTerminate(t *testing.T) {
case <-donec:
}
}

// TestConcurrentRemoveMember demonstrated a panic in mayRemoveMember with
// concurrent calls to MemberRemove. To reliably reproduce the panic, a delay
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should never be a concurrent call to MemberRemove. Cluster state changes should be all done via apply loop, which guarantees that there is only one concurrent change to cluster state.

The reason for locking in cluster struct (and any other internal structs) is to prevent reader+writer races. There should never be writer+writer race.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has nothing to do with the apply loop/workflow. It's fixing a panic in the API layer. Refer to #18606 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't notice that this is integration test. I'm just little confused why MemberRemove call by client is executed concurrently. Should it be linearized?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just little confused why MemberRemove call by client is executed concurrently.

Right, usually we don't expect users to do this. But etcdserver shouldn't panic due to whatever users' inappropriate behaviour. Note that the panicking happens at API layer instead of the apply loop/workflow.

Copy link
Member

@serathius serathius Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't mean "why?", but "how?". Agree that this is a problem, but I'm trying to understand how concurrency bug in cluster struct surfaces to API.

// needed to be injected in IsMemberExist, which is done using a failpoint.
// After fixing the bug, IsMemberExist is no longer called by mayRemoveMember.
func TestConcurrentRemoveMember(t *testing.T) {
integration.BeforeTest(t, integration.WithFailpoint("afterIsMemberExist", `sleep("1s")`))
c := integration.NewCluster(t, &integration.ClusterConfig{Size: 1})
defer c.Terminate(t)

addResp, err := c.Members[0].Client.MemberAddAsLearner(context.Background(), []string{"http://localhost:123"})
if err != nil {
t.Fatal(err)
}
removeID := addResp.Member.ID
done := make(chan struct{})
go func() {
time.Sleep(time.Second / 2)
c.Members[0].Client.MemberRemove(context.Background(), removeID)
close(done)
}()
if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
t.Fatal(err)
}
<-done
}

func TestConcurrentMoveLeader(t *testing.T) {
integration.BeforeTest(t, integration.WithFailpoint("afterIsMemberExist", `sleep("1s")`))
c := integration.NewCluster(t, &integration.ClusterConfig{Size: 1})
defer c.Terminate(t)

addResp, err := c.Members[0].Client.MemberAddAsLearner(context.Background(), []string{"http://localhost:123"})
if err != nil {
t.Fatal(err)
}
removeID := addResp.Member.ID
done := make(chan struct{})
go func() {
time.Sleep(time.Second / 2)
c.Members[0].Client.MoveLeader(context.Background(), removeID)
close(done)
}()
if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
t.Fatal(err)
}
<-done
}
4 changes: 2 additions & 2 deletions tests/robustness/makefile.mk
Original file line number Diff line number Diff line change
Expand Up @@ -54,15 +54,15 @@ install-gofail: $(GOPATH)/bin/gofail

.PHONY: gofail-enable
gofail-enable: $(GOPATH)/bin/gofail
$(GOPATH)/bin/gofail enable server/etcdserver/ server/lease/leasehttp server/storage/backend/ server/storage/mvcc/ server/storage/wal/ server/etcdserver/api/v3rpc/
$(GOPATH)/bin/gofail enable server/etcdserver/ server/lease/leasehttp server/storage/backend/ server/storage/mvcc/ server/storage/wal/ server/etcdserver/api/v3rpc/ server/etcdserver/api/membership/
cd ./server && go get go.etcd.io/gofail@${GOFAIL_VERSION}
cd ./etcdutl && go get go.etcd.io/gofail@${GOFAIL_VERSION}
cd ./etcdctl && go get go.etcd.io/gofail@${GOFAIL_VERSION}
cd ./tests && go get go.etcd.io/gofail@${GOFAIL_VERSION}

.PHONY: gofail-disable
gofail-disable: $(GOPATH)/bin/gofail
$(GOPATH)/bin/gofail disable server/etcdserver/ server/lease/leasehttp server/storage/backend/ server/storage/mvcc/ server/storage/wal/ server/etcdserver/api/v3rpc/
$(GOPATH)/bin/gofail disable server/etcdserver/ server/lease/leasehttp server/storage/backend/ server/storage/mvcc/ server/storage/wal/ server/etcdserver/api/v3rpc/ server/etcdserver/api/membership/
cd ./server && go mod tidy
cd ./etcdutl && go mod tidy
cd ./etcdctl && go mod tidy
Expand Down
Loading