Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/members peer handler is not guarded with linearizable check #16687

Open
4 tasks done
chaochn47 opened this issue Oct 3, 2023 · 5 comments
Open
4 tasks done

/members peer handler is not guarded with linearizable check #16687

chaochn47 opened this issue Oct 3, 2023 · 5 comments

Comments

@chaochn47
Copy link
Member

Bug report criteria

What happened?

Restart etcd server right after member reconfiguration and query the member list via HTTP /members, its handler will bypass the linearizable check. The member list response would be stale during bootstrap (restart) where the members are restored from v2 store and WAL is still replaying...

What did you expect to happen?

When peer certificate is not used, /members should be protected by linearizable check.

How can we reproduce it (as minimally and precisely as possible)?

  1. checkout code in https://github.com/chaochn47/etcd/tree/v3.4.20-eks.0-reproduce-member-name-mismatch
  2. cd integration && go test -v -run TestReproduceMemberNameMismatch

Anything else we need to know?

Relevant to discussion

Restart etcd server right after member reconfiguration and query the member list via HTTP /members, its handler will bypass the linearizable check.

Actually the main also has this "issue", but this might not be a problem. Two reasons:

* The `/members` is only supposed to accessed during bootstrap by a new etcd member who has just been added to the cluster;

* The `/members` is protected by peer certificate; in other words, users shouldn't be able to access this endpoint at all.

EDIT: but I still think it may be better to guard the /members by linearizable check. Please raise a separate issue to track it, and we can discuss it separately.

#16666 (comment)

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@chaochn47
Copy link
Member Author

/cc @ahrtr

@chaochn47 chaochn47 changed the title Guard /members peer handler with linearizable check /members peer handler is not guarded with linearizable check Oct 3, 2023
@serathius
Copy link
Member

QQ, why we have a non-grpc endpoint for listing members? Can someone do some archaeology because for me it seems like legacy from v2 API.

@chaochn47 chaochn47 self-assigned this Oct 3, 2023
@chaochn47
Copy link
Member Author

chaochn47 commented Oct 4, 2023

From what I can tell, it is used in runtime reconfiguration (adding a new member).

existingCluster, gerr := GetClusterFromRemotePeers(cfg.Logger, getRemotePeerURLs(cl, cfg.Name), prt)

When the new member process starts, it needs to verify the existing cluster ID matches the local --initial-cluster configuration, assign ID to this new member and ID is from the remote, validate the max learners does not exceed the configuration... etc.

@serathius
Copy link
Member

This looks like code used to bootstrap member joining a existing cluster. For example when increasing cluster size. From brief look, call to members is used to get mapping between member names and ids.

I would need to double check how fetching of those ids are later synchronized with raft, but I suspect that getting the freshest data might not be necessary as long as raft is properly initiated and replayed.

As this is internal endpoint our only worry should be whether this stale data here could impact member bootstrap. We should be very careful when adding a linearizability requirement here. It could bring more harm than benefit. First I would want to see that we are able to find and reproduce issue with etcd member bootstrap that could be caused by stale data returned on /members endpoint.

@chaochn47
Copy link
Member Author

chaochn47 commented Oct 4, 2023

It could bring more harm than benefit.

Agree. Here is a historical issue I opened #14174 with reproduce. The conclusion is #14174 (comment)

Discussed in #14175, the feasible solution in short term is

* Add retry in client side to ensure the membership reconfiguration applied to all members.

* Wait until retry succeeds, start new member.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants