
--experimental-wait-cluster-ready-timeout causing stale response to linearizable read #16666

Closed
serathius opened this issue Sep 29, 2023 · 14 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/bug

Comments


serathius commented Sep 29, 2023

While working on #16658 and discussing it with @ahrtr, I spotted one issue: etcd going back in time during bootstrap. The good news is that the current reproduction limits the issue to the v3.6 release. Impact on older releases is still under investigation.

As described in #16658 (comment), graceful shutdown via SIGTERM allows etcd to flush its database to disk; with SIGKILL, however, the data on disk might be older than etcd's in-memory state. While bootstrapping, etcd will catch up on the changes it has forgotten by replaying the WAL. Etcd might go back in time if it starts serving data before it has caught up to the state from before the kill. But this doesn't happen, right?
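The catch-up step can be sketched as follows. This is a minimal model with assumed names, not etcd's actual bootstrap code: the persisted applied index may lag the last index committed before the SIGKILL, and WAL replay must close that gap before the member may serve data.

```go
package main

import "fmt"

// replayWAL is a simplified model (assumed names, not etcd's real code)
// of post-crash catch-up: the on-disk applied index may be behind the
// last index recorded in the WAL, so bootstrap re-applies the missing
// entries before the member answers reads.
func replayWAL(persistedIndex uint64, walIndexes []uint64) uint64 {
	applied := persistedIndex
	for _, idx := range walIndexes {
		if idx > applied {
			applied = idx // re-apply an entry lost from the unflushed state
		}
	}
	return applied
}

func main() {
	// Disk was flushed at index 5, but entries up to 8 were committed
	// before SIGKILL; replay restores indexes 6..8.
	fmt.Println(replayWAL(5, []uint64{6, 7, 8})) // prints 8
}
```

If serving begins before this loop finishes, reads observe the stale persisted state, which is exactly the "going back in time" described above.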

Unfortunately, it's possible. In #13525 we added a flag that skips the wait. Fortunately, the change is only present in v3.6. The goal of --experimental-wait-cluster-ready-timeout was to skip the wait for quorum during member bootstrap, allowing the member to start serving serializable requests. Unfortunately, this also skips the wait for entries from before the crash to be applied. The skip happens by default; fortunately, the 5s wait is enough to prevent this in most cases.

After looking deeper I found more issues.

  • [confirmed] --experimental-wait-cluster-ready-timeout causes member serializable responses to go back in time. Repro: 51fb2d7. It reuses the TestCtlV3ConsistentMemberList test introduced in "Fix memberList may return incorrect intermediate results right after bootstrap" #16658 and uses SIGKILL.
  • [confirmed] --experimental-wait-cluster-ready-timeout breaks linearizability in a single-node cluster. In a single-node cluster, the leader does not check readIndex the same way; it trusts its local committed index, which is not guaranteed to be persisted before responding to the user. Repro: "Reproduce --wait-cluster-ready-timeout flag causing linearizability issue" #16672.
  • Even without --experimental-wait-cluster-ready-timeout, the clusterReady mechanism seems strange. It just waits for a joining member to be able to make a proposal (setting its own member attributes), which has nothing to do with replaying entries. I expect that with the right timing (injecting a sleep into the code), etcd will start serving before all entries from before the crash have been applied. I expect it can happen on both single- and multi-node clusters.
@serathius

cc @gyuho @jpbetz @philips @hexfusion @bdarnell

Would be great to get your opinion.


ahrtr commented Sep 29, 2023

I will first try to create a dedicated test case to reproduce this issue on the key space next week.

@serathius

Hmm, it doesn't reproduce for me. I was trying to use the robustness tests on a 1-node cluster with serializable requests. Found a bug in the test though: #16658 (comment)


serathius commented Sep 30, 2023

Ok, found it: #16672. But it's worse; it looks like a single-node linearizability issue.

@serathius serathius changed the title Responses for serializable requests should not go back in time --experimental-wait-cluster-ready-timeout causing single node linearizability issue Oct 1, 2023
@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 1, 2023
@serathius

Summarized the current findings in the top comment. Thanks to @ahrtr, who already prepared two different fix proposals (etcd-io/raft#105, #16675). I think the proper solution will require a trade-off, so we should revisit the design of etcd's readiness code.

@serathius serathius changed the title --experimental-wait-cluster-ready-timeout causing single node linearizability issue --experimental-wait-cluster-ready-timeout causing stale response to linearizable read Oct 2, 2023
@serathius

Did some testing for the third case, however without success. Possibly I'm missing some dependency between cluster readiness and raft apply. If there were such an issue in etcd, we would have caught it earlier. I will follow up on this in #16673 to ensure we properly test for and find all races in etcd bootstrap.

Still, the issue with --experimental-wait-cluster-ready-timeout remains. @ahrtr has proposed some fixes, but they come with downsides. My suggestion would be to remove --experimental-wait-cluster-ready-timeout first, and reconsider it after we have a proper design for a linearizable etcd bootstrap.


ahrtr commented Oct 2, 2023

Just added two test cases which can reproduce the issue super easily.

Rolling back #13525 is the most pragmatic approach. It resolves this issue, but it also rolls back the fixes for the following two issues:

  • etcd can't serve serializable requests after restarts if quorum isn't satisfied;
  • a K8s liveness probe might restart an etcd node while the other nodes are not started yet or are still starting. That makes the situation even worse and leads to a vicious circle.

The first issue doesn't seem like a big deal, but the second one is major; we should take care of it in the /livez and /readyz feature. Specifically, when a node is blocked on ReadyNotify, /livez should NOT return a not-live status to the client.
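The distinction can be sketched as follows. This is a hypothetical illustration, not etcd's actual probe handlers: a node blocked on ReadyNotify during bootstrap is alive but not ready, so /livez should keep returning 200 while /readyz may return 503.

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical probe logic, not etcd's actual implementation: a member
// blocked on ReadyNotify during bootstrap is making normal progress.

// livezStatus stays OK even while bootstrapping; failing it here would
// let a K8s liveness probe restart the node and worsen the bootstrap.
func livezStatus(blockedOnReadyNotify bool) int {
	return http.StatusOK
}

// readyzStatus signals "don't send traffic yet" until readiness fires.
func readyzStatus(blockedOnReadyNotify bool) int {
	if blockedOnReadyNotify {
		return http.StatusServiceUnavailable // 503
	}
	return http.StatusOK
}

func main() {
	fmt.Println("livez while blocked:", livezStatus(true))   // 200
	fmt.Println("readyz while blocked:", readyzStatus(true)) // 503
}
```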


chaochn47 commented Oct 3, 2023

Thanks @serathius @ahrtr for taking care of this. I believe I have seen this issue in release-3.4. The reproduction is simple: restart the etcd server right after member reconfiguration and query the member list via the HTTP /members endpoint; its handler bypasses the linearizable check. The member list response is stale during bootstrap (restart), when the members are restored from the v2 store and the WAL is still replaying...

```shell
# checkout code in
# https://github.com/chaochn47/etcd/tree/v3.4.20-eks.0-reproduce-member-name-mismatch
cd integration && go test -v -run TestReproduceMemberNameMismatch
```

Adding --experimental-wait-cluster-ready-timeout in 3.6 seems to surface this issue, whereas it could have been guarded by publishing the local member attributes (name and client URL) to the cluster. Note that the etcd client assumes that if the name or clientURL (the member attributes) is empty, the server has not yet started, and it is not added to the endpoints before serving traffic.

etcd/client/v3/client.go, lines 185 to 200 at 979102f:

```go
// Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
func (c *Client) Sync(ctx context.Context) error {
	mresp, err := c.MemberList(ctx)
	if err != nil {
		return err
	}
	var eps []string
	for _, m := range mresp.Members {
		if len(m.Name) != 0 && !m.IsLearner {
			eps = append(eps, m.ClientURLs...)
		}
	}
	c.SetEndpoints(eps...)
	c.lg.Debug("set etcd endpoints by autoSync", zap.Strings("endpoints", eps))
	return nil
}
```

I bet I have seen this assumption in other code bases like etcdctl.

So my stance is rolling back #13525 for correctness.

/cc @siyuanfoundation @wenjiaswe, we might be impacted by the serializable check method as mentioned in #16666 (comment) :/


ahrtr commented Oct 3, 2023

> Restart etcd server right after member reconfiguration and query the member list via HTTP /members, its handler will bypass the linearizable check.

Actually, main also has this "issue", but it might not be a problem, for two reasons:

  • The /members endpoint is only supposed to be accessed during bootstrap by a new etcd member that has just been added to the cluster;
  • The /members endpoint is protected by the peer certificate; in other words, users shouldn't be able to access it at all.

EDIT: but I still think it may be better to guard /members with a linearizable check. Please raise a separate issue to track it, and we can discuss it separately.

@chaochn47

> K8s liveness probe might restart an etcd node while the other nodes are not started yet or are still starting. That makes the situation even worse and leads to a vicious circle.

> But the second one is major; we should take care of it in #16651. Specifically, when a node is blocked on ReadyNotify, /livez should NOT return a not-live status to the client.

Sounds like a startup probe is needed in this case, since the risk of /livez returning a not-live status only exists during bootstrap. Long etcd bootstrap latency (bbolt initialization) was also observed in one of the test clusters: #15397 (reply in thread).

Don't want to derail the issue and will follow this up in https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing

@serathius

With #16677 merged, we can decide on the next steps.

@ahrtr I expect you want to re-implement the --experimental-wait-cluster-ready-timeout flag. Before that we need to agree on a couple of things: which of your proposed solutions gives us the best trade-off, and whether serializable requests should provide sequential consistency (i.e. not go back in time). Do you want to continue in this issue, or should we close it and open a separate one?


ahrtr commented Oct 4, 2023

We don't have to re-implement --experimental-wait-cluster-ready-timeout.

So "indefinitely waiting for ReadyNotify" plus "enhancing /livez to take care of the corner case" on bootstrap is the safest solution I can think of.

@serathius

I agree with the conclusion. We can revisit it if needed in the future.

Thanks for the great collaboration.

Closing, as the main issue was fixed and the remaining work will be tracked in follow-ups.

serathius added a commit that referenced this issue Oct 7, 2023
Inject sleep during etcd bootstrap to reproduce #16666
@chaochn47

> Even without --experimental-wait-cluster-ready-timeout, the clusterReady mechanism seems strange. It just waits for a joining member to be able to make a proposal (setting its own member attributes), which has nothing to do with replaying entries. I expect that with the right timing (injecting a sleep into the code), etcd will start serving before all entries from before the crash have been applied. I expect it can happen on both single- and multi-node clusters.

@serathius My understanding is that local-node proposals are appended after the entries being replayed.

Only when all the previous entries have been replayed will the new proposal to publish the member's name and clientURLs succeed (the publish path blocks on `case x := <-ch:` until its entry is applied). So there should not be a linearizability concern.
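The ordering argument can be sketched as follows. This is a simplified model with assumed names, not etcd's actual apply loop: committed entries are applied in log order, so the publish proposal, which is appended after restart, cannot be applied before the replayed entries.

```go
package main

import "fmt"

// applyInOrder models (with assumed names) raft's FIFO apply: the
// member's publish proposal is appended after the entries being
// replayed, so it is applied only once all of them have been applied.
func applyInOrder(replayed []string, publish string) []string {
	queue := append(append([]string{}, replayed...), publish)
	applied := make([]string, 0, len(queue))
	for _, e := range queue {
		applied = append(applied, e) // strict log order
	}
	return applied
}

func main() {
	out := applyInOrder([]string{"put a=1", "put a=2"}, "publish member attrs")
	fmt.Println(out[len(out)-1]) // the publish entry is applied last
}
```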

However, I think this default readiness check is not good enough. The upcoming /readyz and gRPC changes should further guarantee that etcd is ready to receive traffic.
