--experimental-wait-cluster-ready-timeout causing stale response to linearizable read #16666
cc @gyuho @jpbetz @philips @hexfusion @bdarnell Would be great to get your opinion.
I will first try to create a dedicated test case that reproduces this issue on the key space next week.
Hmm, it doesn't reproduce for me. I was trying to use the robustness tests on a 1-node cluster with serializable requests. Found a bug in the test though, see #16658 (comment).
Ok, found it: #16672, but it's worse. It looks like a single-node linearizability issue.
Summarized the current findings in the top comment. Thanks to @ahrtr, who already prepared two different fix proposals (etcd-io/raft#105, #16675). I think the proper solution will require making some trade-offs, so we should revisit the design of the etcd ready code.
Did some testing for the third case, however without success. Possibly I'm missing some dependency between cluster readiness and applying raft entries. If there was this kind of issue in etcd, we would have caught it earlier. I will follow up on this in #16673 to ensure we properly test and find all races in etcd bootstrap. Still, the issue of …
Just added two test cases which can reproduce the issue super easily. Rolling back #13525 is the most pragmatic approach. It can resolve this issue, but it will also roll back the fix for the following two issues:
The first issue doesn't seem like a big deal, but the second one is a major issue that we should take care of in the /livez and /readyz feature. Specifically, when a node is blocked on the ReadyNotify, the …
Thanks @serathius @ahrtr for taking care of this. I believe I have seen this issue in release-3.4. The reproduction is simple: restart the etcd server right after a member reconfiguration and query the member list via HTTP.
Adding: [embedded code snippet, lines 185 to 200 in 979102f] … etcdctl with this assumption.
So my stance is rolling #13525 back for correctness. /cc @siyuanfoundation @wenjiaswe, we might be impacted by the serializable check method as mentioned in #16666 (comment). :/
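A minimal Go sketch of the reproduction described above: after the member reconfiguration and the kill+restart, poll the member list and check whether a stale (pre-reconfiguration) view is ever returned. The endpoint, expected member count, and retry timings are assumptions for illustration, not part of the original report:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // assumed endpoint
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const expectedMembers = 2 // assumed: one member was just added before the restart

	// Right after a kill+restart, the member list may briefly reflect the
	// pre-reconfiguration membership (the intermediate-result problem above).
	for i := 0; i < 20; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		resp, err := cli.MemberList(ctx)
		cancel()
		if err != nil {
			log.Printf("member list failed (server may still be starting): %v", err)
		} else if len(resp.Members) != expectedMembers {
			fmt.Printf("stale member list: got %d members, want %d\n", len(resp.Members), expectedMembers)
			return
		}
		time.Sleep(200 * time.Millisecond)
	}
	fmt.Println("no stale member list observed")
}
```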
Actually, main also has this "issue", but it might not be a problem, for two reasons:
EDIT: but I still think it may be better to guard the …
Sounds like a startup probe is needed in this case, since the risk of … Don't want to derail the issue; I will follow this up in https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing
With #16677 merged we can decide on the next steps. @ahrtr I expect you want to re-implement …
We don't have to re-implement …
So "Indefinitely waiting for the ReadyNotify" + "enhance /livez to take care of the corner case" on bootstrap is the safest solution I can think of. |
I agree with the conclusion. We can revisit if needed in the future. Thanks for the great collaboration. Closing, as the main issue was fixed and the remaining work will be tracked in follow-ups:
Inject sleep during etcd bootstrap to reproduce #16666
@serathius My understanding is that local node proposals will be appended at the end of the replaying entries queue. Only when all the previous entries have been replayed will the new proposal to publish its name and clientURLs succeed (see etcd/server/etcdserver/v3_server.go, line 767 at aa97484).
However, I think this default readiness is not good enough. The upcoming …
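A toy illustration of the ordering described above, assuming a single strictly in-order apply queue: the member's own publish proposal sits behind all replayed WAL entries, so it cannot complete before replay finishes. The types and indexes here are illustrative, not etcd's actual implementation:

```go
package main

import "fmt"

// entry models a raft log entry in a simplified, strictly in-order apply queue.
type entry struct {
	index   uint64
	publish bool // true for the member's own "publish name/clientURLs" proposal
}

func main() {
	// Entries recovered from the WAL during bootstrap are queued first...
	queue := []entry{{index: 1}, {index: 2}, {index: 3}}
	// ...and the member's publish proposal is appended after them.
	queue = append(queue, entry{index: 4, publish: true})

	for _, e := range queue {
		// Applying strictly in order means the publish proposal (and thus the
		// "cluster ready" signal) cannot complete before every pre-crash entry
		// has been re-applied.
		if e.publish {
			fmt.Printf("applied publish proposal at index %d; member is now advertised\n", e.index)
		} else {
			fmt.Printf("replayed WAL entry %d\n", e.index)
		}
	}
}
```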
While working on and discussing #16658 with @ahrtr I spotted one issue: etcd going back in time during bootstrap. The good news is that the current reproduction limits the issue to the v3.6 release. The impact on older releases is still under investigation.
As described in #16658 (comment), a graceful shutdown via SIGTERM allows etcd to flush its database to disk, whereas a SIGKILL means the data on disk might be older than etcd's in-memory state. While bootstrapping, etcd catches up on the changes it has forgotten by replaying the WAL. Etcd might go back in time if it starts serving data before it has caught up to its pre-kill state. But this doesn't happen, right?
Unfortunately, it is possible. In #13525 we added a flag that skips the wait. Fortunately the change is only present in v3.6.
The goal of --experimental-wait-cluster-ready-timeout was to skip the wait for quorum during member bootstrap, allowing the member to start serving serializable requests. Unfortunately, this also skips the wait for entries from before the crash to be applied. The skip happens by default; fortunately, the 5s wait is enough to prevent this in most cases. After looking deeper I found more issues.
The TestCtlV3ConsistentMemberList test, introduced in "Fix memberList may return incorrect intermediate results right after bootstrap" (#16658), uses SIGKILL.
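A hedged sketch of how the "going back in time" symptom could be observed from a client: record the revision seen with a linearizable read before the SIGKILL, then compare it with the revision returned by a serializable read right after the restart. The endpoint, the sentinel key, and the read sequence are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// currentRevision returns the store revision reported by a read; serializable
// reads are served from the local member without a quorum round trip.
func currentRevision(cli *clientv3.Client, serializable bool) (int64, error) {
	opts := []clientv3.OpOption{}
	if serializable {
		opts = append(opts, clientv3.WithSerializable())
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "sentinel", opts...) // assumed key; only the header matters
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // assumed endpoint
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	before, err := currentRevision(cli, false) // linearizable read before the kill
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("revision before SIGKILL:", before)

	// ... SIGKILL and restart the member here (outside this sketch) ...

	after, err := currentRevision(cli, true) // serializable read right after restart
	if err != nil {
		log.Fatal(err)
	}
	if after < before {
		fmt.Printf("went back in time: revision %d < %d\n", after, before)
	}
}
```

If the post-restart serializable read ever reports a lower revision than the pre-kill read, the member served data before it finished replaying the WAL, which is the staleness discussed in this issue.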