
--experimental-wait-cluster-ready-timeout causing stale response to linearizable read #16666

Closed
serathius opened this issue Sep 29, 2023 · 14 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/bug

Comments


serathius commented Sep 29, 2023

While working on #16658 and discussing it with @ahrtr, I spotted one issue: etcd going back in time during bootstrap. The good news is that the current reproduction limits the issue to the v3.6 release. Impact on older releases is still under investigation.

As described in #16658 (comment), graceful shutdown via SIGTERM allows etcd to flush its database to disk; with SIGKILL, however, the data on disk might be older than etcd's in-memory state. While bootstrapping, etcd will catch up on the changes it has forgotten by replaying the WAL. Etcd might go back in time if it starts serving data before it has caught up to the state from before the kill. But this doesn't happen, right?
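The catch-up step can be sketched as follows. This is a minimal model with assumed names, not etcd's actual bootstrap code: the persisted applied index may lag the last index committed before the SIGKILL, and WAL replay must close that gap before the member may serve data.

```go
package main

import "fmt"

// replayWAL is a simplified model (assumed names, not etcd's real code)
// of post-crash catch-up: the on-disk applied index may be behind the
// last index recorded in the WAL, so bootstrap re-applies the missing
// entries before the member answers reads.
func replayWAL(persistedIndex uint64, walIndexes []uint64) uint64 {
	applied := persistedIndex
	for _, idx := range walIndexes {
		if idx > applied {
			applied = idx // re-apply an entry lost from the unflushed state
		}
	}
	return applied
}

func main() {
	// Disk was flushed at index 5, but entries up to 8 were committed
	// before SIGKILL; replay restores indexes 6..8.
	fmt.Println(replayWAL(5, []uint64{6, 7, 8})) // prints 8
}
```

If serving begins before this loop finishes, reads observe the stale persisted state, which is exactly the "going back in time" described above.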

Unfortunately, it's possible. In #13525 we added a flag that skips the wait. Fortunately, the change is only present in v3.6. The goal of --experimental-wait-cluster-ready-timeout was to skip the wait for quorum during member bootstrap, allowing the member to start serving serializable requests. Unfortunately, this also skips the wait for entries from before the crash to be applied. The skip happens by default; fortunately, the 5s wait is enough to prevent this in most cases.

After looking deeper I found more issues.

  • [confirmed] --experimental-wait-cluster-ready-timeout causes member serializable responses to go back in time. Repro: 51fb2d7. It reuses the TestCtlV3ConsistentMemberList test introduced in "Fix memberList may return incorrect intermediate results right after bootstrap" #16658 and uses SIGKILL.
  • [confirmed] --experimental-wait-cluster-ready-timeout breaks linearizability in a single-node cluster. In a single-node cluster, the leader does not check readIndex the same way; it trusts its local committed index, which is not guaranteed to be persisted before responding to the user. Repro: "Reproduce --wait-cluster-ready-timeout flag causing linearizability issue" #16672.
  • Even without --experimental-wait-cluster-ready-timeout, the clusterReady mechanism seems strange. It just waits for a joining member to be able to make a proposal (setting its own member attributes), which has nothing to do with replaying entries. I expect that with the right timing (injecting a sleep into the code), etcd will start serving before all entries from before the crash have been applied. I expect it can happen on both single- and multi-node clusters.
@serathius

cc @gyuho @jpbetz @philips @hexfusion @bdarnell

Would be great to get your opinion.


ahrtr commented Sep 29, 2023

I will first try to create a dedicated test case to reproduce this issue on the key space next week.

@serathius

Hmm, it doesn't reproduce for me. I was trying to use the robustness tests on a 1-node cluster with serializable requests. Found a bug in the test though: #16658 (comment)


serathius commented Sep 30, 2023

Ok, found it: #16672. But it's worse; it looks like a single-node linearizability issue.

@serathius serathius changed the title Responses for serializable requests should not go back in time --experimental-wait-cluster-ready-timeout causing single node linearizability issue Oct 1, 2023
@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 1, 2023
@serathius

Summarized the current findings in the top comment. Thanks to @ahrtr, who already prepared two different fix proposals (etcd-io/raft#105, #16675). I think the proper solution will require a trade-off, so we should revisit the design of etcd's readiness code.

@serathius serathius changed the title --experimental-wait-cluster-ready-timeout causing single node linearizability issue --experimental-wait-cluster-ready-timeout causing stale response to linearizable read Oct 2, 2023
@serathius

Did some testing for the third case, however without success. Possibly I'm missing some dependency between cluster readiness and raft apply. If there were such an issue in etcd, we would have caught it earlier. I will follow up on this in #16673 to ensure we properly test for and find all races in etcd bootstrap.

Still, the issue with --experimental-wait-cluster-ready-timeout remains. @ahrtr has proposed some fixes, but they come with downsides. My suggestion would be to remove --experimental-wait-cluster-ready-timeout first, and reconsider it after we have a proper design for a linearizable etcd bootstrap.


ahrtr commented Oct 2, 2023

Just added two test cases which can reproduce the issue super easily.

Rolling back #13525 is the most pragmatic approach. It resolves this issue, but it also rolls back the fixes for the following two issues:

  • etcd can't serve serializable requests after restarts if quorum isn't satisfied;
  • a K8s liveness probe might restart an etcd node while the other nodes are not started yet or are still starting. That makes the situation even worse and leads to a vicious circle.

The first issue doesn't seem like a big deal, but the second one is major; we should take care of it in the /livez and /readyz feature. Specifically, when a node is blocked on ReadyNotify, /livez should NOT return a not-live status to the client.
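The distinction can be sketched as follows. This is a hypothetical illustration, not etcd's actual probe handlers: a node blocked on ReadyNotify during bootstrap is alive but not ready, so /livez should keep returning 200 while /readyz may return 503.

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical probe logic, not etcd's actual implementation: a member
// blocked on ReadyNotify during bootstrap is making normal progress.

// livezStatus stays OK even while bootstrapping; failing it here would
// let a K8s liveness probe restart the node and worsen the bootstrap.
func livezStatus(blockedOnReadyNotify bool) int {
	return http.StatusOK
}

// readyzStatus signals "don't send traffic yet" until readiness fires.
func readyzStatus(blockedOnReadyNotify bool) int {
	if blockedOnReadyNotify {
		return http.StatusServiceUnavailable // 503
	}
	return http.StatusOK
}

func main() {
	fmt.Println("livez while blocked:", livezStatus(true))   // 200
	fmt.Println("readyz while blocked:", readyzStatus(true)) // 503
}
```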


chaochn47 commented Oct 3, 2023

Thanks @serathius @ahrtr for taking care of this. I believe I have seen this issue in release-3.4. The reproduction is simple: restart the etcd server right after member reconfiguration and query the member list via the HTTP /members endpoint; its handler bypasses the linearizable check. The member list response is stale during bootstrap (restart), when the members are restored from the v2 store and the WAL is still replaying...

```shell
# checkout code in
# https://github.com/chaochn47/etcd/tree/v3.4.20-eks.0-reproduce-member-name-mismatch
cd integration && go test -v -run TestReproduceMemberNameMismatch
```

Adding --experimental-wait-cluster-ready-timeout in 3.6 seems to surface this issue, whereas it could have been guarded by publishing the local member attributes (name and client URL) to the cluster. Note that the etcd client assumes that if the name or clientURL (the member attributes) is empty, the server has not yet started, and it is not added to the endpoints before serving traffic.

etcd/client/v3/client.go, lines 185 to 200 at 979102f:

```go
// Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
func (c *Client) Sync(ctx context.Context) error {
	mresp, err := c.MemberList(ctx)
	if err != nil {
		return err
	}
	var eps []string
	for _, m := range mresp.Members {
		if len(m.Name) != 0 && !m.IsLearner {
			eps = append(eps, m.ClientURLs...)
		}
	}
	c.SetEndpoints(eps...)
	c.lg.Debug("set etcd endpoints by autoSync", zap.Strings("endpoints", eps))
	return nil
}
```

I bet I have seen this assumption in other code bases like etcdctl.

So my stance is rolling back #13525 for correctness.

/cc @siyuanfoundation @wenjiaswe, we might be impacted by the serializable check method as mentioned in #16666 (comment) :/


ahrtr commented Oct 3, 2023

> Restart etcd server right after member reconfiguration and query the member list via HTTP /members, its handler will bypass the linearizable check.

Actually, main also has this "issue", but it might not be a problem, for two reasons:

  • The /members endpoint is only supposed to be accessed during bootstrap by a new etcd member that has just been added to the cluster;
  • The /members endpoint is protected by the peer certificate; in other words, users shouldn't be able to access it at all.

EDIT: but I still think it may be better to guard /members with a linearizable check. Please raise a separate issue to track it, and we can discuss it separately.

@chaochn47

> K8s liveness probe might restart an etcd node while the other nodes are not started yet or are still starting. That makes the situation even worse and leads to a vicious circle.

> But the second one is major; we should take care of it in #16651. Specifically, when a node is blocked on ReadyNotify, /livez should NOT return a not-live status to the client.

Sounds like a startup probe is needed in this case, since the risk of /livez returning a not-live status only exists during bootstrap. Long etcd bootstrap latency (bbolt initialization) was also observed in one of the test clusters: #15397 (reply in thread).

Don't want to derail the issue and will follow this up in https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing

@serathius

With #16677 merged, we can decide on the next steps.

@ahrtr I expect you want to re-implement the --experimental-wait-cluster-ready-timeout flag. Before that we need to agree on a couple of things: which of your proposed solutions gives us the best trade-off, and whether serializable requests should provide sequential consistency (i.e. not go back in time). Do you want to continue in this issue, or should we close it and open a separate one?


ahrtr commented Oct 4, 2023

We don't have to re-implement --experimental-wait-cluster-ready-timeout.

So "indefinitely waiting for ReadyNotify" plus "enhancing /livez to take care of the corner case" on bootstrap is the safest solution I can think of.

@serathius

I agree with the conclusion. We can revisit it if needed in the future.

Thanks for the great collaboration.

Closing, as the main issue was fixed and the remaining work will be tracked in follow-ups.

serathius added a commit that referenced this issue Oct 7, 2023
Inject sleep during etcd bootstrap to reproduce #16666
@chaochn47

> Even without --experimental-wait-cluster-ready-timeout, the clusterReady mechanism seems strange. It just waits for a joining member to be able to make a proposal (setting its own member attributes), which has nothing to do with replaying entries. I expect that with the right timing (injecting a sleep into the code), etcd will start serving before all entries from before the crash have been applied. I expect it can happen on both single- and multi-node clusters.

@serathius My understanding is that local-node proposals are appended after the entries being replayed.

Only when all the previous entries have been replayed will the new proposal to publish the member's name and clientURLs succeed (the publish path blocks on `case x := <-ch:` until its entry is applied). So there should not be a linearizability concern.
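The ordering argument can be sketched as follows. This is a simplified model with assumed names, not etcd's actual apply loop: committed entries are applied in log order, so the publish proposal, which is appended after restart, cannot be applied before the replayed entries.

```go
package main

import "fmt"

// applyInOrder models (with assumed names) raft's FIFO apply: the
// member's publish proposal is appended after the entries being
// replayed, so it is applied only once all of them have been applied.
func applyInOrder(replayed []string, publish string) []string {
	queue := append(append([]string{}, replayed...), publish)
	applied := make([]string, 0, len(queue))
	for _, e := range queue {
		applied = append(applied, e) // strict log order
	}
	return applied
}

func main() {
	out := applyInOrder([]string{"put a=1", "put a=2"}, "publish member attrs")
	fmt.Println(out[len(out)-1]) // the publish entry is applied last
}
```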

However, I think this default readiness check is not good enough. The upcoming /readyz and gRPC changes should further guarantee that etcd is ready to receive traffic.
