TLA+ spec and trace validation for raft consensus algorithm in etcd implementation #111
PR ref #112 |
Thanks for proposing this @joshuazh-x, this looks very useful for validating correctness. A couple of questions from me: |
cc @tbg @Elbehery @pavelkalinnikov @erikgrinaker |
Thanks @joshuazh-x. +1 @serathius. If we plan to add/merge #112 and have it replace the current raft impl, we need to exhaustively test it first. Also, I do not think maintaining two separate impls is a good idea. IIUC, many projects are based on https://github.com/etcd-io/raft, not only etcd |
One idea: generate the TLA+ spec from a standalone Go package. The spec would be broken down into a bunch of exported constants. We can then reference these constants from the implementation, to complement the documentation. For example, suppose the spec has something like this:

```go
const HandleAppendEntriesRequest = `
\* Server i receives an AppendEntries request from server j with
\* m.mterm <= currentTerm[i]. This just handles m.entries of length 0 or 1, but
\* implementations could safely accept more by treating them the same as
\* multiple independent requests of 1 entry.
\* @type: (Int, Int, AEREQT) => Bool;
HandleAppendEntriesRequest(i, j, m) ==
    LET logOk == \/ m.mprevLogIndex = 0
                 \/ /\ m.mprevLogIndex > 0
                    /\ m.mprevLogIndex <= Len(log[i])
                    /\ m.mprevLogTerm = log[i][m.mprevLogIndex].term
    IN
       /\ m.mterm <= currentTerm[i]
       /\ \/ RejectAppendEntriesRequest(i, j, m, logOk)
          \/ ReturnToFollowerState(i, m)
          \/ AcceptAppendEntriesRequest(i, j, logOk, m)
       /\ UNCHANGED <<candidateVars, leaderVars>>
`

// Spec is the complete TLA+ spec that can be exported and
// run through the TLA+ model checker / tooling.
const Spec = Thing1 +
	HandleAppendEntriesRequest +
	Thing2 + ...
```

The implementation would then reference it:

```go
// ... somewhere in the weeds of the MsgApp handler ...
_ = tlaplus.HandleAppendEntriesRequest
```

This is convenient to work with in IDEs, where you can jump to the definition of this constant at a keystroke. |
IMO, the current implementation has a monolithic design that makes maintenance hard either way. To reduce the maintenance cost, we need to separate the core of the algorithm from the implementation details that don't impact correctness. The core algorithm won't change much, and would be paired with the TLA+ spec for assurance. Implementation details (like buffer sizes, various quota and retry aspects) don't need a spec, and are also more prone to changes. |
Thank you very much for your feedback and suggestions. @serathius I believe your first two questions can be addressed by the trace validation in another PR, #113. PR #112 is actually a subset of PR #113, covering purely the algorithm itself; PR #113 is more about how to guarantee that the implementation matches the TLA+ spec. I thought it might be good to separate them into different issues, as I have only tested it in etcd for now, but since you raise this good question, I think we can discuss them together. IMO, the correctness of the service includes two parts: the correctness of the algorithm, and the alignment between implementation and algorithm. In PR #113 we inject multiple trace log points at the places in the code where state transitions are made (e.g. SendAppendEntries, BecomeLeader, etc.). The trace validation spec uses these traces as a state space constraint so that it walks the state machine the same way the service did. If a trace indicates any state or transition that cannot happen in the state machine, it implies a mismatch between the model and the implementation; if the model is already verified by the TLC model checker, then most likely it is the implementation that has an issue. And regarding the third ask, yes, I'll draft a readme to outline how to use this tool. |
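(To make the trace log point idea concrete, here is a minimal Go sketch; `traceEvent`, `emitTrace`, and the JSON layout are illustrative assumptions, not the actual hooks in PR #113.)

```go
package raft

import (
	"encoding/json"
	"os"
)

// traceEvent is a hypothetical record of one state transition. The real
// trace points in PR #113 sit at transitions such as becoming leader or
// sending AppendEntries, and emit one record per transition so that the
// TLA+ trace validation spec can replay them as a state-space constraint.
type traceEvent struct {
	Name  string `json:"name"`  // e.g. "BecomeLeader", "SendAppendEntries"
	Node  uint64 `json:"node"`  // raft node ID
	Term  uint64 `json:"term"`  // current term at the transition
	Index uint64 `json:"index"` // relevant log index, if any
}

// emitTrace writes one JSON object per line, which keeps the trace
// trivially parseable by external tooling.
func emitTrace(ev traceEvent) {
	b, err := json.Marshal(ev)
	if err != nil {
		return // tracing must never affect the protocol itself
	}
	os.Stderr.Write(append(b, '\n'))
}

// Example call site, inside a becomeLeader transition:
//
//	emitTrace(traceEvent{Name: "BecomeLeader", Node: r.id, Term: r.Term})
```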
@Elbehery the TLA+ spec is not a new implementation, nor is it meant to replace the existing raft lib. You can consider it the design (in the form of a formal spec) of the current raft implementation, and we can use the TLC tool to verify its correctness. |
@pavelkalinnikov yes, we don't need to put all implementation details into the spec; just the states and transitions of the basic algorithm are enough, and this part does not change often. But the question from @serathius is also important: that is something we eventually need to assure for the correctness of services running on top of raft. That's why we propose TLA+ trace validation in PR #113. |
I like the idea of tracing and integrating with TLA+ for validation. Thank you for doing this. Although maintaining two implementations is challenging (only at times when code changes conflict with the spec, though, which should be rare; moreover, it's a good thing - there should be some resistance to risky changes), TLA+ validation would add an extra safety net around the current implementation. The spec will not rot if it's integrated with the code this way. One component that would be great to have is randomized testing. We have a datadriven test harness that makes testing deterministic and reproducible, but we would greatly benefit if these tests were randomized too (but still reproducible). If we then feed the random traces into TLA+ checks, and run this whole thing for hours, it will massively boost the coverage. One request for this TLA+ integration: it should be "free" w.r.t. the production code. Please hide it under cheap checks, and/or build tags like:

```go
if tla.Enabled { // this compiles to "if false" and optimizes away in prod
	traceEvent(...)
}
```

```go
// +build with_tla

package tla

const Enabled = true

// all helpers here
```

```go
// +build !with_tla

package tla

const Enabled = false

// all helpers are no-ops
```

Consider also that we already have a similar form of tracing, which is similarly fed into tests and used for deterministic validation. E.g. see an example test output. Consider integrating the two mechanisms. |
I chatted with both Lamport and Yuan in 2015 to discuss this issue (https://www.microsoft.com/en-us/research/publication/specifying-and-verifying-systems-with-tla/). Here is the "conclusion" from them:
I guess we can probably define this mapping from Go to TLA+ as well, considering the current Go raft implementation is relatively small and only uses a very constrained subset of Go libraries. But I am not sure if it is worth the effort. Or maybe we just need to manually maintain the two different implementations and use TLA+ to guide the thought process. |
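(As an illustration of what such a Go-to-TLA+ state mapping could look like, here is a hedged sketch; `tlaState`, `raftNode`, and `snapshot` are hypothetical names, and the real spec tracks more variables.)

```go
package tlamap

// tlaState mirrors the per-server variables of the Ongaro-style raft
// spec (currentTerm, state, log, commitIndex) so that a snapshot of the
// implementation can be compared against a TLA+ state.
type tlaState struct {
	CurrentTerm uint64
	State       string // "Follower" | "Candidate" | "Leader"
	Log         []tlaEntry
	CommitIndex uint64
}

// tlaEntry carries only what the spec tracks per log entry: the term.
type tlaEntry struct {
	Term uint64
}

// raftNode is a hypothetical, minimal stand-in for the implementation's
// node state; the real raft struct carries far more detail.
type raftNode struct {
	term     uint64
	state    string
	logTerms []uint64 // term of each log entry
	commit   uint64
}

// snapshot converts an implementation-level view of a node into the
// spec-level state, dropping every detail the spec does not model.
func snapshot(n *raftNode) tlaState {
	entries := make([]tlaEntry, len(n.logTerms))
	for i, t := range n.logTerms {
		entries[i] = tlaEntry{Term: t}
	}
	return tlaState{
		CurrentTerm: n.term,
		State:       n.state,
		Log:         entries,
		CommitIndex: n.commit,
	}
}
```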
Hello, I'm the author of trace validation in CCF and a member of the TLA+ project.
Trace Validation (TV) essentially adopts this approach by mapping recorded implementation states to TLA+ specification states. This method is practical as it derives implementation states directly from the execution of the implementation, avoiding the complexities of symbolic interpretation of the source code. However, TV is not exhaustive. It verifies that only a certain subset of all potential system executions aligns with the behaviors defined in the specification. Nevertheless, this is not necessarily a limitation if the chosen subset of system executions is sufficiently diverse. For example, in CCF, among other discrepancies, we were able to find two significant data-loss issues using TV with just about a dozen system executions.
In CCF, this feature is still on our roadmap. However, manually translating TLA+ counterexamples (violations of the spec's correctness properties) into inputs for the implementation was straightforward. |
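(A minimal sketch of the replay side of trace validation, under the assumption of simple in-memory events; all names are hypothetical, and the real CCF/etcd tooling constrains the TLC model checker's next-state relation rather than hand-rolling checks like this.)

```go
package tv

import "fmt"

// event is one recorded implementation transition, as emitted by the
// trace points (names hypothetical).
type event struct {
	Name string
	Node uint64
	Term uint64
}

// enabled reports whether the spec would allow transition ev given the
// last observed term of each node. This toy check covers a single
// invariant; a real trace validator delegates the question to TLC.
func enabled(terms map[uint64]uint64, ev event) bool {
	switch ev.Name {
	case "BecomeLeader":
		return ev.Term >= terms[ev.Node] // terms never go backwards
	default:
		return true // transitions this sketch doesn't check
	}
}

// replay walks one recorded execution; the first disallowed step marks a
// mismatch between model and implementation (or a bug in one of them).
func replay(trace []event) error {
	terms := map[uint64]uint64{}
	for i, ev := range trace {
		if !enabled(terms, ev) {
			return fmt.Errorf("step %d: %+v is not allowed by the spec", i, ev)
		}
		terms[ev.Node] = ev.Term
	}
	return nil
}
```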
Updated PR #113 to add readme and build tag. @serathius @pavelkalinnikov |
Things I got from the discussion:
List of followups (please confirm I haven't missed anything)
Overall LGTM from me. Would also request official LGTMs from people active in the project to ensure we get consensus: @ahrtr @Elbehery @pavelkalinnikov @erikgrinaker |
ping @ahrtr @Elbehery @pavelkalinnikov @erikgrinaker |
LGTM |
Thanks for raising this issue. |
I think this is very useful work, can we find some low-risk path forward and not stall this work / lose the momentum entirely?

Do I understand correctly that the actual code change should be minimal? My understanding is that it boils down to a dozen or so "tracing" calls in key places, and the whole thing is active only in tests. From this point of view, this work can be done with low risk (does this answer the "controlled" part?).

Some part of this work is "dev infra" to enable checks in CI. I think this part is skippable at the beginning while the author is working on the spec and periodically runs the validation. Providing instructions for running it manually should be fine.

Most of this work is the actual spec. Qualifying it obviously requires a reviewer who understands TLA+. In this group of people, is there anyone? For myself, I'm interested in learning this topic and started doing so a while ago. I can potentially devote a low-priority / hobby stream of effort to learning and reviewing this.

@joshuazh-x can you think of a way to make incremental progress / split #113 into steps, so that the spec is introduced piece by piece? For example, is it possible to do the spec just for election, and then add appends, etc.? The reviewer needs to "follow" some train of thought to not get lost in this PR. |
@joshuazh-x do you have an end-to-end prototype of this that you've been running for a while and that passed the checks? Or has it found some spec violations / bugs? The value of this work will be more apparent if it finds a real bug as a sneak peek :) But maybe the code is bug-free, in which case it's harder to sell this work; no spec violations is a good signal too. In that case, for demo purposes, you could inject some subtle bugs and demonstrate how the spec checker finds them. |
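(As a hypothetical example of such an injected bug, one could weaken raft's majority check and let the trace validation catch the resulting illegal election; `hasQuorum` is an illustrative stand-in, not the library's actual code.)

```go
package demo

// hasQuorum is a deliberately broken majority check, the kind of subtle
// injected bug trace validation should surface: with 4 peers, 2 votes
// now "win" an election, violating the quorum conditions behind the
// spec's ElectionSafety property.
func hasQuorum(votes, peers int) bool {
	return votes >= peers/2 // BUG: the correct check is votes > peers/2
}
```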
Please note that this work doesn't directly block work on v3.6, nor does it slow it down. As for priority, this effort contributes to correctness testing, which I'm always open to. I also see potential to use TLA+ validation to improve etcd membership testing, which is hard to test properly using robustness tests and has been causing us issues with the v2 deprecation. |
The actual spec in PR #113 is directly derived from Ongaro's original raft spec. We could even go as far as model-checking or proving that etcdraft formally refines Ongaro's spec, to minimize the cost of reviewing the spec. However, what this really means is that the review can be narrowed down to the specification of etcd's reconfiguration logic, which boils down to a handful of TLA+ actions (etcdraft.tla, lines 418 to 487). |
Thanks for the info, which gives us much more confidence. I will take a closer look at both the issue and PR later. |
I have a PoC that enables trace validation in etcd's test suites, basically collecting protocol traces and validating them using the TLA+ spec. I now realize it might be better to add a randomized test with trace validation here in the raft repo. I'm working on this and will update the PR when it is ready. I personally see trace validation as a good tool for validating the correctness of an implementation against a proven model. For now the trace validation only covers the core raft algorithm (including reconfiguration), so I don't expect bugs to be found here. But trace validation itself is extensible: any application based on raft can build its own trace validation on top of raft's trace validation, so that its own specific features are covered. One example off the top of my head is etcd's lease manager, which I heard would need a redesign to address a consensus issue. |
I added a test case that feeds random faults to a cluster and generates trace logs for trace validation. I also updated the readme to include instructions for running trace validation (last part of the readme file). To generate test trace logs, run TestRunClustersWithRandomFaults with the with_tla flag and the environment variable TEST_RACE_LOG set to a folder in which to save trace logs. |
There have been multiple formal specifications of the raft consensus algorithm in TLA+, following Diego Ongaro's Ph.D. dissertation. However, I have not seen a version that aligns with the raft library implemented in etcd-io/raft, which we know differs from the original raft algorithm in some behaviors, e.g., reconfiguration.
etcd and many other applications based on this raft library have been running well for a long time, but I feel it would still be worthwhile to write a TLA+ spec for this specific implementation. It serves not just to verify the correctness of the model, but also as a foundation for follow-up work on model-based trace validation.
The spec is based on George Pîrlea's raft TLA+ spec and inspired by the innovative work in Microsoft CCF's TLA+ spec.