Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spec: Consensus Write-Ahead Log (WAL) #469

Open
3 tasks
Tracked by #578
cason opened this issue Oct 16, 2024 · 13 comments · May be fixed by #785
Open
3 tasks
Tracked by #578

spec: Consensus Write-Ahead Log (WAL) #469

cason opened this issue Oct 16, 2024 · 13 comments · May be fixed by #785
Assignees
Labels
spec Related to specifications sync Synchronization protocols

Comments

@cason
Copy link
Contributor

cason commented Oct 16, 2024

In order to support the crash-recovery failure model (#578), the consensus implementation should persist all relevant events that have lead it to its current state. When recovering from a crash, the implementation is initialized from its initial state of the latest active height H, then has to replay all the information persisted before it has crashed.

The valid events processed by the consensus implementation are therefore typically persisted in a Write-Ahead Log (WAL), an append-only log that was originally conceived to ensure atomicity of transactions in databases.

Definition of Done

  • Define exactly which events should be persisted to the WAL
  • Define the strategy for writing data to the WAL (e.g., synchronously versus asynchronously)
  • Define the procedure for replaying the content of the WAL when a node is (re)started
@cason cason added spec Related to specifications work in progress Work in progress labels Oct 16, 2024
@cason cason changed the title spec: Consensus Write-Ahead Log (WAL) spec: consensus Write-Ahead Log (WAL) Oct 16, 2024
@cason cason added sync Synchronization protocols and removed work in progress Work in progress labels Nov 19, 2024
@cason
Copy link
Contributor Author

cason commented Nov 20, 2024

Define exactly which events should be persisted to the WAL

We need to persist all events, received from external components, that may lead to state-transition in the consensus logic.

More specifically:

  • Valid consensus messages: PROPOSAL, PREVOTE, and PRECOMMIT
    • We don't plan to store full proposed values v in the consensus WAL
    • So the PROPOSAL messages carry id(v) instead of v
    • Vote messages always carry id(v)
  • Expired timeouts: timeout_propose, timeout_prevote, and timeout_precommit
    • The implementation may, for optimization, cancel schedule timeouts when they become useless
    • Notice that storing a timeout expiration event that did not produce any effect is not a problem at all
    • The timeouts are going to be scheduled during the replay procedure, storing them speeds-up recovery
  • Events associated to the production or the receipt of proposed values
    • Full proposed values are not stored in the WAL, but by the dissemination logic
    • Once a value to be proposed by a process is produced (getValue()), the consensus logic is notified
    • Once a proposed value is received, processed, and validated (valid(v)), the consensus logic are notified
    • The consensus state machine reacts to those notifications or events, that therefore must be persisted to the WAL

@cason
Copy link
Contributor Author

cason commented Nov 20, 2024

Define the strategy for writing data to the WAL (e.g., synchronously versus asynchronously)

All the received events can be persisted to the WAL asynchronously, in a best effort manner. But once a set of input events lead to a state transition, with the potential production of an output, the WAL must be persisted in a synchronous way.

We define the following list of actions before which the WAL should be flushed (synchronously persisted):

  • A message is sent by the node
  • The process switches to a new round of consensus
  • The process decides a value in a height of consensus

The above listed actions are the result of receiving a number of events (inputs), which can be asynchronously written to the WAL. But once the resulting action is produced, we must be sure that all events that have lead to the action are persisted. In this way, when recovering from a crash, the node is able to produce, based on the same inputs, the exactly same actions.

@cason
Copy link
Contributor Author

cason commented Nov 20, 2024

Define the procedure for replaying the content of the WAL when a node is (re)started

When a height of consensus is (re)started, the consensus state machine should:

  1. Consume all the events present in the WAL, referring to the current height. (If the WAL is from a lower height, the WAL is reset/deleted, and a new WAL for the current height is created, otherwise, we continue with the points below)
  2. Produce all actions resulting from processing the persisted events
  3. Then start consuming external inputs, for instance, coming from the broadcast/gossip network

A relevant observation is that a height of consensus must be re-started after all the committed blocks are applied (see #580) and after the storage for produced or received full values is open and restored (see #579).

@josef-widder
Copy link
Member

The WAL should ensure that a recovered process has the following behavior:

  1. It reaches a state that it had been in when it crashed (or shortly before)
  2. While processing the WAL, the process should not send messages that are in conflict with messages that were sent before the crash (no double sign)

The only different to a correct process is that

  1. a recovering process my send the same message multiple times (typically no problem)
  2. while being down, it might have missed some incoming messages. So there are corner cases where in order to ensure progress, vote sync is needed.

@josef-widder
Copy link
Member

Observe that this is based on the fact that once a process locally has persisted a blockstore entry (block and commit) of height h, the process may ignore all messages from heights less than or equal to h from this point on. So persiting a point is a big synchronization event, while flushing on sending later are smaller synchronization events.

@cason
Copy link
Contributor Author

cason commented Nov 21, 2024

The WAL should ensure that a recovered process has the following behavior:

1. It reaches a state that it had been in when it crashed (or shortly before)

2. While processing the WAL, the process should not send messages that are in conflict with messages that were sent before the crash (no double sign)

A more precise definition of "shortly before" can be derived from item 2.

Namely, if an action was produced before crashing, consider the latest action produced. The state of the process after recovering must be the same as when it produced the latest action, or a later (successive) state. The internal state transitions that do not produce actions might be lost, namely the events that triggered them might not be synchronously persisted, therefore are lost. This is not a problem as long the "lost" events did not produce any external observable action.

@josef-widder
Copy link
Member

For the WAL we need to make sure that the driver is deterministic. So we need to review everything. In particular folds in Qunit. Also pendingInputs should be transformed from a set to a list.

@cason
Copy link
Contributor Author

cason commented Nov 21, 2024

  1. while being down, it might have missed some incoming messages. So there are corner cases where in order to ensure progress, vote sync is needed.

By vote sync we mean the protocol drafted in #576.

@romac romac changed the title spec: consensus Write-Ahead Log (WAL) spec: Consensus Write-Ahead Log (WAL) Dec 19, 2024
@cason
Copy link
Contributor Author

cason commented Jan 6, 2025

Besides of better documenting the solution, can we consider it solved?

@cason cason linked a pull request Jan 17, 2025 that will close this issue
3 tasks
@cason
Copy link
Contributor Author

cason commented Jan 23, 2025

Hey folks, do you remember what was the designed procedure to handle getValue() calls in a recovering process? How is this handled in the implementation?

In am particular interested in the case where a recovering process was the proposer of multiple rounds of a height before crashing, therefore potentially has provided multiple distinct values to consensus as a response/return to getValue().

Also in this case, broadcasting again values proposed in rounds that have failed is not really ideal, is it?

@josef-widder
Copy link
Member

I guess we talked about having another persistent blockstore, where we put the proposals seen/generated for the current height. In contrast to the real blockstore, the proposal store can be deleted in the next height.

@cason
Copy link
Contributor Author

cason commented Jan 24, 2025

I guess we talked about having another persistent blockstore, where we put the proposals seen/generated for the current height. In contrast to the real blockstore, the proposal store can be deleted in the next height.

Yes, we have this as part of #579. But how we handle the case where getValue() is called multiple times upon recovery? How the "application" knows which one to return? Does it re-broadcast every "potential" proposed value again upon recovery?

@romac, @ancazamfir, how this work in the implementation?

@cason
Copy link
Contributor Author

cason commented Jan 24, 2025

#579 (comment)

If this is the case, I would say that half of my question is solved. Namely, the "application" is able to know which proposed value to return in case of multiple rounds.

It remains open whether all the potential multiple proposal values are re-broadcast/streamed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec Related to specifications sync Synchronization protocols
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants