NEP-509: Stateless validation stage 0 #509

Merged
merged 30 commits into from
Jul 26, 2024

walnut-the-cat
Contributor

WIP

initial draft
@render

render bot commented Sep 19, 2023

Add validator role change section
@frol frol added WG-protocol Protocol Standards Work Group should be accountable S-draft/needs-author-revision A NEP in the DRAFT stage that needs an author revision. A-NEP A NEAR Enhancement Proposal (NEP). labels Nov 1, 2023
@frol
Collaborator

frol commented Nov 1, 2023

Hi @walnut-the-cat – thank you for starting this proposal. As the moderator, I labeled this PR as "Needs author revision" because we assume you are still working on it since you submitted it in "Draft" mode.

Please ping the @near/nep-moderators once you are ready for us to review it. We will review it again in early January, unless we hear from you sooner. We typically close NEPs that are inactive for more than two months, so please let us know if you need more time.

@walnut-the-cat walnut-the-cat self-assigned this Apr 12, 2024
wacban and others added 6 commits June 12, 2024 16:54
@bowenwang1996
Collaborator

As a working group member, I'd like to nominate @mfornet and @birchmd as SME reviewers for this NEP.

Contributor

@birchmd birchmd left a comment

As SME and working group member, I lean towards approving the NEP. It is exciting to see Near continue to push towards being a completely scalable protocol. Avoiding the complexity of fraud proofs while having an eye towards using ZK technology in the future is very clever. Thanks to everyone for their hard work on designing and implementing this large protocol change.

Shreyan Gupta and others added 2 commits June 18, 2024 14:14
@birchmd
Contributor

birchmd commented Jun 26, 2024

@mm-near The "40 years at 90% confidence" calculation was done by me.

It assumes that the attacker has just barely less than 1/3 of the total stake (so they cannot outright take over the protocol), which is about 197 million $NEAR as of today.

The calculation determines the probability of a shard assignment (recall that stake is converted to "mandates" and these are randomly assigned to shards) in which at least one shard has 2/3 of its assigned stake controlled by the attacker. In that case the attacker would be free to push an invalid state transition because it could sign the invalid state witness itself. With 68 mandates per shard and 6 shards total this probability is 8.6e-10.

Then we assume the shard assignments are independent so that we can model it as a Bernoulli process and see how many "trials" it would take before we have a "success" (i.e. how many random shard assignments are there before the attacker obtains a 2/3 majority in one shard). The probability of having m "failures" in a row in a Bernoulli process is (1-p)^m and we want that to happen with 90% confidence (a somewhat arbitrary value chosen by me), so we can have m = ln(0.9)/ln(1-p) trials. This works out to be around 122 million trials.

Now that we know the number of trials we can convert it into a time. At 1 trial per second that is almost 4 years, but at the time Bowen was suggesting to shuffle less often than every block. At 1 trial per 10 seconds we get almost 40 years, which is the number I reported.
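The arithmetic above can be reproduced with a short sketch. It uses only the values quoted in this comment (p = 8.6e-10 and 90% confidence); the variable and function names are illustrative, and a year is assumed to be 365.25 days.

```python
import math

p = 8.6e-10        # per-assignment probability quoted above
confidence = 0.9   # desired probability of seeing no bad assignment

# (1 - p)^m = confidence  =>  m = ln(confidence) / ln(1 - p)
# math.log1p(-p) evaluates ln(1 - p) accurately for very small p.
trials = math.log(confidence) / math.log1p(-p)

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def years(num_trials: float, seconds_per_trial: float) -> float:
    """Convert a number of shard assignments into elapsed years."""
    return num_trials * seconds_per_trial / SECONDS_PER_YEAR


print(f"{trials / 1e6:.1f} million trials")
print(f"{years(trials, 1):.1f} years at 1 trial per second")
print(f"{years(trials, 10):.1f} years at 1 trial per 10 seconds")
```

This gives roughly 122.5 million trials, which is a little under 4 years at one shuffle per second and a little under 40 years at one shuffle per 10 seconds.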

We can also do this calculation the other way though. If we take the 5 year timeline you propose, then we can convert that into a number of trials. Let's assume one trial per second since I think the current implementation does shuffle validators every block. Then that is around 157 million trials and we want to know, in our Bernoulli process, what is the probability of having at least 1 success within that many trials. This probability is 1 minus the probability that we have all those trials fail in a row, so 1 - (1-p)^N. This works out to be around 12.7%. So if someone controlled 197 million staked $NEAR for five years then there is a 12.7% chance that they would have the opportunity to push an invalid state transition. If we instead assume only 1 trial every 10 seconds then this probability reduces down to around 1.3%.
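The reverse calculation can be checked the same way. This is a minimal sketch using only the quoted p = 8.6e-10; the function name is illustrative and a year is assumed to be 365.25 days.

```python
import math

p = 8.6e-10                            # per-assignment probability quoted above
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def prob_at_least_one_success(num_years: float, seconds_per_trial: float) -> float:
    """Return 1 - (1 - p)^N for N independent shard assignments."""
    n = num_years * SECONDS_PER_YEAR / seconds_per_trial
    # expm1/log1p keep precision when p is tiny: 1 - exp(N * ln(1 - p))
    return -math.expm1(n * math.log1p(-p))


print(f"{prob_at_least_one_success(5, 1):.1%} over 5 years at 1 trial per second")
print(f"{prob_at_least_one_success(5, 10):.1%} over 5 years at 1 trial per 10 seconds")
```

The two results are about 12.7% and 1.3%, matching the figures in the paragraph above.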

If you keep the number of mandates per shard the same then this whole calculation does not change much as you increase the number of shards because the theory says that the dependency on the number of shards is not very strong after you have more than a few. So the base probability of 8.6e-10 should stay close to the same for any number of shards. But note that increasing the number of shards while keeping the number of mandates per shard the same means increasing the total number of mandates.

@victorchimakanu

victorchimakanu commented Jun 27, 2024

NEP Status (Updated by NEP Moderators)

Status: VOTING

SME reviews:

Protocol Work Group voting indications (❔ | 👍 | 👎 ):

@pugachAG
Contributor

pugachAG commented Jul 2, 2024

@mm-near

> Do we have info on how big the state witnesses are going to be on average? (based on current traffic patterns)

Please find the metrics based on the current mainnet traffic for a window of 12 hours.

max witness size

Max witness size affects chunk validation latency.
[Image: screenshot of the max witness size metric, taken 2024-07-02 18:35:14]

avg witness size

Avg witness size determines additional chunk validation network usage.
[Image: screenshot of the avg witness size metric, taken 2024-07-02 18:37:04]

github-merge-queue bot pushed a commit to near/nearcore that referenced this pull request Jul 2, 2024
# Feature to stabilize
This PR stabilizes the Congestion Control and Stateless Validation
protocol features. They are assigned separate protocol features and the
protocol upgrades should be scheduled separately.

# Context

* near/NEPs#539
* near/NEPs#509

# Testing and QA
Those features are well covered in unit, integration and end to end
tests and were extensively tested in forknet and statelessnet.

# Checklist
- [x] Link to nightly nayduck run (`./scripts/nayduck.py`,
[docs](https://github.com/near/nearcore/blob/master/nightly/README.md#scheduling-a-run)):
https://nayduck.nearone.org/
- [x] Update CHANGELOG.md to include this protocol feature in the
`Unreleased` section.
@Longarithm
Member

Longarithm commented Jul 2, 2024

@mm-near the latency you mentioned matches the existing one.

Before: the block producer (BP) sends a block quickly after receiving chunks, but the block is validated only after other block producers apply all of its chunks, since that was their only way to validate the chunks in the block. So the next block is produced only after the previous chunks have been applied.
After: the BP additionally has to wait for endorsements from chunk validators (CVs). But this is equivalent to waiting for the previous chunks to be applied; the application is now performed by CVs based on the state witness. After that, the block is quickly validated by verifying the endorsement signatures.

Also, BPs and chunk producers (CPs) are also CVs, so the stake on chunk validation remains large. Memtrie is much faster than the disk trie, which compensates for the network latency of sending state witnesses and endorsements.


UPD: the actual additional latency is introduced on the chunk producer side: near/nearcore#10584

In short: to produce chunk N, the CP must apply chunk N-1, for which the BP must produce block N-1, for which CVs must validate (= apply) chunk N-1. So the application of chunk N-1 appears twice.
But again, we expect the speedup in chunk application to outweigh that.


Side notes:

  • If BP tracks shard, it will apply chunks for it, but it doesn't block receiving other blocks.
  • Latency of user waiting for transaction outcome shouldn't change.

Let's say only one shard is touched by the transaction. To get the outcome, we query an RPC node which tracks the touched shard.
If the RPC node tracks the shard, it applies chunks from blocks immediately without waiting for endorsements, because chunk application is deterministic.
Chunks are validated on chain by endorsements in the next block with a chunk, but if the user is optimistic, they can just rely on the RPC node's response.

wacban added a commit to near/nearcore that referenced this pull request Jul 3, 2024
marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request Jul 3, 2024
@shreyan-gupta

@mm-near

> for Reed Solomon Erasure encoding - do we still plan to send it to all the block producers (for all the shards?)

The main purpose of the Reed Solomon Erasure encoding for state witness is to reduce the load on the chunk producer for distributing the state witness. The recipients of the state witness are all the chunk validators, and they are the ones who participate in the partial witness forward and not block producers.

This way we don't put too much network load on the block producers and the network load is localized to the chunk validators. Nodes that have a higher number of mandates are validators for multiple shards.
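To illustrate why erasure coding reduces the load on the chunk producer, here is a rough upload estimate. It is a simplified model, not the nearcore implementation: it assumes the witness is encoded into one part per chunk validator, that `data_parts` of those parts are enough to reconstruct it, and that each validator forwards its own part to every other validator. The function name and all numbers are hypothetical.

```python
def upload_per_node(witness_bytes: int, num_validators: int, data_parts: int):
    """Estimate upload volume in bytes under the simplified model above.

    Returns a tuple:
      direct            - producer sends the full witness to every validator
      producer_coded    - producer sends one encoded part to every validator
      validator_forward - one validator forwards its part to all the others
    """
    # Each encoded part is roughly witness_size / data_parts (ceil division).
    part_bytes = -(-witness_bytes // data_parts)
    direct = witness_bytes * num_validators
    producer_coded = part_bytes * num_validators
    validator_forward = part_bytes * (num_validators - 1)
    return direct, producer_coded, validator_forward


# Hypothetical numbers: 3 MB witness, 50 chunk validators, 30 data parts.
direct, producer_coded, validator_forward = upload_per_node(3_000_000, 50, 30)
print(direct, producer_coded, validator_forward)
```

With these made-up inputs the producer uploads 5 MB instead of 150 MB, and each validator uploads about 4.9 MB while forwarding, so the work is spread across the chunk validators rather than concentrated on the producer.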

Member

@mfornet mfornet left a comment

As a Protocol WG member, I lean towards approving this proposal since it is a necessary step towards effective sharding.

My main concern is about chunk validators:

In this approach, I'm concerned with chunk validators' incentives to validate new chunks. As I understand from this document, the optimal strategy for individual chunk validators is to accept every chunk. As long as there is one honest chunk validator, work is not needed, and they don't get penalized for incorrectly endorsing an invalid chunk.

### Assumptions

* Not more than 1/3 of validators (by stake) is corrupted.
* In memory trie is enabled - [REF](https://docs.google.com/document/d/1_X2z6CZbIsL68PiFvyrasjRdvKA_uucyIaDURziiH2U/edit?usp=sharing)
Member

Should we move the content of the linked document to neps/assets in this repository, in case the current link gets broken for some reason?

Comment on lines +211 to +213
As we pointed out above, current formula `chunk_validator_quality_ratio` is problematic.
Here it brings even a bigger issue: if chunk producers don't produce chunks, chunk validators will be kicked out as well, which impacts network stability.
This is another reason to come up with the better formula.
Member

Chunk validators can collude and not endorse some chunks in a way that some chunk producers or other chunk validators get kicked out by not getting their chunks included.

Member

But this is more relevant to the chunk endorsement process, not chunk validator kickouts/rewards.

And the base assumption of the new approach is that the event "1/3 of a chunk's validators collude" implies "1/3 of all validators collude" with high probability, so in this case the base blockchain security assumption on which we rely fails.

walnut-the-cat and others added 8 commits July 8, 2024 08:13
Co-authored-by: Marcelo Fornet <[email protected]>
@Longarithm
Member

@mfornet I answered the chunk validator-related comments.

Yeah, this is a known problem. We discussed it a couple of times. One idea was to introduce "honeypot state witnesses", the goal of which would be to verify that state witnesses can get invalidated, and penalise validators for blind approvals.

However, the counterarguments are that

  • we have other places in the consensus where blind approval is not penalised at all - e.g. nothing prevents block validators from endorsing all blocks
  • for small chunk validators the effect is negligible; for a bigger stake behind blind approvals the situation is effectively the same as many validators colluding, which we can't control anyway.

So any of these solutions would introduce additional complexity (which is already very substantial), and the benefit has not become clear.

@walnut-the-cat walnut-the-cat marked this pull request as ready for review July 19, 2024 17:31
@walnut-the-cat walnut-the-cat requested a review from a team as a code owner July 19, 2024 17:31
@flmel flmel added S-approved A NEP that was approved by a working group. and removed S-review/needs-sme-review A NEP in the REVIEW stage is waiting for Subject Matter Expert review. labels Jul 22, 2024
@flmel
Member

flmel commented Jul 26, 2024

Thank you to everyone who attended the Protocol Work Group meeting! The working group members reviewed the NEP and reached the following consensus:

Status: Approved (Meeting Recording: https://youtu.be/058BZEyXzgU)

@walnut-the-cat Thank you for authoring this NEP

@birchmd @mfornet Thank you for the review!

@flmel flmel merged commit 49b56d1 into master Jul 26, 2024
4 checks passed
@flmel flmel deleted the state-validation branch July 26, 2024 15:02
Labels
A-NEP A NEAR Enhancement Proposal (NEP). S-approved A NEP that was approved by a working group. WG-protocol Protocol Standards Work Group should be accountable
Projects
Status: APPROVED NEPs