NEP-508: Resharding v2 (#508), merged Jan 22, 2024 (29 commits)
---
NEP: 508
Title: Resharding v2
Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong
Status: Draft
DiscussionsTo: https://github.com/near/nearcore/issues/8992
Type: Protocol
Version: 1.0.0
Created: 2023-09-19
LastUpdated: 2023-09-19
---

## Summary

This proposal introduces a new implementation for resharding and a new shard layout for the production networks.

In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards.

We are introducing resharding v2, which supports splitting one shard into two within one epoch at a predetermined split boundary. The NEP includes the performance improvements needed to make resharding feasible at the current state size, as well as the actual resharding in mainnet and testnet (specifically, splitting shard 3 into two).

While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for the foreseeable future, it does not serve all use cases, such as dynamic resharding.

## Motivation

Currently, the NEAR protocol has four shards. With more partners onboarding, we have started seeing that some shards occasionally become overcrowded. In addition, with state sync and stateless validation, validators will not need to track all shards, and validator hardware requirements can be greatly reduced with smaller shard sizes.

## Specification

### High level assumptions

* Flat state is enabled.
* The shard split boundary is predetermined. In other words, the necessity of shard splitting is decided manually.
* Merkle Patricia Trie is the underlying data structure for the protocol state.
* The minimal epoch gap between two resharding events is X.
* Some form of State Sync (centralized or decentralized) is enabled.

### High level requirements

* Resharding should work even when validators stop tracking all shards.
* Resharding should work after stateless validation is enabled.
* Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
* ~~Resharding should not require additional hardware from nodes.~~
  * This needs to be assessed during testing.
* Resharding should be fault tolerant.
  * The chain must not stall in case of resharding failure. TODO - this seems impossible under the current assumptions, because the shard layout for an epoch is committed to the chain before resharding is finished.
  * A validator should be able to recover in case it goes offline during resharding.
    * For now, our aim is to at least allow a validator to join back after resharding is finished.
* No transaction or receipt should be lost during resharding.
* Resharding should work regardless of the number of existing shards.
* There should be no remaining places (in any apps or tools) where the number of shards is hardcoded.

### Out of scope

* Dynamic resharding
  * automatically scheduling resharding based on shard usage/capacity
  * automatically determining the shard layout
  * merging shards
  * shard reshuffling
  * shard boundary adjustment
* Shard layout determination logic (shard boundaries are still determined offline and hardcoded)
* Advanced failure handling
  * If a validator goes offline during resharding, it can rejoin immediately and move forward, as long as enough time is left to perform resharding again.
  * TBD

### Required protocol changes

TBD. e.g. configuration changes we have to introduce

A new protocol version will be introduced specifying the new shard layout.

### Required state changes

TBD. e.g. additional/updated data a node has to maintain

* For the duration of the resharding, the node will need to maintain a snapshot of the flat state and related columns.
* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in both the old and the new layout at the same time.
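The dual bookkeeping can be sketched as follows. This is a minimal illustration in Rust; `ShardUId` and `tracked_shards` are simplified stand-ins, not the actual nearcore API.

```rust
// Sketch only: `ShardUId` mirrors the idea of (shard layout version, shard id),
// not the real nearcore definition.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

/// Shards a node must track during the epoch in which the state split
/// happens: all parents (current layout, `version`) and all children
/// (next layout, `version + 1`).
pub fn tracked_shards(version: u32, num_parents: u32, num_children: u32) -> Vec<ShardUId> {
    let mut shards: Vec<ShardUId> = (0..num_parents)
        .map(|shard_id| ShardUId { version, shard_id })
        .collect();
    shards.extend((0..num_children).map(|shard_id| ShardUId {
        version: version + 1,
        shard_id,
    }));
    shards
}
```

For example, splitting one of four shards means tracking four parent shards plus five child shards (nine shard states) during the transition epoch, on top of the flat state snapshot.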

### Resharding flow

TBD. how resharding happens at the high level

* The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation.
* In epoch T, the protocol version upgrade date will pass and nodes will vote to switch to the new protocol version, which contains the new shard layout.
* In the last block of epoch T, the EpochConfig for epoch T+2 will be set; it will contain the new shard layout.
* In epoch T+1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain until the end of the epoch.
* In epoch T+2, the chain will switch to the new shard layout.
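The timeline above can be summarized in a small sketch. This is illustrative only; the names `Phase` and `phase_for_epoch` are not part of nearcore.

```rust
// Illustrative sketch of the epoch timeline above; not the nearcore API.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Phase {
    /// Epoch T (and earlier): the vote passes, the old shard layout is still in use.
    OldLayout,
    /// Epoch T+1: nodes split the parent shards' state into children.
    StateSplit,
    /// Epoch T+2 and onwards: the chain runs on the new shard layout.
    NewLayout,
}

/// Which phase a given epoch is in, assuming the protocol upgrade is
/// voted in during `upgrade_epoch` (epoch T in the text above).
pub fn phase_for_epoch(epoch: u64, upgrade_epoch: u64) -> Phase {
    if epoch <= upgrade_epoch {
        Phase::OldLayout
    } else if epoch == upgrade_epoch + 1 {
        Phase::StateSplit
    } else {
        Phase::NewLayout
    }
}
```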

## Reference Implementation

The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.

### Flat Storage

The old implementation of resharding relied on iterating over the full state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time, but since then the state has grown considerably and it is now too slow to fit within a single epoch. The new implementation relies on flat storage to build the children shards more quickly. Based on benchmarks, splitting one shard using flat storage can take up to 15 minutes.

The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout. The old implementation didn't handle this case because flat storage didn't exist back then.

In order to ensure a consistent view of the flat storage while splitting the state, the node will maintain a snapshot of the flat state and related columns. The existing implementation of flat state snapshots used in State Sync will be adjusted for this purpose.
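A minimal sketch of the split itself, assuming flat state entries can be iterated as (account id, value) pairs. The real entries are keyed by trie key, and `split_flat_state` is a hypothetical helper, not the nearcore function.

```rust
/// Hedged sketch: split a parent shard's flat state into two children at a
/// predetermined boundary account. Real flat state entries are keyed by
/// trie key; here the account id is used directly for illustration.
pub fn split_flat_state(
    parent_entries: &[(String, String)],
    boundary_account: &str,
) -> (Vec<(String, String)>, Vec<(String, String)>) {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for (account_id, value) in parent_entries {
        // Accounts strictly below the boundary (lexicographically) go to
        // the left child; the boundary account and everything above it go
        // to the right child.
        if account_id.as_str() < boundary_account {
            left.push((account_id.clone(), value.clone()));
        } else {
            right.push((account_id.clone(), value.clone()));
        }
    }
    (left, right)
}
```

Because the iteration runs over the flat state snapshot rather than the trie, it is a single linear pass over the parent shard's key range.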

### Handling receipts, gas burnt and balance burnt

When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be handled correctly. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact that in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts, gas burnt and balance burnt that works for arbitrary splitting of shards, regardless of the previous shard layout.
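The generic reassignment can be illustrated with a small routing helper. This is hypothetical code, not the nearcore implementation; it only assumes that the new layout's boundary accounts are sorted.

```rust
/// Hedged sketch of the generic reassignment: map an account to its shard
/// index, given the sorted boundary accounts of the new layout. A shard
/// covers the half-open range [previous boundary, next boundary), so this
/// works for any number of shards, not just the single-parent case of
/// resharding v1.
pub fn shard_for_account(account_id: &str, boundary_accounts: &[&str]) -> usize {
    boundary_accounts
        .iter()
        .take_while(|boundary| account_id >= **boundary)
        .count()
}
```

Receipts, together with their associated gas burnt and balance burnt, can then be bucketed by the shard of the receiver account, regardless of what the previous shard layout looked like.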

### New shard layout

A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - shard 3 - will be split by introducing a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that divides the shard roughly in half according to those metrics. Other metrics can also be used.
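As an illustration of the boundary selection, here is a hedged sketch that picks the first account at which the cumulative usage crosses half of the shard total. The account names and weights are invented for the example; the real analysis works over measured storage and gas usage.

```rust
/// Hedged sketch of choosing a split boundary: given accounts sorted by
/// account id, each with a usage weight (e.g. gas or storage), pick the
/// account before which roughly half of the total usage has accumulated.
pub fn pick_boundary<'a>(accounts: &[(&'a str, u64)]) -> &'a str {
    let total: u64 = accounts.iter().map(|(_, weight)| weight).sum();
    let mut cumulative = 0u64;
    for pair in accounts.windows(2) {
        cumulative += pair[0].1;
        if cumulative * 2 >= total {
            // Split before the next account: everything accumulated so far
            // lands in the left child.
            return pair[1].0;
        }
    }
    // Degenerate case (zero or one account): fall back to the last one.
    accounts.last().expect("no accounts").0
}
```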

### Fixed shards

Fixed shards is a feature of the protocol that allows assigning specific accounts, and all of their recursive sub-accounts, to a predetermined shard. This feature was only used for testing; it was never used in production and there is no need for it there. Unfortunately, fixed shards break the contiguity of shards: a sub-account of a fixed-shard account can fall in the middle of the account range that belongs to a different shard. This property makes fixed shards particularly hard to reason about and makes efficient resharding hard to implement.

The removal of fixed shards was implemented ahead of this NEP.

### Transaction pool

The transaction pool is sharded, i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by ShardId. Unfortunately, ShardId is insufficient to correctly identify a shard across a resharding event, as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead, and a transaction pool resharding was implemented to reassign transactions from the parent shard to the children shards right before the new shard layout takes effect.

This was implemented ahead of this NEP.
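A minimal sketch of the migrated pool. The types are simplified stand-ins for the nearcore ones, and `reshard` models the reassignment that happens right before the switch to the new layout.

```rust
use std::collections::HashMap;

// Sketch only: simplified stand-ins for the nearcore types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Transaction {
    pub signer: String,
}

/// A transaction pool grouped by ShardUId rather than ShardId, so that
/// entries from before and after a resharding never collide.
#[derive(Default)]
pub struct ShardedTxPool {
    pub pools: HashMap<ShardUId, Vec<Transaction>>,
}

impl ShardedTxPool {
    pub fn insert(&mut self, shard: ShardUId, tx: Transaction) {
        self.pools.entry(shard).or_default().push(tx);
    }

    /// Right before the new shard layout takes effect, move the parent
    /// shard's transactions into the child shards. `route` maps a
    /// transaction to an index into `children` (e.g. by signer account).
    pub fn reshard(
        &mut self,
        parent: ShardUId,
        children: &[ShardUId],
        route: impl Fn(&Transaction) -> usize,
    ) {
        for tx in self.pools.remove(&parent).unwrap_or_default() {
            let child = children[route(&tx)];
            self.pools.entry(child).or_default().push(tx);
        }
    }
}
```

Keying by (version, shard id) means a parent and its child can share a shard id without their pooled transactions ever being mixed up.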

## Security Implications

[Explicitly outline any security concerns in relation to the NEP, and potential ways to resolve or mitigate them. At the very least, well-known relevant threats must be covered, e.g. person-in-the-middle, double-spend, XSS, CSRF, etc.]

## Alternatives

* Why is this design the best in the space of possible designs?
  * This design is the simplest, most robust and safest option that meets all of the requirements.
* What other designs have been considered and what is the rationale for not choosing them?
  * Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster, but it is more complex and would take longer to implement. We opted for the much simpler flat-storage-based approach, given that it is already quite performant.
  * Changing the trie structure to put the account id first and the type of record later. This change would allow much faster resharding by only iterating over the nodes on the boundary. It has two major drawbacks: 1) it would require a massive migration, and 2) we would need to maintain both the old and the new trie structure forever.
  * Changing the storage structure so that the storage key has the format account_id.node_hash. This structure would make it much easier to split the trie at the storage level, because the children shards are simple sub-ranges of the parent shard. Unfortunately, we found that the migration would not be feasible.
  * Changing the storage structure so that the key is only the node_hash. This is a feasible approach, but it adds complexity to garbage collection and data deletion. We opted for the much simpler option of keeping the existing scheme of prefixing storage entries by shard uid.
* What is the impact of not doing this?
  * We need resharding in order to scale up the system. Without resharding, shards would eventually grow so big (in either storage or CPU usage) that a single node would not be able to handle them.

## Integration with State Sync

There are two known issues in the integration of resharding and state sync:

* When syncing the state for the first epoch where the new shard layout is used. In this case the node needs to apply the last block of the previous epoch. This cannot be done on the children shards, as on chain the block was applied on the parent shards and the trie-related gas costs would be different.
* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of that shard, but it is generated on the parent shard layout and so may not be verifiable.
In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above-mentioned issues separately.

## Integration with Stateless Validation

Stateless Validation requires that chunk producers provide proofs of correctness of the state transition from one state root to another. For the first block after the new shard layout takes effect, that proof will need to show that the entire state split was correct, in addition to the state transition itself.

In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above-mentioned issues separately.

## Future possibilities

As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes, but is not limited to, the following:

* automatic determination of split boundary
* automatic shard splitting and merging based on traffic

Other useful features that can be considered as a follow up:

* account colocation for low latency across account calls
* removal of shard uids and introducing globally unique shard ids
* shard on demand

## Consequences

### Positive

* Workload across shards will be more evenly distributed.
* Required space to maintain state (either in memory or in persistent disk) will be smaller.
* State sync overhead will be smaller.
* TBD

### Neutral

* Number of shards is expected to increase.
* Underlying trie structure and data structure are not going to change.
* Resharding will create dependency on flat storage, flat state snapshots and state sync. TODO - what dependency on state sync?

### Negative

* The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be done manually.
* During resharding, a node is expected to do more work, as it will first need to copy a lot of data around and then will have to apply changes twice (for the current shard and the future shard).
* Increased potential for apps and tools to break without proper shard layout change handling.

### Backwards Compatibility

We do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity for replayability. For instance, as the target shard of a receipt and the shard an account belongs to can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary.

## Unresolved Issues (Optional)

[Explain any issues that warrant further discussion. Considerations

* What parts of the design do you expect to resolve through the NEP process before this gets merged?
* What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
* What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?]

## Changelog

[The changelog section provides historical context for how the NEP developed over time. Initial NEP submission should start with version 1.0.0, and all subsequent NEP extensions must follow [Semantic Versioning](https://semver.org/). Every version should have the benefits and concerns raised during the review. The author does not need to fill out this section for the initial draft. Instead, the assigned reviewers (Subject Matter Experts) should create the first version during the first technical review. After the final public call, the author should then finalize the last version of the decision context.]

### 1.0.0 - Initial Version

> Placeholder for the context about when and who approved this NEP version.

#### Benefits

> List of benefits filled by the Subject Matter Experts while reviewing this version:

* Benefit 1
* Benefit 2

#### Concerns

> Template for Subject Matter Experts review for this version:
> Status: New | Ongoing | Resolved

| # | Concern | Resolution | Status |
| --: | :------ | :--------- | -----: |
| 1 | | | |
| 2 | | | |

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).