NEP-508: Resharding v2 (#508), merged Jan 22, 2024 (29 commits)
---
NEP: 508
Title: Resharding v2
Authors: Waclaw Banasik, Shreyan Gupta, Yoon Hong
Status: Draft
DiscussionsTo: https://github.com/near/nearcore/issues/8992
Type: Protocol
Version: 1.0.0
Created: 2023-09-19
LastUpdated: 2023-09-19
---

## Summary

This proposal introduces a new implementation for resharding and a new shard layout for the production networks.

In essence, this NEP is an extension of [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md), which was focused on splitting one shard into multiple shards.

We are introducing resharding v2, which supports splitting one shard into two within one epoch at a predetermined split boundary. The NEP includes the performance improvements needed to make resharding feasible at the current state size, as well as the actual resharding in mainnet and testnet (specifically, splitting shard 3 into two).

While the new approach addresses critical limitations left unsolved in NEP-40 and is expected to remain valid for the foreseeable future, it does not serve all use cases, such as dynamic resharding.

## Motivation

Currently, the NEAR protocol has four shards. With more partners onboarding, we have started seeing that some shards occasionally become overcrowded. In addition, with state sync and stateless validation, validators will not need to track all shards, and validator hardware requirements can be greatly reduced with smaller shard sizes.

## Specification

### High level assumptions

* Flat state is enabled.
* The shard split boundary is predetermined. In other words, the necessity of shard splitting is decided manually.
* Merkle Patricia Trie is the underlying data structure for the protocol state.
* The minimal epoch gap between two resharding events is X.
* Some form of State Sync (centralized or decentralized) is enabled.

### High level requirements

* Resharding should work even when validators stop tracking all shards.
* Resharding should work after stateless validation is enabled.
* Resharding should be fast enough so that both state sync and resharding can happen within one epoch.
* ~~Resharding should not require additional hardware from nodes.~~
  * This needs to be assessed during testing.
* Resharding should be fault tolerant.
  * The chain must not stall in case of resharding failure. TODO - this seems impossible under the current assumptions, because the shard layout for an epoch is committed to the chain before resharding is finished.
  * A validator should be able to recover in case it goes offline during resharding.
    * For now, our aim is to at least allow a validator to join back after resharding is finished.
* No transaction or receipt should be lost during resharding.
* Resharding should work regardless of the number of existing shards.
* There should be no remaining places (in any apps or tools) where the number of shards is hardcoded.

### Out of scope

* Dynamic resharding
  * automatically scheduling resharding based on shard usage/capacity
  * automatically determining the shard layout
  * merging shards
  * shard reshuffling
  * shard boundary adjustment
* Shard layout determination logic (shard boundaries are still determined offline and hardcoded)
* Advanced failure handling
  * If a validator goes offline during resharding, it can rejoin immediately and move forward, as long as enough time is left to perform resharding again.
  * TBD

### Required protocol changes

TBD. e.g. configuration changes we have to introduce

A new protocol version will be introduced specifying the new shard layout.

### Required state changes

TBD. e.g. additional/updated data a node has to maintain

* For the duration of the resharding, the node will need to maintain a snapshot of the flat state and related columns.
* For the duration of the epoch before the new shard layout takes effect, the node will need to maintain the state and flat state of shards in both the old and the new layout at the same time.
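The dual bookkeeping can be sketched as follows. This is a minimal illustration in Rust; `ShardUId` and `tracked_shards` are simplified stand-ins, not the actual nearcore API.

```rust
// Sketch only: `ShardUId` mirrors the idea of (shard layout version, shard id),
// not the real nearcore definition.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

/// Shards a node must track during the epoch in which the state split
/// happens: all parents (current layout, `version`) and all children
/// (next layout, `version + 1`).
pub fn tracked_shards(version: u32, num_parents: u32, num_children: u32) -> Vec<ShardUId> {
    let mut shards: Vec<ShardUId> = (0..num_parents)
        .map(|shard_id| ShardUId { version, shard_id })
        .collect();
    shards.extend((0..num_children).map(|shard_id| ShardUId {
        version: version + 1,
        shard_id,
    }));
    shards
}
```

For example, splitting one of four shards means tracking four parent shards plus five child shards (nine shard states) during the transition epoch, on top of the flat state snapshot.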

### Resharding flow

TBD. how resharding happens at the high level

* The new shard layout will be agreed on offline by the protocol team and hardcoded in the neard reference implementation.
* In epoch T, the protocol version upgrade date will pass and nodes will vote to switch to the new protocol version, which contains the new shard layout.
* In the last block of epoch T, the EpochConfig for epoch T+2 will be set; it will contain the new shard layout.
* In epoch T+1, all nodes will perform the state split. The child shards will be kept up to date with the blockchain until the end of the epoch.
* In epoch T+2, the chain will switch to the new shard layout.
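The timeline above can be summarized in a small sketch. This is illustrative only; the names `Phase` and `phase_for_epoch` are not part of nearcore.

```rust
// Illustrative sketch of the epoch timeline above; not the nearcore API.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Phase {
    /// Epoch T (and earlier): the vote passes, the old shard layout is still in use.
    OldLayout,
    /// Epoch T+1: nodes split the parent shards' state into children.
    StateSplit,
    /// Epoch T+2 and onwards: the chain runs on the new shard layout.
    NewLayout,
}

/// Which phase a given epoch is in, assuming the protocol upgrade is
/// voted in during `upgrade_epoch` (epoch T in the text above).
pub fn phase_for_epoch(epoch: u64, upgrade_epoch: u64) -> Phase {
    if epoch <= upgrade_epoch {
        Phase::OldLayout
    } else if epoch == upgrade_epoch + 1 {
        Phase::StateSplit
    } else {
        Phase::NewLayout
    }
}
```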

## Reference Implementation

The implementation heavily re-uses the implementation from [NEP-40](https://github.com/near/NEPs/blob/master/specs/Proposals/0040-split-states.md). Below are listed only the major differences and additions.

### Flat Storage

The old implementation of resharding relied on iterating over the full state of the parent shard in order to build the state for the children shards. This implementation was suitable at the time, but since then the state has grown considerably and it is now too slow to fit within a single epoch. The new implementation relies on flat storage to build the children shards more quickly. Based on benchmarks, splitting one shard using flat storage can take up to 15 minutes.

The new implementation will also propagate the flat storage for the children shards and keep it up to date with the chain until the switch to the new shard layout. The old implementation didn't handle this case because flat storage didn't exist back then.

In order to ensure a consistent view of the flat storage while splitting the state, the node will maintain a snapshot of the flat state and related columns. The existing implementation of flat state snapshots used in State Sync will be adjusted for this purpose.
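A minimal sketch of the split itself, assuming flat state entries can be iterated as (account id, value) pairs. The real entries are keyed by trie key, and `split_flat_state` is a hypothetical helper, not the nearcore function.

```rust
/// Hedged sketch: split a parent shard's flat state into two children at a
/// predetermined boundary account. Real flat state entries are keyed by
/// trie key; here the account id is used directly for illustration.
pub fn split_flat_state(
    parent_entries: &[(String, String)],
    boundary_account: &str,
) -> (Vec<(String, String)>, Vec<(String, String)>) {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for (account_id, value) in parent_entries {
        // Accounts strictly below the boundary (lexicographically) go to
        // the left child; the boundary account and everything above it go
        // to the right child.
        if account_id.as_str() < boundary_account {
            left.push((account_id.clone(), value.clone()));
        } else {
            right.push((account_id.clone(), value.clone()));
        }
    }
    (left, right)
}
```

Because the iteration runs over the flat state snapshot rather than the trie, it is a single linear pass over the parent shard's key range.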

### Handling receipts, gas burnt and balance burnt

When resharding, extra care should be taken when handling receipts in order to ensure that no receipts are lost or duplicated. The gas burnt and balance burnt also need to be handled correctly. The old resharding implementation for handling receipts, gas burnt and balance burnt relied on the fact that in the first resharding there was only a single parent shard to begin with. The new implementation will provide a more generic and robust way of reassigning the receipts, gas burnt and balance burnt that works for arbitrary splitting of shards, regardless of the previous shard layout.
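The generic reassignment can be illustrated with a small routing helper. This is hypothetical code, not the nearcore implementation; it only assumes that the new layout's boundary accounts are sorted.

```rust
/// Hedged sketch of the generic reassignment: map an account to its shard
/// index, given the sorted boundary accounts of the new layout. A shard
/// covers the half-open range [previous boundary, next boundary), so this
/// works for any number of shards, not just the single-parent case of
/// resharding v1.
pub fn shard_for_account(account_id: &str, boundary_accounts: &[&str]) -> usize {
    boundary_accounts
        .iter()
        .take_while(|boundary| account_id >= **boundary)
        .count()
}
```

Receipts, together with their associated gas burnt and balance burnt, can then be bucketed by the shard of the receiver account, regardless of what the previous shard layout looked like.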

### New shard layout

A new shard layout will be determined and will be scheduled and executed in the production networks. The new shard layout will maintain the same boundaries for shards 0, 1 and 2. The heaviest shard today - shard 3 - will be split by introducing a new boundary account. The new boundary account will be determined by analysing the storage and gas usage within the shard and selecting a point that divides the shard roughly in half according to those metrics. Other metrics can also be used.
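As an illustration of the boundary selection, here is a hedged sketch that picks the first account at which the cumulative usage crosses half of the shard total. The account names and weights are invented for the example; the real analysis works over measured storage and gas usage.

```rust
/// Hedged sketch of choosing a split boundary: given accounts sorted by
/// account id, each with a usage weight (e.g. gas or storage), pick the
/// account before which roughly half of the total usage has accumulated.
pub fn pick_boundary<'a>(accounts: &[(&'a str, u64)]) -> &'a str {
    let total: u64 = accounts.iter().map(|(_, weight)| weight).sum();
    let mut cumulative = 0u64;
    for pair in accounts.windows(2) {
        cumulative += pair[0].1;
        if cumulative * 2 >= total {
            // Split before the next account: everything accumulated so far
            // lands in the left child.
            return pair[1].0;
        }
    }
    // Degenerate case (zero or one account): fall back to the last one.
    accounts.last().expect("no accounts").0
}
```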

### Fixed shards

Fixed shards is a feature of the protocol that allows assigning specific accounts, and all of their recursive sub-accounts, to a predetermined shard. This feature was only used for testing; it was never used in production and there is no need for it there. Unfortunately, fixed shards break the contiguity of shards: a sub-account of a fixed-shard account can fall in the middle of the account range that belongs to a different shard. This property makes fixed shards particularly hard to reason about and makes efficient resharding hard to implement.

The removal of fixed shards was implemented ahead of this NEP.

### Transaction pool

The transaction pool is sharded, i.e. it groups transactions by the shard where each should be converted to a receipt. The transaction pool was previously sharded by ShardId. Unfortunately, ShardId is insufficient to correctly identify a shard across a resharding event, as ShardIds change domain. The transaction pool was migrated to group transactions by ShardUId instead, and a transaction pool resharding was implemented to reassign transactions from the parent shard to the children shards right before the new shard layout takes effect.

This was implemented ahead of this NEP.
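A minimal sketch of the migrated pool. The types are simplified stand-ins for the nearcore ones, and `reshard` models the reassignment that happens right before the switch to the new layout.

```rust
use std::collections::HashMap;

// Sketch only: simplified stand-ins for the nearcore types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Transaction {
    pub signer: String,
}

/// A transaction pool grouped by ShardUId rather than ShardId, so that
/// entries from before and after a resharding never collide.
#[derive(Default)]
pub struct ShardedTxPool {
    pub pools: HashMap<ShardUId, Vec<Transaction>>,
}

impl ShardedTxPool {
    pub fn insert(&mut self, shard: ShardUId, tx: Transaction) {
        self.pools.entry(shard).or_default().push(tx);
    }

    /// Right before the new shard layout takes effect, move the parent
    /// shard's transactions into the child shards. `route` maps a
    /// transaction to an index into `children` (e.g. by signer account).
    pub fn reshard(
        &mut self,
        parent: ShardUId,
        children: &[ShardUId],
        route: impl Fn(&Transaction) -> usize,
    ) {
        for tx in self.pools.remove(&parent).unwrap_or_default() {
            let child = children[route(&tx)];
            self.pools.entry(child).or_default().push(tx);
        }
    }
}
```

Keying by (version, shard id) means a parent and its child can share a shard id without their pooled transactions ever being mixed up.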

## Security Implications

[Explicitly outline any security concerns in relation to the NEP, and potential ways to resolve or mitigate them. At the very least, well-known relevant threats must be covered, e.g. person-in-the-middle, double-spend, XSS, CSRF, etc.]

## Alternatives

* Why is this design the best in the space of possible designs?
  * This design is the simplest, most robust and safest option that meets all of the requirements.
* What other designs have been considered and what is the rationale for not choosing them?
  * Splitting the trie by iterating over the boundaries between children shards for each trie record type. This implementation has the potential to be faster, but it is more complex and would take longer to implement. We opted for the much simpler flat-storage-based approach, given that it is already quite performant.
  * Changing the trie structure to put the account id first and the type of record later. This change would allow much faster resharding by only iterating over the nodes on the boundary. It has two major drawbacks: 1) it would require a massive migration, and 2) we would need to maintain both the old and the new trie structure forever.
  * Changing the storage structure so that the storage key has the format account_id.node_hash. This structure would make it much easier to split the trie at the storage level, because the children shards are simple sub-ranges of the parent shard. Unfortunately, we found that the migration would not be feasible.
  * Changing the storage structure so that the key is only the node_hash. This is a feasible approach, but it adds complexity to garbage collection and data deletion. We opted for the much simpler option of keeping the existing scheme of prefixing storage entries by shard uid.
* What is the impact of not doing this?
  * We need resharding in order to scale up the system. Without resharding, shards would eventually grow so big (in either storage or CPU usage) that a single node would not be able to handle them.

## Integration with State Sync

There are two known issues in the integration of resharding and state sync:

* When syncing the state for the first epoch where the new shard layout is used. In this case the node needs to apply the last block of the previous epoch. This cannot be done on the children shards, as on chain the block was applied on the parent shards and the trie-related gas costs would be different.
* When generating proofs for incoming receipts. The proof for each of the children shards contains only the receipts of that shard, but it is generated on the parent shard layout and so may not be verifiable.
In this NEP we propose that resharding should be rolled out first, before any real dependency on state sync is added. We can then safely roll out the resharding logic and solve the above-mentioned issues separately.

## Integration with Stateless Validation

Stateless Validation requires that chunk producers provide proofs of correctness of the state transition from one state root to another. For the first block after the new shard layout takes effect, that proof will need to show that the entire state split was correct, in addition to the state transition itself.

In this NEP we propose that resharding should be rolled out first, before stateless validation. We can then safely roll out the resharding logic and solve the above-mentioned issues separately.

## Future possibilities

As noted above, dynamic resharding is out of scope for this NEP and should be implemented in the future. Dynamic resharding includes, but is not limited to, the following:

* automatic determination of split boundary
* automatic shard splitting and merging based on traffic

Other useful features that can be considered as a follow up:

* account colocation for low latency across account calls
* removal of shard uids and introducing globally unique shard ids
* shard on demand

## Consequences

### Positive

* Workload across shards will be more evenly distributed.
* Required space to maintain state (either in memory or in persistent disk) will be smaller.
* State sync overhead will be smaller.
* TBD

### Neutral

* Number of shards is expected to increase.
* Underlying trie structure and data structure are not going to change.
* Resharding will create dependency on flat storage, flat state snapshots and state sync. TODO - what dependency on state sync?

### Negative

* The resharding process is still not fully automated. Analyzing shard data, determining the split boundary, and triggering an actual shard split all need to be done manually.
* During resharding, a node is expected to do more work, as it will first need to copy a lot of data around and then will have to apply changes twice (for the current shard and the future shard).
* Increased potential for apps and tools to break without proper shard layout change handling.

### Backwards Compatibility

We do not expect anything to break with this change. Yet, shard splitting can introduce additional complexity for replayability. For instance, as the target shard of a receipt and the shard an account belongs to can change with shard splitting, shard splitting must be replayed along with transactions at the exact epoch boundary.

## Unresolved Issues (Optional)

[Explain any issues that warrant further discussion. Considerations

* What parts of the design do you expect to resolve through the NEP process before this gets merged?
* What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
* What related issues do you consider out of scope for this NEP that could be addressed in the future independently of the solution that comes out of this NEP?]

## Changelog

[The changelog section provides historical context for how the NEP developed over time. Initial NEP submission should start with version 1.0.0, and all subsequent NEP extensions must follow [Semantic Versioning](https://semver.org/). Every version should have the benefits and concerns raised during the review. The author does not need to fill out this section for the initial draft. Instead, the assigned reviewers (Subject Matter Experts) should create the first version during the first technical review. After the final public call, the author should then finalize the last version of the decision context.]

### 1.0.0 - Initial Version

> Placeholder for the context about when and who approved this NEP version.

#### Benefits

> List of benefits filled by the Subject Matter Experts while reviewing this version:

* Benefit 1
* Benefit 2

#### Concerns

> Template for Subject Matter Experts review for this version:
> Status: New | Ongoing | Resolved

| # | Concern | Resolution | Status |
| --: | :------ | :--------- | -----: |
| 1 | | | |
| 2 | | | |

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).