diff --git a/neps/nep-0568.md b/neps/nep-0568.md
index 07ce68761..b44878e81 100644
--- a/neps/nep-0568.md
+++ b/neps/nep-0568.md
@@ -107,6 +107,28 @@ Splitting a shard's Flat State is performed in multiple steps:
 snapshots and to reload Mem Tries.
 
 ### State Storage - State
+// TODO Describe integration with cold storage once design is ready
+
+Each shard’s Trie is stored in the `State` column of the database, with keys prefixed by `ShardUId`, followed by a node’s hash.
+This structure uniquely identifies each shard’s data. To avoid copying all entries under a new `ShardUId` during resharding,
+a mapping strategy allows child shards to access ancestor shard data without directly creating new entries.
+
+A naive approach to resharding would involve copying all `State` entries with a new `ShardUId` for a child shard, effectively duplicating the state.
+This method, while straightforward, is not feasible because copying a large state would take too much time.
+Resharding needs to appear complete between two blocks, so a direct copy would not allow the process to occur quickly enough.
+
+To address this, Resharding V3 employs an efficient mapping strategy, using the `DBCol::StateShardUIdMapping` column
+to link each child shard’s `ShardUId` to the closest ancestor’s `ShardUId` holding the relevant data.
+This allows child shards to access and update state data under the ancestor shard’s prefix without duplicating entries.
+
+Initially, `StateShardUIdMapping` is empty, as existing shards map to themselves. During resharding, a mapping entry is added to `StateShardUIdMapping`,
+pointing each child shard’s `ShardUId` to the appropriate ancestor. Mappings persist as long as any descendant shard references the ancestor’s data.
+Once a node stops tracking all children and descendants of a shard, the entry for that shard can be removed, allowing its data to be garbage collected.
+For archival nodes, mappings are retained indefinitely to maintain access to the full historical state.
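The prefix mapping described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ToyStore` and its two `HashMap` fields are hypothetical stand-ins for the `State` and mapping columns, and the byte layout in `to_bytes` is assumed for illustration, not taken from the real implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `ShardUId`, serialized to an 8-byte prefix.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardUId {
    version: u32,
    shard_id: u32,
}

impl ShardUId {
    /// Assumed layout: version, then shard id, both little-endian.
    fn to_bytes(self) -> [u8; 8] {
        let mut out = [0u8; 8];
        out[0..4].copy_from_slice(&self.version.to_le_bytes());
        out[4..8].copy_from_slice(&self.shard_id.to_le_bytes());
        out
    }
}

/// Toy model: `state` plays the role of the `State` column,
/// `mapping` the role of the child-to-ancestor mapping column.
#[derive(Default)]
struct ToyStore {
    state: HashMap<[u8; 40], Vec<u8>>,
    mapping: HashMap<ShardUId, ShardUId>,
}

impl ToyStore {
    /// Build the 40-byte key: resolved shard prefix followed by the node hash.
    /// A shard without a mapping entry maps to itself.
    fn key(&self, shard_uid: ShardUId, hash: &[u8; 32]) -> [u8; 40] {
        let mapped = self.mapping.get(&shard_uid).copied().unwrap_or(shard_uid);
        let mut key = [0u8; 40];
        key[0..8].copy_from_slice(&mapped.to_bytes());
        key[8..].copy_from_slice(hash);
        key
    }

    /// Write a node under the resolved prefix.
    fn put(&mut self, shard_uid: ShardUId, hash: &[u8; 32], value: &[u8]) {
        let key = self.key(shard_uid, hash);
        self.state.insert(key, value.to_vec());
    }

    /// Read a node through the resolved prefix.
    fn get(&self, shard_uid: ShardUId, hash: &[u8; 32]) -> Option<&Vec<u8>> {
        self.state.get(&self.key(shard_uid, hash))
    }
}
```

In this model, inserting a single `child -> parent` entry into `mapping` makes every node previously written under the parent's prefix readable through the child's `ShardUId`, and later writes for the child land under the parent's prefix as well. No `State` entry is copied, which is the property the naive approach lacks.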
+
+This mapping strategy enables efficient shard management during resharding events,
+supporting smooth transitions without altering storage structures directly.
+
 
 ### Stateless Validation
 
@@ -134,6 +156,90 @@ Splitting a shard's Flat State is performed in multiple steps:
 The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.]
 ```
 
+### State Storage - State mapping
+
+To enable efficient shard state management during resharding, Resharding V3 uses the `DBCol::StateShardUIdMapping` column.
+This mapping allows child shards to reference ancestor shard data, avoiding the need for immediate duplication of state entries.
+
+#### Mapping application in adapters
+
+The core of the mapping logic is applied in `TrieStoreAdapter` and `TrieStoreUpdateAdapter`, which act as layers over the general `Store` interface.
+Here’s a breakdown of the key functions involved:
+
+- **Key resolution**:
+  The `get_key_from_shard_uid_and_hash` function is central to determining the correct `ShardUId` for state access.
+  At a high level, operations use the child shard's `ShardUId`, but within this function,
+  the `DBCol::StateShardUIdMapping` column is checked to determine if an ancestor `ShardUId` should be used instead.
+
+  ```rust
+  fn get_key_from_shard_uid_and_hash(
+      store: &Store,
+      shard_uid: ShardUId,
+      hash: &CryptoHash,
+  ) -> [u8; 40] {
+      let mapped_shard_uid = store
+          .get_ser::<ShardUId>(DBCol::StateShardUIdMapping, &shard_uid.to_bytes())
+          .expect("get_key_from_shard_uid_and_hash() failed")
+          .unwrap_or(shard_uid);
+      let mut key = [0; 40];
+      key[0..8].copy_from_slice(&mapped_shard_uid.to_bytes());
+      key[8..].copy_from_slice(hash.as_ref());
+      key
+  }
+  ```
+
+  This function first attempts to retrieve a mapped ancestor `ShardUId` from `DBCol::StateShardUIdMapping`.
+  If no mapping exists, it defaults to the provided child `ShardUId`.
+  This resolved `ShardUId` is then combined with the node hash (`hash`) to form the final key used in `State` column operations.
+
+- **State access operations**:
+  The `TrieStoreAdapter` and `TrieStoreUpdateAdapter` use `get_key_from_shard_uid_and_hash` to correctly resolve the key for both reads and writes.
+  Example methods include:
+
+  ```rust
+  // In TrieStoreAdapter
+  pub fn get(&self, shard_uid: ShardUId, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> {
+      let key = get_key_from_shard_uid_and_hash(self.store, shard_uid, hash);
+      self.store.get(DBCol::State, &key)
+  }
+
+  // In TrieStoreUpdateAdapter
+  pub fn increment_refcount_by(
+      &mut self,
+      shard_uid: ShardUId,
+      hash: &CryptoHash,
+      data: &[u8],
+      increment: NonZero<u32>,
+  ) {
+      let key = get_key_from_shard_uid_and_hash(self.store, shard_uid, hash);
+      self.store_update.increment_refcount_by(DBCol::State, key.as_ref(), data, increment);
+  }
+  ```
+  The `get` function retrieves data using the resolved `ShardUId` and key, while `increment_refcount_by` manages reference counts,
+  ensuring correct tracking even when accessing data under an ancestor shard.
+
+#### Mapping retention and cleanup
+
+Mappings in `DBCol::StateShardUIdMapping` persist as long as any descendant relies on an ancestor’s data.
+To manage this, the `set_shard_uid_mapping` function in `TrieStoreUpdateAdapter` adds a new mapping during resharding:
+```rust
+fn set_shard_uid_mapping(&mut self, child_shard_uid: ShardUId, parent_shard_uid: ShardUId) {
+    self.store_update.set(
+        DBCol::StateShardUIdMapping,
+        child_shard_uid.to_bytes().as_ref(),
+        &borsh::to_vec(&parent_shard_uid).expect("Borsh serialize cannot fail"),
+    )
+}
+```
+
+When a node stops tracking all descendants of a shard, the associated mapping entry can be removed, allowing RocksDB to perform garbage collection.
+For archival nodes, mappings are retained permanently to ensure access to the historical state of all shards.
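One possible reading of the retention rule can be sketched as follows. Everything here is hypothetical: the tuple alias for `ShardUId`, the `prune_mapping` helper, and the `is_archival` flag are illustrative names, and the text above does not specify the actual cleanup procedure or when it runs.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for `ShardUId`: (version, shard_id).
type ShardUId = (u32, u32);

/// Prefix under which a shard's data lives: the mapped ancestor, or the shard itself.
fn resolve(mapping: &HashMap<ShardUId, ShardUId>, shard: ShardUId) -> ShardUId {
    mapping.get(&shard).copied().unwrap_or(shard)
}

/// Drop mapping entries for shards the node no longer tracks (archival nodes
/// keep every entry), then return the prefixes still referenced by a tracked
/// shard. Data under any other prefix is a candidate for garbage collection.
fn prune_mapping(
    mapping: &mut HashMap<ShardUId, ShardUId>,
    tracked: &HashSet<ShardUId>,
    is_archival: bool,
) -> HashSet<ShardUId> {
    if !is_archival {
        mapping.retain(|child, _| tracked.contains(child));
    }
    // Reborrow immutably for the read-only pass below.
    let mapping: &HashMap<ShardUId, ShardUId> = mapping;
    tracked.iter().map(|shard| resolve(mapping, *shard)).collect()
}
```

Under this assumed model, a parent's prefix stays live while at least one tracked child still maps to it. Once no tracked shard resolves to that prefix, the mapping entries are gone and the data under the prefix is no longer reachable through any key the node will construct.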
+
+This implementation ensures efficient and scalable shard state transitions,
+allowing child shards to use ancestor data without creating redundant entries.
+
+
+
 ## Security Implications
 
 ```text