-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
archive: Introduce archive_storageDiff for indexers #159
Conversation
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
I wonder about the performance implications of eg not providing prefixes (or providing many or large ones), but think that the streaming makes sense as a way of at least mitigating the performance hit of this call on the node! As an API though, off the top of my head it looks solid and feels like it would cover the use cases I can think of! |
- `prefixes` (optional): Array of JSON objects describing how the storage difference will be calculated. Each object contains the following fields: | ||
- `key`: String containing the hexadecimal-encoded key prefix under which the storage difference is calculated. | ||
- `type`: String equal to one of: `value`, `hash`, `none`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Calculating under a prefix" doesn't mean anything to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added a new sentace to describe this a bit better: " Only the storage entries whose key starts with the provided prefix are returned." , thanks!
- `type`: String equal to one of: `value`, `hash`, `none`. | ||
- containing the key prefixes for which the storage difference will be calculated. If this parameter is not provided, the storage difference is calculated for all keys. | ||
- `excludeKeyPrefixes` (optional): Array of strings containing the key prefixes for which the storage difference will not be calculated. If this parameter is not provided, the storage difference is calculated for all keys. | ||
- `childTrie` (optional): A string containing the hexadecimal-encoded key of the child trie of the "default" namespace. If this parameter is not provided, the storage difference is calculated for the main storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that you just put this parameter here without giving much thoughts to it in order to get child tries out of the way, but it is very incoherent with the rest of the design of the function.
The function lets you query the difference between a range of blocks and many different prefixes, but then you would have to send a separate JSON-RPC request for each child trie one by one? That's extremely weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, I wondering as well if that's the right place to put it. Have extended it similar to:
items [
{
(optional) prefixes...
trieType: mainTrie | childTrie
(optional) childTrieKey
}
]
This should provide the storage difference of multiple tries.
I've added the nonoptional trieType
because it felt odd for users to say:
items: [ { }, { "childKey": ...}]
(ie have an empty object for main trie queries)
- `excludeKeyPrefixes` (optional): Array of strings containing the key prefixes for which the storage difference will not be calculated. If this parameter is not provided, the storage difference is calculated for all keys. | ||
- `childTrie` (optional): A string containing the hexadecimal-encoded key of the child trie of the "default" namespace. If this parameter is not provided, the storage difference is calculated for the main storage. | ||
|
||
**Return value**: String containing an opaque value representing the operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of simplicity, I've been trying to avoid subscriptions for all archive
-related functions.
The subscription functions are important for chainHead
, because you want be able to cancel operations in order to keep bandwidth usage to a minimum, and because you want to be able to cancel operations related to blocks that end up not being canonical.
However, none of these two things apply here (and don't apply for any archive
-prefix function), so I don't really see the point of having a subscription.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that the point of the subscription here is that calculating the diffs can be quite slow, so by streaming them (there can be a bunch of prefixes asked for) the user can stop
them (and be asked to continue
) if they find what they need or decide they don't need the server to continue working?
I agree though that it'd be much nicer to avoid the subscriptions though, so if this isn't an issue then it'd be great to simplify it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they find what they need or decide they don't need the server to continue working?
In my opinion it is a better idea for the client to send queries progressively, for example query from block 1 to 100, then block 101 to 200, then block 201 to 300, etc.
It is not so different to do so compared to sending a "continue", except that the client has more control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like this approach! Turning this subscription into a method is easier to follow from the user's perspective. I'll do some performance measurements for storage differences once we implement this in Substrate.
We could probably add a disk cache to make queries a bit faster if that turns to be a performance issue
|
||
**Parameters**: | ||
|
||
- `currentHash`: String containing a hexadecimal-encoded hash of the header of the block whose storage difference will be retrieved. The storage difference is calculated between the `currentHash` block and the parent of the `currentHash` block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I don't think that the naming is appropriate. What exactly is "current" here, given that the block can be very old?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that makes sense, have applied James suggestion of naming currentHash
-> hash
and fromHash
-> previousHash
🙏
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
{ | ||
"prefixes": [ | ||
{ | ||
"key": "0x...", | ||
"type": "value" | "hash" | "none", | ||
}, | ||
], | ||
|
||
"trieType": "mainTrie" | "childTrie", | ||
"childTrieKey": "0x...", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a little off to me that we have two arrays nested inside eachother in order to presumably support child tree access. I wonder whether we could simplify items to be:
"items": [
{
"key": "0x...",
"returnType": "value" | "hash" | "none",
"childTrie": "0x..." | null,
}
]
If "childTrie" is provided and not null, then we try accessing said child trie, else we assume we're trying to access the main trie (which I guess is the overwhelming default for most, but not 100% sure!).
I also renamed "type" to "returnType" just because it felt a bit clearer (ie we are not defining the type of the key or anything, just the type of the things returned from this), but happy with either!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that makes sense! While at it, I've made key
and childTrie
optional
The JSON object returned by this function has the following format: | ||
|
||
```json | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did the above, we'd also tweak this to be a single array whose entries contained a "childTrieKey" if one was provided in the input for that key ie
"result": [
{
"key": "0x...",
"value": "0x...",
"hash": "0x...",
"type": "added" | "modified" | "deleted",
"childTrie": "0x..." | null,
},
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that "prefixes" is optional. If it's not provided, is the point that this method will provide all chanegs in the whole trie?
So either this suggestion wouldn't allow that, or one could set a key like "0x" (or not provide one) to indicate that we are searching for diffs under everything in that trie? (This would allow for the type to be specified I guess too, which isn't currently possible if no prefixes are given)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think the API looks cleaner with this suggestion, I made the key
optional instead of 0x
🙏
|
||
The `differences` field is an array of objects describing storage differences. Each element contains the following fields: | ||
|
||
- `key`: String containing the hexadecimal-encoded key of the storage entry. If the key prefix was provided in the `items` parameter, the `key` field is guaranteed to start with one of the key prefixes provided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the key prefix was provided in the
items
parameter, thekey
field is guaranteed to start with one of the key prefixes provided.
What if the key prefix was not provided in the items
parameter?
Or perhaps this is aiming for something more like: "String containing the hexadecimal-encoded key of the storage entry. A prefix of this key will have been provided in the items
input"? I might not 100% understand!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the suggestion clarifies things, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not that familiar with the storage and child storage stuff to understand why the API actually needs to expose whether the storage keys/prefixes is a main or child storage.
It's quite complicated API to use in my opinion but most likely required for a reason that I don't understand but I would be much happier to just have not expose "trieType" and "childTrieKey" and let node just inspect the filter of the storage keys.
"items": [
{
"key": "0x1.",
"type": "value" | "hash" | "none",
}
{
"key": "0x2",
"type": "value" | "hash" | "none",
}
]
In the same manner, make "items" and "trieType" mutually exclusive.
Such one can query:
- [some storage keys],
- main storage
- child storage.
Then also I would be happy to make "type" optional and let the node display the key i.e, "hash" or "value"
I haven't looked at the implementation but just my comments of the overall API.
@lexnv just bumping this incase it fell off the radar! It'd be good to address the comments and get it through review so we can implement it, but I know you're prob busy so not urgent :) |
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Just one sentence I spotted that needed a wee update after other changes :)
Signed-off-by: Alexandru Vasile <[email protected]>
@tomaka Would love to get your thoughts on this, thanks 🙏 |
So this looks good, but I'm still hesitant about not using a subscription here. If a client asks for a storage diff that is particularly large (ie between two quite distant blocks) I imagine it could be expensive to compute (in terms of the size of response needed as well as time taken to build it). Would this open the node to DoS attacks (clients asking for storage diffs known to be huge)? Or, to avoid this, would some storage diffs simply lead to an error because they are too expensive to compute and so the node refuses? With a subscription based approach we could do something like progressively streaming the diffs we see as we work from first block to last block. The server wouldn't have to accumulate so much in one go that way, and could also take advantage of backpressure to kill connections to clients that were too slow to receive data back or whatever. @tomaka I think you preferred a non subscription based approach so perhaps you have an idea already on how to avoid these issues (or maybe don't think that there is any issue in the first place) without needing subscriptions? |
Merging this to focus a bit on the implementation, we'll re-evaluate the subscription-based approach after we extract some performance metrics from the new RPC methods, thanks everyone for the feedback and review 🙏 Please feel free to open any issues / PRs to further discuss this, thanks again! |
This PR introduces a new unstable method to the archive class.
The method
archive_unstable_storageDiff
aims to help indexers in providing storage differences between two blocks.By default, the method returns the difference between
currentHash
and its parent.Furthermore, users have the ability to specify the keys of interest:
For each individual prefix, the users can specify if they need:
value
: Returns the value of the key from thecurrentHash
blockhash
: Similar to value but returns the hashnone
: Users are only interested to see if the key was modified, added or removed.For symmetry with the chainHead functions:
storageDiffStop
method is used to stop the subscriptionstorageDiffContinue
method is used for resuming storage differencesCloses: #108
cc @paritytech/subxt-team @tomaka @jsdw @josepot