Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive: Introduce archive_storageDiff for indexers #159

Merged
merged 18 commits into from
Sep 30, 2024
Merged

Conversation

lexnv
Copy link
Contributor

@lexnv lexnv commented Aug 1, 2024

This PR introduces a new unstable method to the archive class.

The method archive_unstable_storageDiff aims to help indexers in providing storage differences between two blocks.

By default, the method returns the difference between currentHash and its parent.

Furthermore, users have the ability to specify the keys of interest:

  • prefixes: A list of key prefixes under which the users will receive notifications
  • exclude key prefixes: A list of key prefixes to exclude from the results
  • childTrie: Useful for computing the difference for a child trie instead of the main storage trie

For each individual prefix, the users can specify if they need:

  • value: Returns the value of the key from the currentHash block
  • hash: Similar to value but returns the hash
  • none: Users are only interested to see if the key was modified, added or removed.

For symmetry with the chainHead functions:

  • a storageDiffStop method is used to stop the subscription
  • a storageDiffContinue method is used for resuming storage differences

Closes: #108

cc @paritytech/subxt-team @tomaka @jsdw @josepot

@lexnv lexnv self-assigned this Aug 1, 2024
@jsdw
Copy link
Collaborator

jsdw commented Aug 5, 2024

I wonder about the performance implications of eg not providing prefixes (or providing many or large ones), but think that the streaming makes sense as a way of at least mitigating the performance hit of this call on the node! As an API though, off the top of my head it looks solid and feels like it would cover the use cases I can think of!

src/api/archive_unstable_storageDiff.md Outdated Show resolved Hide resolved
src/api/archive_unstable_storageDiff.md Outdated Show resolved Hide resolved
Comment on lines 27 to 29
- `prefixes` (optional): Array of JSON objects describing how the storage difference will be calculated. Each object contains the following fields:
- `key`: String containing the hexadecimal-encoded key prefix under which the storage difference is calculated.
- `type`: String equal to one of: `value`, `hash`, `none`.
Copy link
Contributor

@tomaka tomaka Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Calculating under a prefix" doesn't mean anything to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added a new sentace to describe this a bit better: " Only the storage entries whose key starts with the provided prefix are returned." , thanks!

src/api/archive_unstable_storageDiff.md Outdated Show resolved Hide resolved
- `type`: String equal to one of: `value`, `hash`, `none`.
- containing the key prefixes for which the storage difference will be calculated. If this parameter is not provided, the storage difference is calculated for all keys.
- `excludeKeyPrefixes` (optional): Array of strings containing the key prefixes for which the storage difference will not be calculated. If this parameter is not provided, the storage difference is calculated for all keys.
- `childTrie` (optional): A string containing the hexadecimal-encoded key of the child trie of the "default" namespace. If this parameter is not provided, the storage difference is calculated for the main storage.
Copy link
Contributor

@tomaka tomaka Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that you just put this parameter here without giving much thoughts to it in order to get child tries out of the way, but it is very incoherent with the rest of the design of the function.
The function lets you query the difference between a range of blocks and many different prefixes, but then you would have to send a separate JSON-RPC request for each child trie one by one? That's extremely weird to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, I wondering as well if that's the right place to put it. Have extended it similar to:

items [
 {
    (optional) prefixes...
    trieType: mainTrie | childTrie
    (optional) childTrieKey
 } 
]

This should provide the storage difference of multiple tries.
I've added the nonoptional trieType because it felt odd for users to say:
items: [ { }, { "childKey": ...}] (ie have an empty object for main trie queries)

- `excludeKeyPrefixes` (optional): Array of strings containing the key prefixes for which the storage difference will not be calculated. If this parameter is not provided, the storage difference is calculated for all keys.
- `childTrie` (optional): A string containing the hexadecimal-encoded key of the child trie of the "default" namespace. If this parameter is not provided, the storage difference is calculated for the main storage.

**Return value**: String containing an opaque value representing the operation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of simplicity, I've been trying to avoid subscriptions for all archive-related functions.
The subscription functions are important for chainHead, because you want be able to cancel operations in order to keep bandwidth usage to a minimum, and because you want to be able to cancel operations related to blocks that end up not being canonical.
However, none of these two things apply here (and don't apply for any archive-prefix function), so I don't really see the point of having a subscription.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that the point of the subscription here is that calculating the diffs can be quite slow, so by streaming them (there can be a bunch of prefixes asked for) the user can stop them (and be asked to continue) if they find what they need or decide they don't need the server to continue working?

I agree though that it'd be much nicer to avoid the subscriptions though, so if this isn't an issue then it'd be great to simplify it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they find what they need or decide they don't need the server to continue working?

In my opinion it is a better idea for the client to send queries progressively, for example query from block 1 to 100, then block 101 to 200, then block 201 to 300, etc.
It is not so different to do so compared to sending a "continue", except that the client has more control.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this approach! Turning this subscription into a method is easier to follow from the user's perspective. I'll do some performance measurements for storage differences once we implement this in Substrate.

We could probably add a disk cache to make queries a bit faster if that turns to be a performance issue


**Parameters**:

- `currentHash`: String containing a hexadecimal-encoded hash of the header of the block whose storage difference will be retrieved. The storage difference is calculated between the `currentHash` block and the parent of the `currentHash` block.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't think that the naming is appropriate. What exactly is "current" here, given that the block can be very old?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that makes sense, have applied James suggestion of naming currentHash -> hash and fromHash -> previousHash 🙏

@lexnv lexnv requested review from jsdw and tomaka August 20, 2024 15:41
Comment on lines 13 to 23
{
"prefixes": [
{
"key": "0x...",
"type": "value" | "hash" | "none",
},
],

"trieType": "mainTrie" | "childTrie",
"childTrieKey": "0x...",
},
Copy link
Collaborator

@jsdw jsdw Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a little off to me that we have two arrays nested inside eachother in order to presumably support child tree access. I wonder whether we could simplify items to be:

"items": [
    {
        "key": "0x...",
        "returnType": "value" | "hash" | "none",
        "childTrie": "0x..." | null,
    }
]

If "childTrie" is provided and not null, then we try accessing said child trie, else we assume we're trying to access the main trie (which I guess is the overwhelming default for most, but not 100% sure!).

I also renamed "type" to "returnType" just because it felt a bit clearer (ie we are not defining the type of the key or anything, just the type of the things returned from this), but happy with either!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that makes sense! While at it, I've made key and childTrie optional

The JSON object returned by this function has the following format:

```json
{
Copy link
Collaborator

@jsdw jsdw Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did the above, we'd also tweak this to be a single array whose entries contained a "childTrieKey" if one was provided in the input for that key ie

"result": [
        {
            "key": "0x...",
            "value": "0x...",
            "hash": "0x...",
            "type": "added" | "modified" | "deleted",
            "childTrie": "0x..." | null,
        },
]

Copy link
Collaborator

@jsdw jsdw Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that "prefixes" is optional. If it's not provided, is the point that this method will provide all chanegs in the whole trie?

So either this suggestion wouldn't allow that, or one could set a key like "0x" (or not provide one) to indicate that we are searching for diffs under everything in that trie? (This would allow for the type to be specified I guess too, which isn't currently possible if no prefixes are given)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think the API looks cleaner with this suggestion, I made the key optional instead of 0x 🙏


The `differences` field is an array of objects describing storage differences. Each element contains the following fields:

- `key`: String containing the hexadecimal-encoded key of the storage entry. If the key prefix was provided in the `items` parameter, the `key` field is guaranteed to start with one of the key prefixes provided.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the key prefix was provided in the items parameter, the key field is guaranteed to start with one of the key prefixes provided.

What if the key prefix was not provided in the items parameter?

Or perhaps this is aiming for something more like: "String containing the hexadecimal-encoded key of the storage entry. A prefix of this key will have been provided in the items input"? I might not 100% understand!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep the suggestion clarifies things, thanks!

Copy link
Member

@niklasad1 niklasad1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that familiar with the storage and child storage stuff to understand why the API actually needs to expose whether the storage keys/prefixes is a main or child storage.

It's quite complicated API to use in my opinion but most likely required for a reason that I don't understand but I would be much happier to just have not expose "trieType" and "childTrieKey" and let node just inspect the filter of the storage keys.

"items": [
    {
                "key": "0x1.",
                "type": "value" | "hash" | "none",
    }
    {
                "key": "0x2",
                "type": "value" | "hash" | "none",
    }
]

In the same manner, make "items" and "trieType" mutually exclusive.

Such one can query:

  1. [some storage keys],
  2. main storage
  3. child storage.

Then also I would be happy to make "type" optional and let the node display the key i.e, "hash" or "value"
I haven't looked at the implementation but just my comments of the overall API.

@jsdw
Copy link
Collaborator

jsdw commented Sep 11, 2024

@lexnv just bumping this incase it fell off the radar! It'd be good to address the comments and get it through review so we can implement it, but I know you're prob busy so not urgent :)

jsdw
jsdw previously approved these changes Sep 17, 2024
Copy link
Collaborator

@jsdw jsdw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just one sentence I spotted that needed a wee update after other changes :)

@lexnv
Copy link
Contributor Author

lexnv commented Sep 20, 2024

@tomaka Would love to get your thoughts on this, thanks 🙏

@jsdw
Copy link
Collaborator

jsdw commented Sep 20, 2024

So this looks good, but I'm still hesitant about not using a subscription here.

If a client asks for a storage diff that is particularly large (ie between two quite distant blocks) I imagine it could be expensive to compute (in terms of the size of response needed as well as time taken to build it). Would this open the node to DoS attacks (clients asking for storage diffs known to be huge)? Or, to avoid this, would some storage diffs simply lead to an error because they are too expensive to compute and so the node refuses?

With a subscription based approach we could do something like progressively streaming the diffs we see as we work from first block to last block. The server wouldn't have to accumulate so much in one go that way, and could also take advantage of backpressure to kill connections to clients that were too slow to receive data back or whatever.

@tomaka I think you preferred a non subscription based approach so perhaps you have an idea already on how to avoid these issues (or maybe don't think that there is any issue in the first place) without needing subscriptions?

@lexnv
Copy link
Contributor Author

lexnv commented Sep 30, 2024

Merging this to focus a bit on the implementation, we'll re-evaluate the subscription-based approach after we extract some performance metrics from the new RPC methods, thanks everyone for the feedback and review 🙏

Please feel free to open any issues / PRs to further discuss this, thanks again!

@lexnv lexnv merged commit 9491d90 into main Sep 30, 2024
3 checks passed
@lexnv lexnv deleted the lexnv/archive_diff branch September 30, 2024 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RPC to return storage diff between start_block and end_block
4 participants