Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op-program: add experimental L2 source flag to enable execution witnesses #12558

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

meyer9
Copy link
Contributor

@meyer9 meyer9 commented Oct 22, 2024

Description

In order to start testing the replacement of debug_dbGet with debug_executionWitness, we need to allow op-program to be configured to use the experimental source without affecting current use cases of the tool.

Because the EL client that will be running debug_executionWitness may be running less stable code, we should allow specifying a separate experimental RPC URL which will fall back to the canonical URL if it fails.

  • Adds --l2.experimental.enabled=<true|false> flag to op-program which enables an experimental client implementation.
  • Adds --l2.experimental=<url> flag to specify the RPC URL of the client to use.

Tests

No tests because nothing functionally changes with this PR.

Additional context

Here are some questions I have for context:

  • I increased the call timeout for RPC - should we make this configurable?
  • Should L2 experimental client URL be consolidated into canonical URL? Separation of the clients in code makes auditing easy.

Next PR will include implementation for this flag.

Comment on lines +59 to +63
// L2URL is the URL of the L2 node to fetch L2 data from, this is the canonical URL for L2 data
// in the case of L2ExperimentalEnabled = true, this URL is used as a fallback if the experimental URL fails or cannot retrieve the desired data
L2URL string
// L2ExperimentalURL is the URL of the L2 node (non hash db archival node, for example, reth archival node) to fetch L2 data from
L2ExperimentalURL string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just use a single L2URL. The fallback is kind of cool but I think we're just adding unnecessary complexity. The L2ExperimentalEnabled flag is useful so we can disable using witnesses entirely.

Then we can first enable witnesses on our vm-runner box that just runs cannon/op-program repeatedly outside of dispute games first. That can be pointed at an experimental version of op-geth if needed and it will build up confidence that it all works properly. Then we can actually enable it for the real op-challenger.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what does vm-runner do exactly? How did you set it up?

We don't have something like this, and the current reason for having this experimental URL is to allow it to run on devnet / testnets so that we can use real data to test if it works.

Still think this might be helpful though? As

  1. This does not change existing code path, you can just configure L2URL and go with the original route
  2. it could help add more testing surface (having it run on testnet or even mainnet without worrying about breaking things)
  3. And it's relatively easy (when we have confidence) to remove it and fallback to single client configuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept this in for now, but should be easy to remove if we decide to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wrote an answer to this yesterday and somehow managed to not post it.

vm-runner is actually the op-challenger run-trace subcommand which can run the trace provider op-challenger uses over and over based on real chain data. We have a couple of boxes in the cloud that run it pointed at op-sepolia and op-mainnet to verify various different configurations correctly validate the chain. So we run cannon with the op-program prestate actually specified in the FaultDisputeGame on chain, cannon with an updated op-program prestate similar to what will be deployed with holocene (though holocene not yet active), a multithreaded cannon prestate and asterisc/kona.

There's two problems with trying to deploy this to production networks:

  1. You need to update the prestate used on chain so it will emit the new hints. That requires a governance vote to approved the changes.
  2. cannon never actually runs in production because dispute games only get down to the split depth when we deliberately force it to (and spend a lot of ETH in the process).

So I definitely like the idea of a feature toggle to control whether or not we call executionWitness but I don't actually see the need for a separate L2 URL.

Copy link
Collaborator

@0x00101010 0x00101010 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, our case is a little bit different as we want to use reth as the experimental back and geth as the fallback due to geth state growth issues we're facing. Basically our disk size is projected to exceed 30T in Q1 next year and after that point it's gonna be super hard to run a geth hashdb archival node.

That's why we need a different url to provide experimental apis (by reth)

Also

You need to update the prestate used on chain so it will emit the new hints. That requires a governance vote to approved the changes.

From Dockerfile.repro, it seems like any commits to op-program could potentially change the reproducible-prestate? Curious what actually changes the absolute prestate?

cannon never actually runs in production because dispute games only get down to the split depth when we deliberately force it to (and spend a lot of ETH in the process).

This is expected as it's very expensive, is there a good way to test this feature change? My understanding is that cannon execution into specific steps does not change anything as the block we execute would get all the requried input from execution witness and eth_getProof and have them cached in the oracle, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, our case is a little bit different as we want to use reth as the experimental back and geth as the fallback

Ah yeah, that a good reason to justify the extra code. Makes sense.

From Dockerfile.repro, it seems like any commits to op-program could potentially change the reproducible-prestate? Curious what actually changes the absolute prestate?

Yes, any changes to the op-program client will result in a different prestate generated from it. Typically I assume that to get the exact same prestate you have to be on the same commit. But to actually use a different prestate on chain, that new version needs to be set as the absolute prestate in the FaultDisputeGame contract on L1. That requires the ProxyAdminOwner to sign a transaction and should only be done for governance approved versions.

This is expected as it's very expensive, is there a good way to test this feature change? My understanding is that cannon execution into specific steps does not change anything as the block we execute would get all the requried input from execution witness and eth_getProof and have them cached in the oracle, no?

Because there's a new hint type emitted from the client, this will require a new prestate so will change the specific steps that cannon executes. We can switch back and forth between how we fetch the data without any issues once that extra hint is present, it's just adding the hint that changes the actual execution. We'll need to update the prestate with Holocene to support that hard fork anyway so it's pretty likely that emitting the new hint can ship with that (and then we can decide when to depend on executionWitness to actually fetch the data independently).

So the way I'd test this is with the run-trace subcommand. We can easily deploy a new prestate there that emits the hint and still have it verify the actual L2 blocks. It won't produce the same final cannon state as the on-chain version but it should still correctly determine if the proposed output root is valid or not. We can also monitor the logs for any cases where it needs to fall back to debug_dbGet.

panic("unsupported")
}

func (s *L2ExperimentalClient) ExecutionWitness(ctx context.Context, blockNum uint64) (*eth.ExecutionWitness, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just build this into the normal L2Client rather than needing to extend. It's just up to the caller whether they use this or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: db28421

@meyer9 meyer9 force-pushed the meyer9/feat/l2-experimental-client-support branch from 7bd0cd0 to d2926cb Compare October 23, 2024 16:47
@meyer9 meyer9 force-pushed the meyer9/feat/l2-experimental-client-support branch from d2926cb to db28421 Compare October 23, 2024 16:49
@meyer9 meyer9 marked this pull request as ready for review October 23, 2024 16:54
@meyer9 meyer9 requested a review from ajsutton October 23, 2024 16:58
Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally. Just some fairly minor simplifications I think we can make and would be good to have tests for the new CLI flags.

Comment on lines +50 to +59
L2NodeExperimentalAddr = &cli.StringFlag{
Name: "l2.experimental",
Usage: "Address of L2 JSON-RPC endpoint to use for experimental features (debug_executionWitness)",
EnvVars: prefixEnvVars("L2_RPC_EXPERIMENTAL_RPC"),
}
L2NodeExperimentalEnabled = &cli.BoolFlag{
Name: "l2.experimental.enabled",
Usage: "Enable experimental features on the L2 JSON-RPC endpoint, will fallback to L2NodeAddr if not able to retrieve desired data",
EnvVars: prefixEnvVars("L2_RPC_EXPERIMENTAL_ENABLED"),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some tests in main_test.go for these new flags please? It's amazing how often those unit tests have found problems with flags...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since L2NodeExperimentalAddr is required if enabled is true, we should probably just remove l2.experimental.enabled and enable the experimental features if the experimental adds is supplied. Less opportunity to create invalid config that way.

func NewL2Source(ctx context.Context, logger log.Logger, config *config.Config) (*L2Source, error) {
logger.Info("Connecting to canonical L2 source", "url", config.L2URL)

// eth_getProof calls are expensive and takes time, so we use a longer timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this is only true for reth, geth can respond to eth_getProof quickly. btw, it would be worth testing how op-reth handles a eth_getProof request for a quite old block. Typically the anchor state is about 3.5 days old currently, but that will likely become 7 days with some improvements coming to only update it after the air gap so it's more reliable. And it could be longer than that if proposals haven't happened for a while. I suspect op-reth will have some work to do to optimise getProof if it's going to be the only source of data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is what we found (with a 2 second block time):

  • block 12 hrs ago: execution witness took 41s, account proof took 1m9s
  • block 24 hrs ago: execution witness took 1m48s, account proof took 4m
  • block 6 days ago: execution witness took 3m33s, account proof took 8m

We could make the overall timeout configurable (maybe just across all RPC calls for simplicity)? Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be looking at what it takes for a block from at least 14 and maybe even 30 days ago to be honest. 6 days is definitely too short given the air gap will mean the anchor state becomes a minimum of 7 days old. You can imagine if there's an issue with proposals that means we need to to intervene, worst case is that we intervene at the 7 day mark, but then we fix the issue and it will take another 7 days for a correct proposal to finalise and update the anchor state, so you wind up with it being 14 days old.

Probably is worth making it configurable, though we'll then need to add an option to op-challenger so that it can pass through a config option to op-program. I suspect in practice something will need to be done to improve reth's performance here though or it opens up an avenue for an attacker to overwhelm the honest actor's resources by making them compute expensive proofs from old blocks.

Comment on lines +30 to +31
// whether to use the experimental source
experimentalEnabled bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd suggest removing this field and just having:

func (l *L2Source) experimentalEnabled() bool {
  return l.experimentalClient != nil
}

That way the data structure can't be configured wrong, but we still get a nice readable if l.experimentalEnabled()

Comment on lines +127 to +128
l.logger.Error("Experimental source is not enabled, cannot fetch execution witness", "blockNum", blockNum)
panic("experimental source is not enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this should just return an ErrNotEnabled rather than logging and panicing. Looking at #12559 we'll always get the new hint whether this is enabled or not, and the fetcher will have to handle not having witness support enabled.

I know that's a little different to my previous comment about the caller not calling this at all, but I think it probably works out cleaner to just check for a specific error rather than having to pass a witnessEnabled bool through to the prefetcher. The prefetcher would just return a nil error if it got ErrNotEnabled back without storing anything and the fallback to debug_dbGet would automatically kick in as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants