op-program: add experimental L2 source flag to enable execution witnesses #12558

meyer9 · 2024-10-22T02:22:55Z

Description

In order to start testing the replacement of debug_dbGet with debug_executionWitness, we need to allow op-program to be configured to use the experimental source without affecting current use cases of the tool.

Because the EL client that will be running debug_executionWitness may be running less stable code, we should allow specifying a separate experimental RPC URL which will fall back to the canonical URL if it fails.

Adds --l2.experimental.enabled=<true|false> flag to op-program which enables an experimental client implementation.
Adds --l2.experimental=<url> flag to specify the RPC URL of the client to use.

Tests

No tests because nothing functionally changes with this PR.

Additional context

Here are some questions I have for context:

I increased the call timeout for RPC - should we make this configurable?
Should L2 experimental client URL be consolidated into canonical URL? Separation of the clients in code makes auditing easy.

Next PR will include implementation for this flag.

ajsutton · 2024-10-22T02:52:32Z

op-program/host/config/config.go

+	// L2URL is the URL of the L2 node to fetch L2 data from, this is the canonical URL for L2 data
+	// in the case of L2ExperimentalEnabled = true, this URL is used as a fallback if the experimental URL fails or cannot retrieve the desired data
+	L2URL string
+	// L2ExperimentalURL is the URL of the L2 node (non hash db archival node, for example, reth archival node) to fetch L2 data from
+	L2ExperimentalURL string


I'd just use a single L2URL. The fallback is kind of cool but I think we're just adding unnecessary complexity. The L2ExperimentalEnabled flag is useful so we can disable using witnesses entirely.

Then we can first enable witnesses on our vm-runner box that just runs cannon/op-program repeatedly outside of dispute games first. That can be pointed at an experimental version of op-geth if needed and it will build up confidence that it all works properly. Then we can actually enable it for the real op-challenger.

Curious what does vm-runner do exactly? How did you set it up?

We don't have something like this, and the current reason for having this experimental URL is to allow it to run on devnet / testnets so that we can use real data to test if it works.

Still think this might be helpful though? As

This does not change existing code path, you can just configure L2URL and go with the original route

it could help add more testing surface (having it run on testnet or even mainnet without worrying about breaking things)

And it's relatively easy (when we have confidence) to remove it and fallback to single client configuration

Kept this in for now, but should be easy to remove if we decide to remove it.

Sorry I wrote an answer to this yesterday and somehow managed to not post it.

vm-runner is actually the op-challenger run-trace subcommand which can run the trace provider op-challenger uses over and over based on real chain data. We have a couple of boxes in the cloud that run it pointed at op-sepolia and op-mainnet to verify various different configurations correctly validate the chain. So we run cannon with the op-program prestate actually specified in the FaultDisputeGame on chain, cannon with an updated op-program prestate similar to what will be deployed with holocene (though holocene not yet active), a multithreaded cannon prestate and asterisc/kona.

There's two problems with trying to deploy this to production networks:

You need to update the prestate used on chain so it will emit the new hints. That requires a governance vote to approved the changes.

cannon never actually runs in production because dispute games only get down to the split depth when we deliberately force it to (and spend a lot of ETH in the process).

So I definitely like the idea of a feature toggle to control whether or not we call executionWitness but I don't actually see the need for a separate L2 URL.

I see, our case is a little bit different as we want to use reth as the experimental back and geth as the fallback due to geth state growth issues we're facing. Basically our disk size is projected to exceed 30T in Q1 next year and after that point it's gonna be super hard to run a geth hashdb archival node.

That's why we need a different url to provide experimental apis (by reth)

Also

You need to update the prestate used on chain so it will emit the new hints. That requires a governance vote to approved the changes.

From Dockerfile.repro, it seems like any commits to op-program could potentially change the reproducible-prestate? Curious what actually changes the absolute prestate?

cannon never actually runs in production because dispute games only get down to the split depth when we deliberately force it to (and spend a lot of ETH in the process).

This is expected as it's very expensive, is there a good way to test this feature change? My understanding is that cannon execution into specific steps does not change anything as the block we execute would get all the requried input from execution witness and eth_getProof and have them cached in the oracle, no?

I see, our case is a little bit different as we want to use reth as the experimental back and geth as the fallback

Ah yeah, that a good reason to justify the extra code. Makes sense.

From Dockerfile.repro, it seems like any commits to op-program could potentially change the reproducible-prestate? Curious what actually changes the absolute prestate?

Yes, any changes to the op-program client will result in a different prestate generated from it. Typically I assume that to get the exact same prestate you have to be on the same commit. But to actually use a different prestate on chain, that new version needs to be set as the absolute prestate in the FaultDisputeGame contract on L1. That requires the ProxyAdminOwner to sign a transaction and should only be done for governance approved versions.

This is expected as it's very expensive, is there a good way to test this feature change? My understanding is that cannon execution into specific steps does not change anything as the block we execute would get all the requried input from execution witness and eth_getProof and have them cached in the oracle, no?

Because there's a new hint type emitted from the client, this will require a new prestate so will change the specific steps that cannon executes. We can switch back and forth between how we fetch the data without any issues once that extra hint is present, it's just adding the hint that changes the actual execution. We'll need to update the prestate with Holocene to support that hard fork anyway so it's pretty likely that emitting the new hint can ship with that (and then we can decide when to depend on executionWitness to actually fetch the data independently).

So the way I'd test this is with the run-trace subcommand. We can easily deploy a new prestate there that emits the hint and still have it verify the actual L2 blocks. It won't produce the same final cannon state as the on-chain version but it should still correctly determine if the proposed output root is valid or not. We can also monitor the logs for any cases where it needs to fall back to debug_dbGet.

ajsutton · 2024-10-22T02:53:29Z

op-program/host/l2_experimental_client.go

+	panic("unsupported")
+}
+
+func (s *L2ExperimentalClient) ExecutionWitness(ctx context.Context, blockNum uint64) (*eth.ExecutionWitness, error) {


I think we can just build this into the normal L2Client rather than needing to extend. It's just up to the caller whether they use this or not.

Fixed here: db28421

ajsutton

Looks good generally. Just some fairly minor simplifications I think we can make and would be good to have tests for the new CLI flags.

ajsutton · 2024-10-24T00:35:12Z

op-program/host/flags/flags.go

+	L2NodeExperimentalAddr = &cli.StringFlag{
+		Name:    "l2.experimental",
+		Usage:   "Address of L2 JSON-RPC endpoint to use for experimental features (debug_executionWitness)",
+		EnvVars: prefixEnvVars("L2_RPC_EXPERIMENTAL_RPC"),
+	}
+	L2NodeExperimentalEnabled = &cli.BoolFlag{
+		Name:    "l2.experimental.enabled",
+		Usage:   "Enable experimental features on the L2 JSON-RPC endpoint, will fallback to L2NodeAddr if not able to retrieve desired data",
+		EnvVars: prefixEnvVars("L2_RPC_EXPERIMENTAL_ENABLED"),
+	}


Could we add some tests in main_test.go for these new flags please? It's amazing how often those unit tests have found problems with flags...

Actually, since L2NodeExperimentalAddr is required if enabled is true, we should probably just remove l2.experimental.enabled and enable the experimental features if the experimental adds is supplied. Less opportunity to create invalid config that way.

ajsutton · 2024-10-24T00:45:05Z

op-program/host/l2_source.go

+func NewL2Source(ctx context.Context, logger log.Logger, config *config.Config) (*L2Source, error) {
+	logger.Info("Connecting to canonical L2 source", "url", config.L2URL)
+
+	// eth_getProof calls are expensive and takes time, so we use a longer timeout


hmm, this is only true for reth, geth can respond to eth_getProof quickly. btw, it would be worth testing how op-reth handles a eth_getProof request for a quite old block. Typically the anchor state is about 3.5 days old currently, but that will likely become 7 days with some improvements coming to only update it after the air gap so it's more reliable. And it could be longer than that if proposals haven't happened for a while. I suspect op-reth will have some work to do to optimise getProof if it's going to be the only source of data.

Yep this is what we found (with a 2 second block time):

block 12 hrs ago: execution witness took 41s, account proof took 1m9s

block 24 hrs ago: execution witness took 1m48s, account proof took 4m

block 6 days ago: execution witness took 3m33s, account proof took 8m

We could make the overall timeout configurable (maybe just across all RPC calls for simplicity)? Wdyt?

I'd be looking at what it takes for a block from at least 14 and maybe even 30 days ago to be honest. 6 days is definitely too short given the air gap will mean the anchor state becomes a minimum of 7 days old. You can imagine if there's an issue with proposals that means we need to to intervene, worst case is that we intervene at the 7 day mark, but then we fix the issue and it will take another 7 days for a correct proposal to finalise and update the anchor state, so you wind up with it being 14 days old.

Probably is worth making it configurable, though we'll then need to add an option to op-challenger so that it can pass through a config option to op-program. I suspect in practice something will need to be done to improve reth's performance here though or it opens up an avenue for an attacker to overwhelm the honest actor's resources by making them compute expensive proofs from old blocks.

ajsutton · 2024-10-24T00:52:30Z

op-program/host/l2_source.go

+	// whether to use the experimental source
+	experimentalEnabled bool


nit: I'd suggest removing this field and just having:

func (l *L2Source) experimentalEnabled() bool { return l.experimentalClient != nil }

That way the data structure can't be configured wrong, but we still get a nice readable if l.experimentalEnabled()

ajsutton · 2024-10-24T00:59:55Z

op-program/host/l2_source.go

+		l.logger.Error("Experimental source is not enabled, cannot fetch execution witness", "blockNum", blockNum)
+		panic("experimental source is not enabled")


I suspect this should just return an ErrNotEnabled rather than logging and panicing. Looking at #12559 we'll always get the new hint whether this is enabled or not, and the fetcher will have to handle not having witness support enabled.

I know that's a little different to my previous comment about the caller not calling this at all, but I think it probably works out cleaner to just check for a specific error rather than having to pass a witnessEnabled bool through to the prefetcher. The prefetcher would just return a nil error if it got ErrNotEnabled back without storing anything and the fallback to debug_dbGet would automatically kick in as needed.

meyer9 mentioned this pull request Oct 22, 2024

op-program: add execution witness hints to remove dbGet dependency #12559

Open

ajsutton reviewed Oct 22, 2024

View reviewed changes

meyer9 force-pushed the meyer9/feat/l2-experimental-client-support branch from 7bd0cd0 to d2926cb Compare October 23, 2024 16:47

meyer9 added 2 commits October 23, 2024 09:48

feat: add experimental L2 source flag

d6aa063

Move L2 experimental client methods to eth client

db28421

meyer9 force-pushed the meyer9/feat/l2-experimental-client-support branch from d2926cb to db28421 Compare October 23, 2024 16:49

Fix missing return

7413d80

meyer9 marked this pull request as ready for review October 23, 2024 16:54

meyer9 requested a review from ajsutton October 23, 2024 16:58

ajsutton reviewed Oct 24, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

op-program: add experimental L2 source flag to enable execution witnesses #12558

op-program: add experimental L2 source flag to enable execution witnesses #12558

meyer9 commented Oct 22, 2024 •

edited

Loading

ajsutton Oct 22, 2024

0x00101010 Oct 22, 2024

meyer9 Oct 23, 2024

ajsutton Oct 23, 2024

0x00101010 Oct 24, 2024 •

edited

Loading

ajsutton Oct 24, 2024

ajsutton Oct 22, 2024

meyer9 Oct 23, 2024

ajsutton left a comment

ajsutton Oct 24, 2024

ajsutton Oct 24, 2024

ajsutton Oct 24, 2024

meyer9 Oct 24, 2024

ajsutton Oct 24, 2024

ajsutton Oct 24, 2024

ajsutton Oct 24, 2024

		// whether to use the experimental source
		experimentalEnabled bool

		l.logger.Error("Experimental source is not enabled, cannot fetch execution witness", "blockNum", blockNum)
		panic("experimental source is not enabled")

op-program: add experimental L2 source flag to enable execution witnesses #12558

Are you sure you want to change the base?

op-program: add experimental L2 source flag to enable execution witnesses #12558

Conversation

meyer9 commented Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0x00101010 Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajsutton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

meyer9 commented Oct 22, 2024 •

edited

Loading

0x00101010 Oct 24, 2024 •

edited

Loading