Solver participation guard #3257

Open, wants to merge 18 commits into base: main

Contributor

@squadgazzz squadgazzz commented Jan 29, 2025

Description

From the original issue:

When a solver repeatedly wins consecutive auctions but fails to settle its solutions on-chain, it can lead to system downtime. To prevent this, the autopilot must have the capability to temporarily exclude such solvers from participating in competitions. This ensures no single solver can disrupt the system's operations.

This PR implements this by introducing a new struct that checks whether a solver is allowed to participate in the next competition, using two approaches:

  1. Moved the existing Authenticator's is_solver on-chain call into the new struct.
  2. Introduced a new strategy that finds non-settling solvers using a SQL query. It selects the last 3 auctions (configurable) whose deadline has been reached by the current block, to avoid selecting pending settlements, and checks whether all of them were won but not settled by the same solver or solvers (in case of multiple winners). The strategy caches its results to avoid redundant DB queries. The query relies on the auction_id column of the settlements table, which is updated separately by the Observer struct, so the cache is updated only once the Observer has produced a result.
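
The selection logic of the second strategy can be sketched in plain Rust over in-memory data. This is only an illustration of the rule described above, not the SQL query from this PR: `AuctionOutcome` and `find_non_settling_solvers` are hypothetical names, and the input is assumed to contain only auctions whose deadline has already passed.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory view of one finished auction (not the real DB schema).
#[derive(Debug, Clone)]
pub struct AuctionOutcome {
    /// Solvers that won this auction (several in case of multiple winners).
    pub winners: Vec<String>,
    /// Winners whose solution was observed on-chain.
    pub settled_by: Vec<String>,
}

/// Returns the solvers that won every one of the last `last_auctions_count`
/// auctions and settled none of them. `outcomes` is ordered oldest to newest.
pub fn find_non_settling_solvers(
    outcomes: &[AuctionOutcome],
    last_auctions_count: usize,
) -> HashSet<String> {
    // Not enough history to make a decision: ban nobody.
    if last_auctions_count == 0 || outcomes.len() < last_auctions_count {
        return HashSet::new();
    }
    let recent = &outcomes[outcomes.len() - last_auctions_count..];

    // Start with the winners of the oldest auction in the window and keep
    // only those that also won, and failed to settle, every later auction.
    let mut candidates: HashSet<String> = recent[0].winners.iter().cloned().collect();
    for outcome in recent {
        candidates.retain(|solver| {
            outcome.winners.contains(solver) && !outcome.settled_by.contains(solver)
        });
    }
    candidates
}
```

A solver that settles even one auction in the window drops out of the result, which matches the "never settled any of them" wording of the doc comment under review.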

These validators are called sequentially to avoid redundant RPC calls to the Authenticator: the DB-based validator's cache is checked first, and only then is the RPC call sent.
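
A minimal sketch of the sequential check, assuming a simplified synchronous `Validator` trait (the real code is async). `GuardSketch`, `DbBasedValidator`, `OnchainValidator` and the `rpc_calls` counter are hypothetical names used only to show the short-circuiting order.

```rust
use std::cell::Cell;
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical synchronous stand-in for a participation validator.
pub trait Validator {
    fn is_allowed(&self, solver: &str) -> Result<bool, String>;
}

/// Cheap check against a cached set of banned solvers.
pub struct DbBasedValidator {
    pub banned: HashSet<String>,
}

impl Validator for DbBasedValidator {
    fn is_allowed(&self, solver: &str) -> Result<bool, String> {
        Ok(!self.banned.contains(solver))
    }
}

/// Stand-in for the expensive on-chain `is_solver` call; counts invocations.
pub struct OnchainValidator {
    pub allow_list: HashSet<String>,
    pub rpc_calls: Rc<Cell<usize>>,
}

impl Validator for OnchainValidator {
    fn is_allowed(&self, solver: &str) -> Result<bool, String> {
        self.rpc_calls.set(self.rpc_calls.get() + 1);
        Ok(self.allow_list.contains(solver))
    }
}

pub struct GuardSketch {
    validators: Vec<Box<dyn Validator>>,
}

impl GuardSketch {
    pub fn new(validators: Vec<Box<dyn Validator>>) -> Self {
        Self { validators }
    }

    /// Runs the validators in order and stops at the first rejection, so
    /// later (more expensive) validators are skipped for rejected solvers.
    pub fn can_participate(&self, solver: &str) -> Result<bool, String> {
        for validator in &self.validators {
            if !validator.is_allowed(solver)? {
                return Ok(false);
            }
        }
        Ok(true)
    }
}
```

Placing the cache-backed validator first means a solver that is already banned never triggers the simulated RPC call.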

Once one of the strategies reports that the solver is not allowed to participate, the solver is deny-listed for 5 minutes (configurable).
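
The time-limited deny-listing could look roughly like the sketch below. `DenyList` is a hypothetical type, and the current time is passed in explicitly (an assumption made here to keep the example deterministic); the PR's actual cache may work differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical deny-list whose entries expire after a fixed TTL.
pub struct DenyList {
    ttl: Duration,
    banned_until: HashMap<String, Instant>,
}

impl DenyList {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            banned_until: HashMap::new(),
        }
    }

    /// Bans `solver` from `now` until `now + ttl`.
    pub fn ban(&mut self, solver: &str, now: Instant) {
        self.banned_until.insert(solver.to_owned(), now + self.ttl);
    }

    /// Returns true while the ban is active; expired entries are dropped.
    pub fn is_banned(&mut self, solver: &str, now: Instant) -> bool {
        let expired = match self.banned_until.get(solver) {
            Some(until) => now >= *until,
            None => return false,
        };
        if expired {
            self.banned_until.remove(solver);
        }
        !expired
    }
}
```

With a TTL of `Duration::from_secs(300)` this corresponds to the 5-minute default mentioned above.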

Each validator can be enabled/disabled separately in case of any issue.

Metrics

Added a metric that is populated by the DB-based validator whenever a solver is marked as banned. The idea is to create an alert that fires if the same solver is banned more than 4 times within the last 30 minutes, at which point disabling the solver should be considered.
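
The alert condition would normally live in the monitoring stack rather than in the autopilot. Purely to illustrate the threshold, here is a sketch that counts ban events per solver in a sliding window; `solvers_to_alert` and the `(solver, timestamp)` event representation are assumptions, not part of this PR.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Returns the solvers with more than `threshold` ban events inside the
/// window ending at `now_secs`. Events are hypothetical
/// `(solver, unix_timestamp_secs)` pairs.
pub fn solvers_to_alert(
    events: &[(String, u64)],
    now_secs: u64,
    window: Duration,
    threshold: usize,
) -> Vec<String> {
    let window_start = now_secs.saturating_sub(window.as_secs());

    // Count only the events that fall inside [window_start, now_secs].
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for (solver, at) in events {
        if *at >= window_start && *at <= now_secs {
            *counts.entry(solver.as_str()).or_insert(0) += 1;
        }
    }

    let mut result: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n > threshold)
        .map(|(solver, _)| solver.to_owned())
        .collect();
    result.sort(); // deterministic output order
    result
}
```

Calling it with `Duration::from_secs(30 * 60)` and a threshold of `4` reproduces the rule described above.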

Open discussions

  1. Since the current SQL query filters out auctions whose deadline has not been reached, the following case is possible:
    The solver gets banned while it still has a pending settlement. If that settlement lands on-chain, the solver remains banned. Although this is a niche case, it would be better to unban the solver before the cache TTL expires. This is not implemented in the current PR because it requires some refactoring of the Observer struct. If approved, it can be implemented quickly.

  2. Whether it makes sense to introduce a metrics-based strategy, similar to the bad token detector's, where a solver gets banned if >95% (or similar) of its settlements fail.
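
For the second discussion point, a ratio-based check might be as simple as the sketch below. `SettlementStats`, `exceeds_failure_ratio` and especially the `min_samples` guard are assumptions added for illustration; the PR does not specify how such a strategy would be parameterised.

```rust
/// Hypothetical per-solver counters over some recent range of auctions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SettlementStats {
    /// Auctions the solver won.
    pub won: u32,
    /// Won auctions whose settlement did not land on-chain.
    pub failed: u32,
}

/// Returns true if the share of failed settlements is strictly above
/// `max_failure_ratio`. Solvers with fewer than `min_samples` wins are
/// never flagged, to avoid banning on too little data (an assumption).
pub fn exceeds_failure_ratio(
    stats: SettlementStats,
    max_failure_ratio: f64,
    min_samples: u32,
) -> bool {
    if stats.won == 0 || stats.won < min_samples {
        return false;
    }
    f64::from(stats.failed) / f64::from(stats.won) > max_failure_ratio
}
```

A threshold of `0.95` corresponds to the ">95%" figure floated above; the minimum sample size would need tuning against real auction data.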

How to test

A new SQL query test. Existing e2e tests.

Related Issues

Fixes #3221

@squadgazzz squadgazzz changed the title Solver participation validator Solver participation gate Jan 29, 2025
@squadgazzz squadgazzz changed the title Solver participation gate Solver participation guard Jan 29, 2025
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 17:00
@squadgazzz squadgazzz requested a review from a team as a code owner January 29, 2025 17:00
@squadgazzz squadgazzz marked this pull request as draft January 29, 2025 17:01
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 18:00
Comment on lines +143 to +144
/// Finds solvers that won `last_auctions_count` consecutive auctions but
/// never settled any of them. The current block is used to prevent
Contributor

I'd say the problem is not exclusive to solvers that win repeatedly. It's worse for the protocol when a failing solver wins repeatedly, but conceptually the check should prevent any malfunctioning solver from participating in the auction. If it only wins 10% of the cases but fails to settle a majority of them, we should also disable that one.

Contributor Author

> If it only wins 10% of the cases but fails to settle a majority of them we should also disable that one.

Ok, that solves one of the discussion points from the PR description. I'd introduce it in a separate PR then since the current one is already too big.

Contributor Author

So, I think the metrics-based approach needs to be implemented in addition to the current SQL query: the query quickly prevents the protocol from getting stuck (the original issue states only this problem), while the metrics-based approach is more about long-term estimation. Does that make sense?

Contributor

> I think the metrics-based approach needs to be implemented in addition to the current SQL query

I think detecting that X% of won and promised solutions of solver S didn't get onchain over the last M minutes should be doable using the DB, no?

> I'd introduce it in a separate PR then since the current one is already too big.

Makes sense

Contributor Author

> last M minutes

Only if it is the last M auctions, not minutes. I will test the query, since it seems it would need to fetch a much larger range of auctions to build reasonable statistics.

?err,
"failed to check if solver is deny listed"
);
let can_participate = self.solver_participation_guard.can_participate(&driver.submission_address).await.map_err(|err| {
Contributor

This participation guard needs to be opt-in until there is a CIP enforcing it for every solver.

Contributor Author

Ok, this can be achieved by disabling the db-based validator in the config and using this option by default.

Contributor

I meant it should be opt-in on a solver by solver basis.

Contributor Author

Done.

Comment on lines +745 to +746
// Do not send the request to the driver if the solver is deny-listed
if !can_participate {
Contributor

We should either notify the driver on every missed auction or when it gets disabled so that the external team can debug what's wrong immediately.

Contributor Author

Implementing the /notify endpoint is the external team's responsibility, right? So we would just send a request without expecting that all external solvers have implemented it.

Contributor Author

An attempt to implement it in a separate PR #3262

crates/autopilot/src/arguments.rs (outdated, resolved)
@squadgazzz squadgazzz force-pushed the blacklist-failing-solvers branch 2 times, most recently from f69e174 to 5fc831e Compare January 30, 2025 20:11
Development

Successfully merging this pull request may close these issues.

feat: Kick out solver from competition if not settling
2 participants