Solver participation guard #3257
base: main
/// Finds solvers that won `last_auctions_count` consecutive auctions but
/// never settled any of them. The current block is used to prevent
I'd say the problem is not exclusive to solvers that win repeatedly. It's worse for the protocol when a failing solver wins repeatedly, but conceptually the check should prevent any malfunctioning solver from participating in the auction. If it only wins 10% of the cases but fails to settle a majority of them we should also disable that one.
> If it only wins 10% of the cases but fails to settle a majority of them we should also disable that one.
Ok, that solves one of the discussion points from the PR description. I'd introduce it in a separate PR then since the current one is already too big.
So, I think the metrics-based approach needs to be implemented in addition to the current SQL query: the query quickly prevents the protocol from getting stuck (the only problem the original issue describes), while the metrics-based approach is more about long-term estimation. Does that make sense?
> I think the metrics-based approach needs to be implemented in addition to the current SQL query
I think detecting that `X%` of won and promised solutions of solver `S` didn't get onchain over the last `M` minutes should be doable using the DB, no?
> I'd introduce it in a separate PR then since the current one is already too big.
Makes sense
> last `M` minutes
Only over the last `M` auctions, not minutes. I will test the query, since it seems it would need to fetch a much larger range of auctions to build reasonable statistics.
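For illustration, the "`X%` of won solutions not settled over the last `M` auctions" check discussed in this thread could look roughly like the sketch below. This is not the PR's SQL query: `Outcome`, `should_disable`, and the `min_wins` floor are hypothetical names introduced here, and the outcomes are assumed to be already fetched from the DB, ordered oldest first.

```rust
/// Outcome of one auction the solver won (hypothetical type for this sketch).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Outcome {
    Settled,
    Failed,
}

/// Looks at the last `m` won auctions (input is ordered oldest first) and
/// returns `true` if at least `min_wins` of them exist and the share of
/// failed settlements is strictly above `max_failure_ratio`.
pub fn should_disable(
    outcomes: &[Outcome],
    m: usize,
    min_wins: usize,
    max_failure_ratio: f64,
) -> bool {
    // Keep only the most recent `m` entries.
    let recent = &outcomes[outcomes.len().saturating_sub(m)..];
    // Too few samples: do not draw conclusions from weak statistics.
    if recent.len() < min_wins {
        return false;
    }
    let failed = recent.iter().filter(|o| **o == Outcome::Failed).count();
    failed as f64 / recent.len() as f64 > max_failure_ratio
}
```

The `min_wins` floor reflects the concern raised above: a solver that rarely wins needs a larger auction range before the ratio says anything meaningful.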
crates/autopilot/src/domain/competition/solver_participation_guard.rs
?err,
"failed to check if solver is deny listed"
);
let can_participate = self.solver_participation_guard.can_participate(&driver.submission_address).await.map_err(|err| {
This participation guard needs to be opt-in until there is a CIP enforcing it for every solver.
Ok, this can be achieved by disabling the db-based validator in the config and using this option by default.
I meant it should be opt-in on a solver by solver basis.
Done.
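As a rough illustration of what "opt-in on a solver by solver basis" could mean, here is a minimal sketch. The `SolverConfig` struct, its `participation_guard_enabled` field, and `may_participate` are hypothetical names chosen for this example and are not taken from the PR's actual configuration.

```rust
/// Hypothetical per-solver settings; the field names are illustrative only.
pub struct SolverConfig {
    pub name: String,
    /// Whether this solver has opted in to the participation guard.
    pub participation_guard_enabled: bool,
}

/// Applies the guard only to solvers that opted in. `guard_verdict` is
/// evaluated lazily, so solvers that did not opt in never trigger the check.
pub fn may_participate(cfg: &SolverConfig, guard_verdict: impl FnOnce() -> bool) -> bool {
    !cfg.participation_guard_enabled || guard_verdict()
}
```

Passing the verdict as a closure keeps the potentially expensive lookup off the path of solvers that have not opted in.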
// Do not send the request to the driver if the solver is deny-listed
if !can_participate {
We should either notify the driver on every missed auction or when it gets disabled so that the external team can debug what's wrong immediately.
Implementing the `/notify` endpoint is the external team's responsibility, right? So we would just send a request without expecting that all external solvers have implemented it.
An attempt to implement it in a separate PR #3262
f69e174 to 5fc831e
Description
From the original issue:
This PR implements it by introducing a new struct that checks whether a solver is allowed to participate in the next competition, using two different approaches:
- [...] `Authenticator`'s `is_solver` on-chain call into the new struct.
- [...] `Observer` struct, so the cache gets updated only once the `Observer` has some result.
These validators are called sequentially to avoid redundant RPC calls to `Authenticator`: the guard first checks the DB-based validator's cache and only then sends the RPC call.
Once one of the strategies says the solver is not allowed to participate, the solver gets deny-listed for 5 minutes (configurable).
Each validator can be enabled or disabled separately in case of any issue.
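The flow described above (cheap check first, RPC only afterwards, deny-listing with a TTL) can be sketched as follows. This is a simplified, synchronous illustration and not the PR's code: the `Validator` trait, the `ParticipationGuard` type, and passing `now` explicitly are all assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical validator interface; the PR's actual trait may differ.
pub trait Validator {
    /// Returns `true` if the solver may participate.
    fn is_allowed(&self, solver: &str) -> bool;
}

/// Lets plain closures act as validators in this sketch.
impl<F: Fn(&str) -> bool> Validator for F {
    fn is_allowed(&self, solver: &str) -> bool {
        self(solver)
    }
}

pub struct ParticipationGuard {
    /// Ordered from cheapest (e.g. DB cache) to most expensive (e.g. RPC).
    validators: Vec<Box<dyn Validator>>,
    /// Solver -> instant at which the ban expires.
    deny_list: HashMap<String, Instant>,
    ttl: Duration,
}

impl ParticipationGuard {
    pub fn new(validators: Vec<Box<dyn Validator>>, ttl: Duration) -> Self {
        Self { validators, deny_list: HashMap::new(), ttl }
    }

    pub fn can_participate(&mut self, solver: &str, now: Instant) -> bool {
        // A still-valid deny-list entry short-circuits all validators.
        if let Some(expiry) = self.deny_list.get(solver).copied() {
            if now < expiry {
                return false;
            }
            self.deny_list.remove(solver);
        }
        // Run validators in order; the first rejection bans the solver.
        for validator in &self.validators {
            if !validator.is_allowed(solver) {
                self.deny_list.insert(solver.to_owned(), now + self.ttl);
                return false;
            }
        }
        true
    }
}
```

Because the loop returns on the first rejection, a later (more expensive) validator is never consulted for a solver that an earlier one already rejected.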
Metrics
Added a metric that the DB-based validator populates whenever a solver is marked as banned. The idea is to create an alert that fires if there are more than 4 such occurrences for the same solver within the last 30 minutes, meaning that disabling the solver should be considered.
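In practice such a rule would most likely live in the alerting system on top of the metric. Purely to illustrate the arithmetic of "more than 4 bans within 30 minutes for the same solver", here is a sliding-window sketch; `BanAlert` and `record_ban` are hypothetical names and not part of the PR.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical sliding-window counter of ban events per solver.
pub struct BanAlert {
    window: Duration,
    threshold: usize,
    bans: HashMap<String, VecDeque<Instant>>,
}

impl BanAlert {
    pub fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, bans: HashMap::new() }
    }

    /// Records a ban at `now` and returns `true` if the solver has more than
    /// `threshold` bans within the trailing `window`.
    pub fn record_ban(&mut self, solver: &str, now: Instant) -> bool {
        let events = self.bans.entry(solver.to_owned()).or_default();
        events.push_back(now);
        // Drop events that fell out of the window.
        while let Some(&oldest) = events.front() {
            if now.duration_since(oldest) > self.window {
                events.pop_front();
            } else {
                break;
            }
        }
        events.len() > self.threshold
    }
}
```

With `BanAlert::new(Duration::from_secs(30 * 60), 4)`, the fifth ban of the same solver inside half an hour is the first one for which `record_ban` returns `true`.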
Open discussions
- Since the current SQL query filters out auctions whose deadline has not been reached, the following case is possible: the solver gets banned while it still has a pending settlement. If that settlement then lands, the solver remains banned. While this is a niche case, it would be better to unblock the solver before the cache TTL expires. This has not been implemented in the current PR since it requires some refactoring of the Observer struct. If this is approved, it can be implemented quickly.
- Whether it makes sense to introduce a metrics-based strategy, similar to the bad token detector's, where the solver gets banned if >95% (or similar) of its settlements fail.
How to test
A new SQL query test. Existing e2e tests.
Related Issues
Fixes #3221