-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce client circuit breaker #115
base: master
Are you sure you want to change the base?
Conversation
5fabecb
to
15631bf
Compare
text/0115-circuit-breaker.md
Outdated
# Title | ||
|
||
- RFC PR: https://github.com/tikv/rfcs/pull/115 | ||
- Tracking Issue: https://github.com/tikv/repo/issues/0000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use this issue: tikv/pd#8678
Signed-off-by: artem_danilov <[email protected]>
300ef37
to
eef6b58
Compare
|
||
```go | ||
type Settings struct { | ||
Type string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding more details for Type
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Type
and Name
are ok to me and we can add some explanations in RFC.
* Default: 10 | ||
* Unit: seconds | ||
--- | ||
* `min_qps_to_open` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a config for it or is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "had only 1 request within error_rate_window", you do not have to enable the cb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually circuit breaker tracks the number of failures within error window. I found it problematic to set absolute value given that it could vary a lot depending on qps hence I chose to go with error rate instead. But in this case we do need a protection from low qps cases or time of the day hence I propose minQPS. If we want to limit number of easy configurations to one session variables, then maybe we should consider to set it in absolute number of errors within window.
* Unit: integer | ||
--- | ||
* `cooldown_interval` | ||
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I miss something? Why is it a range?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.
|
||
#### System variables | ||
|
||
`tidb_cb_pd_metadata_error_rate_threshold_pct`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems
pd_metadata
is difficult to understanderror_rate
is redundant with pct
How about renaming it to pd_client_cb_error_rate_threshold
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold
?
pct meant to inidicate that the value is integer representing percent and not float representing the ratio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rleungx Thanks for the review. I've asked some clarifying questions.
* Unit: integer | ||
--- | ||
* `cooldown_interval` | ||
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.
* Default: 10 | ||
* Unit: seconds | ||
--- | ||
* `min_qps_to_open` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.
|
||
#### System variables | ||
|
||
`tidb_cb_pd_metadata_error_rate_threshold_pct`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold
?
pct meant to inidicate that the value is integer representing percent and not float representing the ratio.
|
||
```go | ||
type Settings struct { | ||
Type string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name
?
All configs described above will be encapsulated in a `Settings` struct with ability to change error rate threshold dynamically: | ||
|
||
```go | ||
type Settings struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this configuration be exported for the user? And what is the relationship between this and 'system variables'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error rate threshold will be a system variable, and everything else is hardcoded or exposed in not documented toml client section.
func (cb *CircuitBreaker[T]) Execute(run func() (T, error, boolean)) (T, error) | ||
``` | ||
|
||
There is a third boolean parameter in the function argument above, in case the provided function doesn't return an error, but the caller still wants to treat it as a failure for the circuit breaker counting purposes (e.g. empty or corrupted result). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is up to the place where we wire the circuit breaker in. Boolean param allows to configure it. It make sense to start with timeouts, but having ability to tweak it in the API let us tune it further.
RFC for tikv/pd#8678
Rendered view: https://github.com/Tema/rfcs/blob/0115-circuit-breaker/text/0115-circuit-breaker.md