Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce client circuit breaker #115

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Tema
Copy link

@Tema Tema commented Oct 31, 2024

@Tema
Copy link
Author

Tema commented Oct 31, 2024

@rleungx @niubell PTAL

@Tema
Copy link
Author

Tema commented Oct 31, 2024

cc: @mittalrishabh @HaoW30

# Title

- RFC PR: https://github.com/tikv/rfcs/pull/115
- Tracking Issue: https://github.com/tikv/repo/issues/0000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this issue: tikv/pd#8678

Signed-off-by: artem_danilov <[email protected]>
@rleungx
Copy link
Member

rleungx commented Nov 5, 2024

@nolouch @okJiang Please also take a look.


```go
type Settings struct {
Type string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding more details for Type here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Type and Name are ok to me and we can add some explanations in RFC.

* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Copy link
Member

@rleungx rleungx Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a config for it or is it necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "had only 1 request within error_rate_window", you do not have to enable the cb?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually circuit breaker tracks the number of failures within error window. I found it problematic to set absolute value given that it could vary a lot depending on qps hence I chose to go with error rate instead. But in this case we do need a protection from low qps cases or time of the day hence I propose minQPS. If we want to limit number of easy configurations to one session variables, then maybe we should consider to set it in absolute number of errors within window.

* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval.
Copy link
Member

@okJiang okJiang Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I miss something? Why is it a range?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.


#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems

  1. pd_metadata is difficult to understand
  2. error_rate is redundant with pct

How about renaming it to pd_client_cb_error_rate_threshold?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold?

pct meant to inidicate that the value is integer representing percent and not float representing the ratio.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.

Copy link
Author

@Tema Tema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rleungx Thanks for the review. I've asked some clarifying questions.

* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.

* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.


#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold?

pct meant to inidicate that the value is integer representing percent and not float representing the ratio.


```go
type Settings struct {
Type string
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name?

All configs described above will be encapsulated in a `Settings` struct with ability to change error rate threshold dynamically:

```go
type Settings struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this configuration be exported for the user? And what is the relationship between this and 'system variables'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error rate threshold will be a system variable, and everything else is hardcoded or exposed in not documented toml client section.

func (cb *CircuitBreaker[T]) Execute(run func() (T, error, boolean)) (T, error)
```

There is a third boolean parameter in the function argument above, in case the provided function doesn't return an error, but the caller still wants to treat it as a failure for the circuit breaker counting purposes (e.g. empty or corrupted result).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is up to the place where we wire the circuit breaker in. Boolean param allows to configure it. It make sense to start with timeouts, but having ability to tweak it in the API let us tune it further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants