
Configurable error handling and retrying in HTTP target #330

Draft · @pondzix wants to merge 1 commit into develop

Conversation

@pondzix (Contributor) commented Jun 12, 2024:

Jira ref: PDP-1203

Drafting:

  • Flexible configuration allowing us to categorize HTTP responses and match them to specific retry strategies.
  • A very simple application health store as a global variable.
  • A very simple monitoring interface used to send alerts during retries in the HTTP target.

It doesn't compile yet! It's just to show what such a configuration could look like and to validate the approach.

In general, the goal is to make the default error handling and retries similar to what we have in our new streaming loaders.
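
To make the idea concrete before the diff excerpts and review comments below, here is a rough sketch of how such a configuration could look. Only the `http_status: ["503"]` / `strategy: "transient"` rule and the jq-style body matching come from the draft itself; the other attribute names, status codes, and the "setup" strategy name are placeholders for illustration.

target {
  use "http" {

    // responses matching these rules are treated as invalid data and are not retried
    invalid = [
      { http_status: ["400"] },
      { http_status: ["200"], body: { path: ".error_code", equals: "INVALID_EVENT" } }
    ]

    // responses matching these rules are retried using the named strategy
    retryable = [
      { http_status: ["503"], strategy: "transient" },
      { http_status: ["401"], strategy: "setup" }
    ]

    // anything else falls back to the current default retrying in `sourceWriteFunc`
  }
}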

target {
  use "http" {

    // by default no custom invalid/retryable rules so we use current default retrying strategy for target failures in `sourceWriteFunc` (in cli.go)
@colmsnowplow (Collaborator) commented:

Is the idea that we have retries in the target, but also retain the retry that wraps it in the sourceWriteFunc?

@pondzix (Author) replied:

Yes, that was the initial idea. Don't touch sourceWriteFunc and treat it as a fallback. Then I know I don't break anything for other targets.

But now, looking at other targets (e.g. Kinesis)... the clients that we use there already have retrying implemented under the hood, correct? So now that we have the HTTP target covered with retries, do we need this extra layer of retrying in sourceWriteFunc at all? We could simply make it explicit that retrying is the responsibility of a target.

That would be similar to what we have in the loaders.

@colmsnowplow (Collaborator) replied:

Well we will make sure we don't break anything for other targets regardless :D

My initial assumption about what we'd do here was that it should replace the way we currently do retries, but I hadn't really considered what you're suggesting here.

Thinking about it now, my initial thoughts are that having a retry wrapped in a retry makes it more complicated to follow - and one of the things I'm keen for us to improve about Snowbridge is that it's quite complicated to support, and there are lots of little details like this that are hard to follow.

I'm not sure about the Kinesis implementation, but the EventHubs target is a great example of this - there's a MaxAutoRetries config option. This is the max retry setting for the EH client itself. But that is also wrapped in a retry - the API is very confusing here. One would reasonably expect that setting to dictate how many retries are made in total.

But it's not - the actual max number of retries in total is 5x that setting. Why 5x? Because that's what we hardcoded into the app long before we built that target. How do we expect a user, or one of our own support team, to understand this when there's some issue? Well 🤷 - we do have it in the docs I believe, but good design is when the app does what someone would reasonably expect by intuition.

My gut says that there's a nice design to be found to make retries the responsibility of the target, like how you've described it. So if possible, one outcome of this spike could be to flesh that out and get a grip on how hard it is to do (and how we can do it with confidence that we don't break things). :)


    // not only http status but check if response body matches. Extract some part of body (e.g. error code) and check if it's equal/not equal to value.
    body {
      path: "some jq here??"
@colmsnowplow (Collaborator) commented:

Interesting way to solve the problem. Might well work. Might be suitable for other targets too - worth figuring that out.

    ]

    // For these retryable errors (assuming we have health check):
    // - set app status as unhealthy for each failed retry
@colmsnowplow (Collaborator) commented:

Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Would like to understand/think through this bit together. We have an app here that makes thousands of requests per second, and depending on the target, 10-15% of them might get rejected and retried.

If we make it configurable this might not be a problem, but I would like to understand it better regardless. :)

@pondzix (Author) replied:

> Do the streaming loaders have this concept of being set as unhealthy after a failed retry? Are there very few retries?

Yes, kind of. But in the loaders we do that for synchronous actions like init table, create columns, open channel. If it's a transient error or a setup error, we mark the app as unhealthy.

Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy. The other 85% are OK and they set it back to healthy.

Sooo we may have a situation where we flip the health toggle back and forth, and we are unlucky with the Kubernetes periodic liveness probe when it sees only the unhealthy state... and kills the pod, even though overall everything looks pretty good?

It would make sense to mark the app unhealthy only for those persistent setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup one. Or instead of a boolean use a threshold, like unhealthy_after with the number of retries after which we mark the app as unhealthy...?
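
Purely as an illustration, the two options could be expressed in the strategy configuration roughly like this (the `retry_strategies` block, `unhealthy_after`, `toggle_health`, and the numeric values are all hypothetical, not part of the draft):

retry_strategies {
  transient {
    max_attempts    = 5
    backoff_ms      = 1000
    unhealthy_after = 3     // threshold variant: mark the app unhealthy only after 3 failed attempts
  }
  setup {
    backoff_ms    = 20000   // no max_attempts: keep retrying, protected by the health check
    toggle_health = true    // boolean variant: every failed attempt marks the app unhealthy
  }
}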

@colmsnowplow (Collaborator) replied:

> Here I think it's a bit different. We can have a lot of concurrent writes to the HTTP target, right? So we may end up in the situation you described - 15% of requests are retried (some transient issue) and they mark the app as unhealthy. The other 85% are OK and they set it back to healthy.

Pretty much - what's more, we can expect to sometimes be in that state for a long time (e.g. when being rate limited on a spike of traffic), and we actually rely on getting failures in those scenarios. So something feels somewhat odd about considering individual failures to mean 'unhealthy' status.

> Sooo we may have a situation where we flip the health toggle back and forth, and we are unlucky with the Kubernetes periodic liveness probe when it sees only the unhealthy state... and kills the pod, even though overall everything looks pretty good?

I'm not sure how much of a real concern this scenario is, because we'd likely have many more successes than failures. But even still, something feels odd about the design in this scenario.

> It would make sense to mark the app unhealthy only for those persistent setup-like issues. I could add a configuration option like toggle_health to the retry strategy and set it to true for our default setup one. Or instead of a boolean use a threshold, like unhealthy_after with the number of retries after which we mark the app as unhealthy...?

If we're only concerned with setup issues, then one option is a global flag to mark whether any request has been successful - on any success we'd mark it as true; on failure we wouldn't mark it as false, but we would mark the app as unhealthy.

But I'm not sure about this - would you mind jotting down a quick synopsis of this scenario? (Maybe add a details section to the existing doc - but anywhere is fine.)

I feel like getting a grip on the desired behaviour for this type of app might lead us to a more confident solution. :)
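
A minimal sketch of that 'global success flag' idea as an application health store (hypothetical package and names; the exact liveness semantics here are one possible reading, not something agreed in this thread):

package health

import "sync/atomic"

var (
	everSucceeded atomic.Bool // set once any request has succeeded
	seenFailure   atomic.Bool // set whenever a request fails
)

// ReportSuccess records that at least one request has succeeded and clears the failure flag.
func ReportSuccess() {
	everSucceeded.Store(true)
	seenFailure.Store(false)
}

// ReportFailure records a failed request without touching everSucceeded.
func ReportFailure() {
	seenFailure.Store(true)
}

// IsHealthy would back the liveness endpoint: failures only count as unhealthy
// while nothing has ever succeeded, i.e. for setup-like issues.
func IsHealthy() bool {
	return everSucceeded.Load() || !seenFailure.Load()
}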

    // - if we run out of attempts => exit/crash/restart
    // - if we don't have max attempts (e.g. for setup errors), we don't get stuck retrying indefinitely, because we are protected by the health check. If the app is unhealthy for an extended period of time, it will eventually be killed by Kubernetes.
    retryable = [
      { http_status: ["503"], strategy: "transient" },
@colmsnowplow (Collaborator) commented Jun 13, 2024:

What does 'strategy' mean in this config? It's 'type of error to treat this as'. Got it.

@@ -49,6 +84,9 @@ type HTTPTargetConfig struct {
	OAuth2ClientSecret string `hcl:"oauth2_client_secret,optional" env:"TARGET_HTTP_OAUTH2_CLIENT_SECRET"`
	OAuth2RefreshToken string `hcl:"oauth2_refresh_token,optional" env:"TARGET_HTTP_OAUTH2_REFRESH_TOKEN"`
	OAuth2TokenURL     string `hcl:"oauth2_token_url,optional" env:"TARGET_HTTP_OAUTH2_TOKEN_URL"`

	// we have a flat config structure everywhere, so not sure if it's a good idea to add a struct here?
@colmsnowplow (Collaborator) commented:

No objection on principle. As long as the config itself is legible.
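
For reference, a nested block could look roughly like this with gohcl-style block tags (`ResponseRulesConfig`, `RuleConfig`, and the `response_rules` name are made up for the example, and this assumes the existing config decoder understands `block` tags; it also shifts the HCL from the list-of-objects syntax in the diff to repeated blocks):

// RuleConfig is one invalid/retryable matching rule.
type RuleConfig struct {
	HTTPStatus []string `hcl:"http_status,optional"`
	Strategy   string   `hcl:"strategy,optional"`
}

// ResponseRulesConfig groups the rules in a single nested block,
// so the rest of HTTPTargetConfig can stay flat.
type ResponseRulesConfig struct {
	Invalid   []RuleConfig `hcl:"invalid,block"`
	Retryable []RuleConfig `hcl:"retryable,block"`
}

type HTTPTargetConfig struct {
	// ...existing flat fields stay as they are...

	ResponseRules *ResponseRulesConfig `hcl:"response_rules,block"`
}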

return nil
}

if ht.isInvalid(response) {
@colmsnowplow (Collaborator) commented:

Oh this is very interesting. Configuring the flow of data around the app as well. Nice idea.
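
A minimal sketch of what that check might look like, matching only on status codes for brevity (`invalidRules` is an assumed field holding the configured rules; the body/jq matching from the config draft would slot in next to the status check):

import (
	"net/http"
	"strconv"
)

// isInvalid reports whether the response matches any configured "invalid" rule,
// in which case the message is routed to failure handling instead of being retried.
func (ht *HTTPTarget) isInvalid(response *http.Response) bool {
	for _, rule := range ht.invalidRules {
		for _, status := range rule.HTTPStatus {
			if strconv.Itoa(response.StatusCode) == status {
				return true
			}
		}
	}
	return false
}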

}

func (ht *HTTPTarget) retry(config RetryConfig, action func() (interface{}, error)) interface{} {
	// Should we use some third-party retrying library? Where we can configure more stuff.
@colmsnowplow (Collaborator) commented:

If we find a well-supported library that does it, let's look at that - especially if it can simplify our implementation.

@colmsnowplow (Collaborator) added:

We also have the option of adding what we need to this package.
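
As one data point, github.com/cenkalti/backoff/v4 is a widely used library for this; a sketch of the helper on top of it could look like the following (the RetryConfig fields, the monitoring/health hooks, and returning an error alongside the result are assumptions, not the draft's exact signature):

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// RetryConfig is assumed to carry per-strategy settings decoded from the HCL config.
type RetryConfig struct {
	MaxAttempts uint64
	InitialWait time.Duration
}

func (ht *HTTPTarget) retry(config RetryConfig, action func() (interface{}, error)) (interface{}, error) {
	var result interface{}

	policy := backoff.NewExponentialBackOff()
	policy.InitialInterval = config.InitialWait

	// notify fires before each wait between attempts - a natural hook for
	// sending a monitoring alert and updating the health status.
	notify := func(err error, wait time.Duration) {
		// e.g. ht.monitor.Alert(err); health.ReportFailure()
	}

	err := backoff.RetryNotify(func() error {
		r, err := action()
		if err != nil {
			return err
		}
		result = r
		return nil
	}, backoff.WithMaxRetries(policy, config.MaxAttempts), notify)

	return result, err
}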

@colmsnowplow (Collaborator) commented:
This is great @pondzix!

I left some thoughts - broadly love the ideas & direction here. Two things I'm thinking about on this design:

  • Can we use this or a similar mechanism for all the targets, and replace the retry in sourceWriteFunc? (And what does the design look like? If we can't, there are implications, etc.)

  • Can marking the app as unhealthy work similarly to the loaders for our use case? (Maybe for some cases and not others.)

I'm not sure I have a firm opinion on the second yet, just that I would be keen to understand it and think it through. :)
