Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate retrying of failed nodes by default. #13692

Open
isubasinghe opened this issue Oct 2, 2024 · 8 comments
Open

Deprecate retrying of failed nodes by default. #13692

isubasinghe opened this issue Oct 2, 2024 · 8 comments
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/feature Feature request type/tech-debt

Comments

@isubasinghe
Copy link
Member

Summary

Currently a user might have a continueOn: failed or depends: $TASKNAME.Failed and they might consider this node to have succeeded from their point of view (In the sense that the failure is no big deal or that it was expected).
Currently the retry logic attempts to reset these nodes, which is an introduction of policy that is forced upon the user.

We should in the very least be able to opt out of this behaviour.

Use Cases

This allows one to precisely retry a single node without worrying about a failed node also retrying.

@isubasinghe isubasinghe added type/feature Feature request type/tech-debt area/server and removed type/feature Feature request labels Oct 2, 2024
@agilgur5 agilgur5 added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/feature Feature request and removed area/server labels Oct 2, 2024
@agilgur5
Copy link
Member

agilgur5 commented Oct 2, 2024

Deprecate retrying of failed nodes by default.

Currently a user might have a continueOn: failed or depends: $TASKNAME.Failed and they might consider this node to have succeeded

A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?

The "all" scenario raises the question of "then what should the default behavior of retry be?"

This allows one to precisely retry a single node without worrying about a failed node also retrying.

This is already possible with --node-field-selector

@isubasinghe
Copy link
Member Author

A little confused by the title vs description here -- do you mean all failed nodes or only ones using the aforementioned features?

The current retry code attempts to retry all failed nodes.
This happens even when you specify with node-field-selector a completely different node.

@agilgur5
Copy link
Member

agilgur5 commented Oct 10, 2024

--node-field-selector not working correctly is a bug, not a feature 😅

If that's the main thing you were referencing, this sounds like a duplicate of #12543. You actually reviewed the PR for it there back in Feb when I asked someone with more knowledge of the retry logic to take a look, particularly because of potential breakage

@isubasinghe
Copy link
Member Author

Ah yes I remember this issue now, I don't think the existing behaviour is a bug by looking at the code anyway.
It seems very intentional that the author intended the failed nodes to be retried.

I rewrote the retry logic from scratch (because it had the wrong approach) and kept this behaviour.

I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.

I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.

@agilgur5
Copy link
Member

agilgur5 commented Oct 10, 2024

I rewrote the retry logic from scratch (because it had the wrong approach)

The author illuminated this to me in #12553 (comment) too and asked about a refactor in #12553 (comment).

Well that and the other bugs with it made me take a look and get really confused by the approach too

It seems very intentional that the author intended the failed nodes to be retried.

It seemed like --node-field-selector was only intended to be used with --restart-successful per your read, the diagrams linked above, and the original PR: #2431

I agree with other folks that the current behavior feels pretty confusing though.

I think regardless of whether its a bug or not doesn't matter, this retry behaviour has existed since 2017-2018 and effectively is a feature now.

I suggest we keep the existing beahviour in $NEW_VERSION and deprecate it, then remove it entirely in $NEW_VERSION + 1.

Yea that's more or less what I suggested in #12553 (comment) but minus a deprecation; i.e. a breaking fix in the same spirit as #11005

@agilgur5
Copy link
Member

I would say we should close this out as duplicate of #12543 and make a breaking fix to --node-field-selector if that's the main thing you were referencing (you didn't mention that in the issue description, so I'm not sure)

@isubasinghe
Copy link
Member Author

I disagree with a breaking fix here, we should do a deprecation.

I am sure there are some users that have now grown used to the retry button as a way to restart all failed nodes.

@agilgur5
Copy link
Member

agilgur5 commented Oct 10, 2024

To be clear, the breaking fix I'm advocating for would only impact --node-field-selector, so it would be quite similar to #11005 in that respect as well: in both cases, the breakage is only when a certain option is used, making that option's behavior "saner"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/feature Feature request type/tech-debt
Projects
None yet
Development

No branches or pull requests

2 participants