Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config to automatically Re-trigger failed periodics #358

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jmguzik
Copy link
Contributor

@jmguzik jmguzik commented Jan 14, 2025

This PR introduces a new configuration field (along with the functionality in horologium component) for ProwJobs to support automatically re-triggering periodic jobs based on a specified policy. The new retrigger-failed-run field allows retrying a failed periodic job with customizable settings for the number of attempts and the interval between retries.

Example usage:

retrigger-failed-run:
  until_success: true
  attempts: 3
  interval: 6h

until_success: true alows to stop if success state is reached. Set to false it triggers until attempts reached.
sloves #268

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jmguzik
Once this PR has been reviewed and has the lgtm label, please assign cjwagner for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 14, 2025
Copy link

netlify bot commented Jan 14, 2025

Deploy Preview for k8s-prow ready!

Name Link
🔨 Latest commit 62d9f69
🔍 Latest deploy log https://app.netlify.com/sites/k8s-prow/deploys/6793d1a1acbb9d000884fc23
😎 Deploy Preview https://deploy-preview-358--k8s-prow.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@jmguzik jmguzik force-pushed the re-trigger branch 3 times, most recently from 8d1e0bd to de145fc Compare January 14, 2025 23:10
cmd/horologium/main.go Outdated Show resolved Hide resolved
@Prucek
Copy link
Member

Prucek commented Jan 15, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@petr-muller
Copy link
Contributor

/cc

@petr-muller
Copy link
Contributor

/hold

Please provide a PR description of what the PR actually implements - the link to the issue is helpful but not sufficient, because even the issue contains some discussion about options for a vague idea, not a clear implementation direction.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
@jmguzik
Copy link
Contributor Author

jmguzik commented Jan 16, 2025

@petr-muller done

@jmguzik
Copy link
Contributor Author

jmguzik commented Jan 16, 2025

/hold cancel
then?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
@petr-muller
Copy link
Contributor

I'll review tomorrow 🙏

Copy link
Contributor

@petr-muller petr-muller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we need to be intentional about is the interaction between the next legit trigger and a retry, when both could occur, and also whether it is acceptable for retries to "push back" interval triggers.

As done here, next legit trigger takes precedence, basically interrupting the retry sequences, which may impact especially the stop on success = false case where we expect to deterministically get the configured amount of retries but we wont because the legit-triggerred job interrupts that.

That sounds acceptable to me but needs to be intentional, documented behavior.

pkg/config/jobs.go Outdated Show resolved Hide resolved
pkg/config/jobs.go Outdated Show resolved Hide resolved
pkg/config/jobs.go Show resolved Hide resolved
cmd/horologium/main_test.go Outdated Show resolved Hide resolved
cmd/horologium/main_test.go Show resolved Hide resolved
cmd/horologium/main.go Show resolved Hide resolved
cmd/horologium/main.go Outdated Show resolved Hide resolved
pkg/kube/prowjob.go Outdated Show resolved Hide resolved
pkg/config/jobs.go Outdated Show resolved Hide resolved
@jmguzik
Copy link
Contributor Author

jmguzik commented Jan 20, 2025

@petr-muller thanks for the review. I agree with most of your comments but I will clear them when we reach common understanding.

As noted in the comments, every job marked with the new fields (in the configuration) will have associated label. First run will have value 1. This is because I wanted to keep other periodics unaffected and I make distinction on this field weather to process periodic as normal or repeatable.

One thing we need to be intentional about is the interaction between the next legit trigger and a retry, when both could occur, and also whether it is acceptable for retries to "push back" interval triggers.

I think the way it is implemented, normal trigger (in the case of this implementation first run value 1) will always have precedence because as soon as shouldTrigger is detected to be true, the code will not follow new section.

As done here, next legit trigger takes precedence, basically interrupting the retry sequences, which may impact especially the stop on success = false case where we expect to deterministically get the configured amount of retries but we wont because the legit-triggerred job interrupts that.

That's true

That sounds acceptable to me but needs to be intentional, documented behavior.

Makes sense.

cmd/horologium/main.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants