add a method to easily re-run a job until success #6564

Open
grondo opened this issue Jan 19, 2025 · 4 comments

Comments

@grondo
Contributor

grondo commented Jan 19, 2025

As mentioned in #6555, a common use case is for a user to submit a job and have a way to rerun the same jobspec until the job succeeds. One method is to chain a series of afternotok jobs, but there should be a better way, perhaps wrapped up in a --resubmit or similar option?

The instance has the signed J, could a jobtap plugin somehow resubmit this on the user's behalf?

One potential benefit to the user of the chain of afternotok jobs is that the submit time of the dependent jobs matches the initial submission. This may give the jobs a priority bump relative to submitting a new job at the time of failure.

I don't have any answers currently, just jotting down some thoughts.
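
For reference, the afternotok chain described above could be scripted roughly as follows. This is only a sketch: `afternotok_chain` and the `submit` callable are hypothetical names, and `submit` is assumed to run the given `flux submit` command line and return the job ID it prints. The only Flux feature relied on is the existing `--dependency=afternotok:JOBID` option.

```python
from typing import Callable, List, Sequence


def afternotok_chain(
    command: Sequence[str],
    attempts: int,
    submit: Callable[[List[str]], str],
) -> List[str]:
    """Submit `command` once, then up to attempts-1 fallback jobs.

    Each fallback job depends on the previous job via afternotok, so it
    only becomes runnable if the previous job did not succeed.
    `submit` is a hypothetical helper: it receives the full argv and
    returns the job ID of the submitted job.
    """
    jobids: List[str] = []
    for _ in range(attempts):
        argv = ["flux", "submit"]
        if jobids:
            # Chain onto the most recently submitted job.
            argv.append(f"--dependency=afternotok:{jobids[-1]}")
        argv.extend(command)
        jobids.append(submit(argv))
    return jobids
```

All jobs in the chain are submitted up front, which is why their submit times match the initial submission. The number of attempts has to be chosen in advance, which is part of what makes this approach awkward.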

@garlick
Member

garlick commented Jan 19, 2025

Seems like we'd want to qualify that with "runs until it doesn't get a fatal exception" or something like that?

I wonder what would break if a jobtap plugin backdated the resubmission in order to not lose out on any priority boost from aging?

@grondo
Contributor Author

grondo commented Jan 19, 2025

> Seems like we'd want to qualify that with "runs until it doesn't get a fatal exception" or something like that?

Ah, perhaps something to clarify with the user. I assumed they'd want to resubmit on nonzero exit status as well, but perhaps that isn't a correct assumption. (Maybe a flag could select one or the other based on user choice.)

Edit: afternotok, which they're using now, does not differentiate between a fatal exception vs nonzero exit status from the job.

@garlick
Member

garlick commented Jan 19, 2025

I guess I was just thinking probably the main reasons to resubmit were things in the environment that went wrong, like a bad node, and that running the same exact thing again was likely to produce the same failure otherwise. But yeah, perhaps something to clarify with the user. Maybe could support either (afternotok vs afterexcepti?)
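
To make the two behaviors under discussion concrete, here is a small sketch of the resubmit decision. Everything in it is hypothetical (`RetryOn`, `Outcome`, `should_resubmit` are illustration names, not existing Flux interfaces); it only encodes the distinction between "retry on any failure", which is what afternotok gives today, and "retry only on a fatal exception".

```python
from dataclasses import dataclass
from enum import Enum


class RetryOn(Enum):
    # Hypothetical policy names, for illustration only.
    ANY_FAILURE = "any-failure"        # fatal exception or nonzero exit
    EXCEPTION_ONLY = "exception-only"  # fatal exception only


@dataclass(frozen=True)
class Outcome:
    fatal_exception: bool  # job ended with a fatal exception
    exit_status: int       # exit status reported for the job


def should_resubmit(outcome: Outcome, policy: RetryOn) -> bool:
    """Return True if the job should be resubmitted under `policy`."""
    if outcome.fatal_exception:
        # Both policies retry after a fatal exception (e.g. a bad node).
        return True
    if policy is RetryOn.ANY_FAILURE:
        return outcome.exit_status != 0
    # EXCEPTION_ONLY: a plain nonzero exit would likely just repeat.
    return False
```

With `EXCEPTION_ONLY`, a job that exits nonzero on its own is not retried, on the reasoning that rerunning the exact same thing would probably fail the same way.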

@grondo
Contributor Author

grondo commented Jan 19, 2025

> I guess I was just thinking probably the main reasons to resubmit were things in the environment that went wrong, like a bad node, and that running the same exact thing again was likely to produce the same failure otherwise.

Oh sorry, I got that and should have said as much 🤦

One way to handle this now would be to submit the job as a batch job with a script that runs a job in a loop until success. The drawback is that this requires allocating a few extra nodes in case of node failure. The benefit is that a retry can start immediately, up until the batch job hits its time limit.

One thing we don't handle well now is that if enough nodes go down in a batch instance that the job can't run, it will be stuck in SCHED until the time limit, instead of aborting the script in some way. (I thought an issue was open on this but I can't find it.)

Anyway, not sure if that is the best solution long-term. (It also doesn't handle resubmit after timeout, obviously.)
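
A rough sketch of the loop-until-success batch script mentioned above, written in Python. `run_until_success` and its parameters are hypothetical. Inside a batch job, `argv` would be something like `["flux", "run", "-N4", "./app"]`. The attempt cap and time budget are assumptions added so the loop cannot spin forever.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_until_success(
    argv: Sequence[str],
    max_attempts: int,
    budget_s: float,
    run: Callable = subprocess.run,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Rerun `argv` until it exits 0.

    Returns the 1-based attempt number that succeeded, or None if the
    attempt cap or the time budget was exhausted first. The budget is
    only checked between attempts; a running attempt is not interrupted.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        if clock() - start >= budget_s:
            break  # out of time, do not start another attempt
        if run(argv).returncode == 0:
            return attempt
    return None
```

As noted, this does not help when the batch job itself times out, and an attempt that sits in SCHED because too many nodes went down would block the loop rather than fail it.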
