add a method to easily re-run a job until success #6564
Comments
Seems like we'd want to qualify that with "runs until it doesn't get a fatal exception" or something like that? I wonder what would break if a jobtap plugin backdated the resubmission in order to not lose out on any priority boost from aging?
Ah, perhaps something to clarify with the user. I assumed they'd want to resubmit on nonzero exit status as well, but perhaps that isn't a correct assumption. (Maybe a flag to do one or the other, based on user choice.)
I guess I was just thinking that the main reasons to resubmit were probably things in the environment that went wrong, like a bad node, and that otherwise running the exact same thing again was likely to produce the same failure. But yeah, perhaps something to clarify with the user. Maybe we could support either (afternotok vs afterexcepti?)
Oh sorry, I got that and should have said as much 🤦 One way to handle this now would be to submit the job as a batch job with a script that runs the job in a loop until success. This would require allocating a few extra nodes in case of node failure, so there's a drawback there, with the benefit that the job retry can occur immediately, up until the batch job hits its time limit. One thing we don't handle well now is that if enough nodes go down in a batch instance that the job can't run, it will be stuck in SCHED until the time limit, instead of aborting the script in some way. (I thought an issue was open on this, but I can't find it.) Anyway, not sure if that is the best solution long-term. (It also doesn't handle resubmit after timeout, obviously.)
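For illustration, the "run in a loop until success" script could look roughly like the sketch below. This is only a sketch under assumptions: the function name `run_until_success`, the attempt cap, and the delay are made up for the example, and the command passed in is whatever the batch script would normally launch.

```python
import subprocess
import sys
import time


def run_until_success(argv, max_attempts=5, delay=0.0):
    """Run argv repeatedly until it exits 0.

    Returns the number of attempts used. Raises RuntimeError if all
    max_attempts attempts exit nonzero. (Hypothetical helper, not a Flux API.)
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(argv)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with {result.returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            time.sleep(delay)  # optional pause before retrying
    raise RuntimeError(f"{argv!r} failed {max_attempts} times")
```

Inside a batch job the call might be something like `run_until_success(["flux", "run", "-N4", "./app"])`, where `./app` and the node count are placeholders. Note this retries on any nonzero exit, and it does nothing about the stuck-in-SCHED case or the batch job's own time limit mentioned above.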
As mentioned in #6555, a common use case is for a user to submit a job and have a way to rerun the same jobspec until the job succeeds. One method is to chain a series of afternotok jobs, but there should be a better way, perhaps wrapped up in a --resubmit or similar option?
The instance has the signed J, could a jobtap plugin somehow resubmit this on the user's behalf?
One potential benefit to the user of using the chain of afternotok jobs is that the submit time of the dependent jobs matches the initial submission. This may give the jobs a priority bump relative to submitting a job at the time of failure.
I don't have any answers currently, just jotting down some thoughts.
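To make the afternotok chain concrete, here is a rough sketch of how a user might script it today. It assumes `flux submit` is on PATH, prints the new job ID on stdout, and accepts `--dependency=afternotok:JOBID`; the helper names `flux_submit` and `submit_retry_chain` are invented for the example. How the remainder of the chain behaves once one job succeeds is not addressed here.

```python
import subprocess


def flux_submit(argv):
    """Call `flux submit` with argv and return the job ID it prints.

    Assumes the flux CLI is available on PATH.
    """
    out = subprocess.run(["flux", "submit", *argv],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()


def submit_retry_chain(command, retries, submit=flux_submit):
    """Submit command once, then `retries` more copies up front.

    Each copy depends on the previous job via afternotok, so all jobs
    share roughly the same submit time. Returns the job IDs in order.
    (Hypothetical helper, not part of Flux.)
    """
    jobids = [submit(list(command))]
    for _ in range(retries):
        dep = f"--dependency=afternotok:{jobids[-1]}"
        jobids.append(submit([dep, *command]))
    return jobids
```

For example, `submit_retry_chain(["./app"], 3)` would submit `./app` (a placeholder) plus three dependent retries. The `submit` parameter is only there so the chaining logic can be exercised without a running Flux instance.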