add a method to easily re-run a job until success #6564

grondo · 2025-01-19T18:06:46Z

As mentioned in #6555, a common use case is for a user to submit a job and have a way to rerun the same jobspec until the job succeeds. One method is to chain a series of afternotok jobs, but there should be a better way, perhaps wrapped up in a --resubmit or similar option?
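For reference, the chaining workaround mentioned above could be scripted roughly as follows. This is only a sketch: it assumes `flux submit` prints the new job ID on stdout and accepts `--dependency=afternotok:JOBID`. The helper names `flux_submit` and `submit_retry_chain` are made up for illustration and are not part of flux-core.

```python
import subprocess
from typing import Callable, List, Sequence


def flux_submit(argv: Sequence[str]) -> str:
    """Run a `flux submit ...` command line and return the job ID it prints.

    Assumption: the job ID is the only thing written to stdout.
    """
    out = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return out.stdout.strip()


def submit_retry_chain(
    command: Sequence[str],
    retries: int,
    submit: Callable[[Sequence[str]], str] = flux_submit,
) -> List[str]:
    """Submit `command` once, then `retries` more copies that each depend on
    the previous job through afternotok. Returns job IDs in submission order.

    `submit` is injectable so the chain logic can be exercised without Flux.
    """
    jobids = [submit(["flux", "submit", *command])]
    for _ in range(retries):
        # Each retry only becomes runnable if the job before it did not succeed.
        dep = f"--dependency=afternotok:{jobids[-1]}"
        jobids.append(submit(["flux", "submit", dep, *command]))
    return jobids
```

All jobs in the chain are submitted up front, which is why their submit times sit close to the initial submission. The number of retries has to be picked in advance, which is part of what a `--resubmit`-style option would avoid.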

The instance has the signed J. Could a jobtap plugin somehow resubmit this on the user's behalf?

One potential benefit to the user of using the chain of afternotok jobs is that the submit time of the dependent jobs matches the initial submission. This may give the jobs a priority bump relative to submitting a job at the time of failure.

I don't have any answers currently, just jotting down some thoughts.


garlick · 2025-01-19T19:15:28Z

Seems like we'd want to qualify that with "runs until it doesn't get a fatal exception" or something like that?

I wonder what would break if a jobtap plugin backdated the resubmission in order to not lose out on any priority boost from aging?

grondo · 2025-01-19T19:41:22Z

> Seems like we'd want to qualify that with "runs until it doesn't get a fatal exception" or something like that?

Ah, perhaps something to clarify with the user. I assumed they'd want to resubmit on nonzero exit status as well, but perhaps that isn't a correct assumption. (maybe a flag to do one or the other based on user choice)

Edit: afternotok, which they're using now, does not differentiate between a fatal exception vs nonzero exit status from the job.

garlick · 2025-01-19T23:18:22Z

I guess I was just thinking probably the main reasons to resubmit were things in the environment that went wrong, like a bad node, and that running the same exact thing again was likely to produce the same failure otherwise. But yeah, perhaps something to clarify with the user. Maybe could support either (afternotok vs afterexcepti?)

grondo · 2025-01-19T23:39:02Z

> I guess I was just thinking probably the main reasons to resubmit were things in the environment that went wrong, like a bad node, and that running the same exact thing again was likely to produce the same failure otherwise.

Oh sorry, I got that and should have said as much 🤦

One way to handle this now would be to submit the job as a batch job with a script that runs a job in a loop until success. This would require allocating a few extra nodes in case of node failure, so there's a drawback there, with the benefit that the job retry can occur immediately up until the batch job hits a time limit.
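A minimal sketch of what such a batch script's retry loop might look like, written in Python. The names `run_once` and `run_until_success` are hypothetical, and the loop treats any nonzero exit code as a failure worth retrying, which may or may not match what the user wants.

```python
import subprocess
from typing import Callable, Sequence


def run_once(argv: Sequence[str]) -> int:
    """Run one attempt and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_until_success(
    argv: Sequence[str],
    max_attempts: int,
    run: Callable[[Sequence[str]], int] = run_once,
) -> int:
    """Rerun `argv` until it exits 0 or `max_attempts` is reached.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. `run` is injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        if run(argv) == 0:
            return attempt
    return 0
```

Inside the batch job this might be called as `run_until_success(["flux", "run", "-N4", "./app"], max_attempts=10)`, where the node count and `./app` are placeholders. The batch allocation would need to be larger than `-N4` for a retry to survive a node failure, as noted above.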

One thing we don't handle well now: if enough nodes go down in a batch instance that the job can't run, the job will be stuck in SCHED until the time limit instead of the script aborting in some way. (I thought an issue was open on this but I can't find it.)

Anyway, I'm not sure that is the best solution long-term. (It also doesn't handle resubmit after a timeout, obviously.)

grondo mentioned this issue Jan 21, 2025

support afterexc dependency scheme #6566

Merged
