Original report by Scott Field (Bitbucket: sfield83, ).
Hi Tom,
Sometimes I'll need to wait a few minutes for my job to start (using the very awesome feature of remotely starting jobs from a batch submission system). If the kernel dies, a brand-new job is submitted upon restart. This leaves two jobs sitting in the queue.
So far, I've only had this problem on PBS systems.
Best,
Scott
Original comment by Tom Daff (Bitbucket: tdaff, GitHub: tdaff).
Hi Scott,
Thanks for the report, and I'm happy that you are still finding the code useful :)
I'm still thinking about how to deal with this. I think the main issue is that the kernel is run in a subprocess, and upon restart that subprocess is killed completely and a new one starts. The original PBS job probably lingers until it times out. Does the job eventually go away by itself (after maybe 10 minutes)?
Are you wondering whether it is possible to re-use a job for subsequent kernels? That might be possible, but it would need significant re-engineering to persist an active connection between different Python processes. I know it is annoying when jobs take a while in the queue, though.
Original comment by Scott Field (Bitbucket: sfield83, ).
Hi Tom,
I've never allowed the rogue job to linger for long, so I'm not really sure. But would the queuing system even be alerted to the fact that the kernel has died? The job might just sit in the queue until it starts running, and then run to completion (i.e., nothing happens until the requested wall time is exhausted).
Anyway, it's a very minor issue. It would be very nice to re-use the job for the restarted kernel. Or, for a full cleanup, the main process could call the system's qdel command immediately after the kernel subprocess is killed.
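The qdel-on-cleanup idea above could be sketched roughly like this. This is not the project's actual code — the `parse_pbs_job_id` and `PBSJob` names and the `atexit`-based cleanup are my own illustrative assumptions — but it shows the shape of the fix: remember the PBS job id at submission time, and delete the queued job when the launching process shuts the kernel down.

```python
import atexit
import re
import subprocess


def parse_pbs_job_id(qsub_output):
    """Extract the job id from qsub's stdout (e.g. '12345.pbsserver')."""
    match = re.match(r"(\d+\S*)", qsub_output.strip())
    return match.group(1) if match else None


class PBSJob:
    """Track a submitted PBS job and delete it when the launcher exits.

    Hypothetical helper: the launching (parent) process would create one
    of these right after qsub, so a kernel restart cannot leave the old
    job sitting in the queue.
    """

    def __init__(self, job_id, qdel="qdel"):
        self.job_id = job_id
        self.qdel = qdel
        # Run cleanup when the launching process exits normally; this
        # will NOT fire if the process is killed with SIGKILL.
        atexit.register(self.cancel)

    def cancel(self):
        """Remove the queued/running job; safe to call more than once."""
        if self.job_id is None:
            return
        # Ignore failures: the job may have finished or been deleted already.
        subprocess.call([self.qdel, self.job_id])
        self.job_id = None
```

The same `cancel()` could equally be called explicitly right after the kernel subprocess is reaped, which matches the suggestion above more directly than relying on interpreter exit.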