Dead kernel submits new job to queue on restart #12

Open
tdaff opened this issue Dec 3, 2015 · 2 comments

Comments

@tdaff
Owner

tdaff commented Dec 3, 2015

Original report by Scott Field (Bitbucket: sfield83).


Hi Tom,

Sometimes I'll need to wait a few minutes for my job to start (using the very awesome feature of remotely starting jobs from a batch submission system). If the kernel dies, a brand new job is submitted upon restart. This results in two jobs sitting in the queue.

So far, I've only had a problem on PBS systems.

Best,
Scott

@tdaff
Owner Author

tdaff commented Dec 11, 2015

Original comment by Tom Daff (Bitbucket: tdaff, GitHub: tdaff).


Hi Scott,

Thanks for the report, and I'm happy that you are still finding the code useful :)

I'm still thinking about how to deal with this. I think the main issue is that the kernel runs in a subprocess; upon restart the subprocess gets killed completely and a new one starts. The original PBS job probably lingers around until it times out. Does the job go away by itself eventually (after maybe 10 minutes)?

Are you wondering whether it is possible to re-use a job for subsequent kernels? That might be possible, but it would need significant re-engineering to persist an active connection between different Python processes. I know it is annoying when jobs take a while in the queue, though.
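Persisting the connection between Python processes, as described above, would roughly amount to writing the queued job's details somewhere a restarted process can find them. A minimal sketch of that idea (hypothetical helper functions and file layout, not the project's actual API):

```python
import json
import os


def save_job_state(job_id, connection_file, path):
    """Persist the batch job id and the kernel's connection file
    location so a restarted process could reattach to the existing
    job instead of submitting a new one."""
    with open(path, "w") as fh:
        json.dump({"job_id": job_id, "connection_file": connection_file}, fh)


def load_job_state(path):
    """Return previously saved job state, or None if nothing was saved."""
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return json.load(fh)
```

A restarting kernel manager could call `load_job_state` first and only submit to the queue when it returns None (or when the saved job turns out to be gone).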

@tdaff
Owner Author

tdaff commented Dec 11, 2015

Original comment by Scott Field (Bitbucket: sfield83).


Hi Tom,

I've never allowed the rogue job to linger too long, so I'm not really sure. But would the queuing system even be alerted to the fact that the kernel has died? The job might just sit in the queue until it starts running, and then run to completion (i.e. nothing happens until the requested wall time is exhausted).

Anyway, it's a very minor issue. It would be very nice to re-use the job for the restarted kernel. Or, for a full cleanup, the main process could call the system's qdel command immediately after the kernel subprocess is killed.

Scott
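The cleanup Scott suggests could be done with an exit hook in the main process: register a callback that runs qdel (PBS's job-deletion command) when the process shuts down. A minimal sketch, assuming the job id is tracked somewhere (the `qdel_cmd` parameter is only there so the hook can be exercised without a real PBS installation):

```python
import atexit
import subprocess


def schedule_cleanup(job_id, qdel_cmd="qdel"):
    """Register a cleanup hook that deletes the queued job when this
    process exits, so a dead kernel doesn't leave a job in the queue.

    Returns the callback so it can also be invoked directly, e.g. right
    after the kernel subprocess is killed.
    """
    def _cleanup():
        # Ignore the exit status: the job may already have finished
        # or been removed by the time this runs.
        return subprocess.call([qdel_cmd, str(job_id)])

    atexit.register(_cleanup)
    return _cleanup
```

This only covers clean interpreter exits; if the main process is killed with SIGKILL the atexit hook never runs, so the qdel call would ideally also be made explicitly in the kernel-restart path.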
