Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add worker heartbeats #2

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

WIP: Add worker heartbeats #2

wants to merge 1 commit into from

Conversation

sd2k
Copy link

@sd2k sd2k commented Nov 30, 2018

Right now if a worker process dies (e.g. due to a network partition, host failure, OOM error) we have to wait until the overall job timeout is hit before the job is marked as failed/lost. This is problematic when we have long running jobs with large timeouts, because it could mean waiting hours for a job to be restarted.

Adding proper worker heartbeats is a standard way to solve this (e.g. in rq, faktory, and others) but that's tricky in rjq because 'workers' aren't such a first class concept - they're not registered or monitored. In the absence of that, I've implemented a simple version which just reduces the expiry time of the job in Redis significantly, then resets it on every heartbeat until the job is completed, failed, or times out, in which case the expiry time is updated to expiry as before.

The bad news here is that if a worker does die unexpectedly, the key in Redis is simply lost, rather than remaining and being marked as LOST. I'm not sure there's a way around that without adding a more concrete 'worker' abstraction (which might be worth doing, but will increase complexity).

@sd2k sd2k requested a review from lbolla November 30, 2018 09:51
@sd2k
Copy link
Author

sd2k commented Nov 30, 2018

I've marked this as WIP because it needs tests and might not be the best implementation.

@lbolla
Copy link

lbolla commented Nov 30, 2018

It looks good to me. I don't have a deep enough knowledge of rjq to know if this is the best implementation, though. I am happy to give it a try and see how it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants