Right now, if a worker process dies (e.g. due to a network partition, host failure, or OOM error), we have to wait until the overall job timeout is hit before the job is marked as failed/lost. This is a problem for long-running jobs with large timeouts, because it can mean waiting hours for a job to be restarted.
Adding proper worker heartbeats is a standard way to solve this (e.g. in rq, faktory, and others), but that's tricky in rjq because 'workers' aren't a first-class concept - they're not registered or monitored. In the absence of that, I've implemented a simple version: while a job is running, its key in Redis gets a much shorter expiry time, which is reset on every heartbeat. Once the job completes, fails, or times out, the full expiry time is restored as before.
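To make the mechanism concrete, here's a rough sketch of the pattern (hypothetical names throughout - `run_with_heartbeat`, the TTL constants, and the key format aren't rjq's actual API). A tiny in-memory stand-in replaces a real Redis client so the flow is runnable without a server; a real worker would call `EXPIRE` via a Redis client instead.

```python
import threading

HEARTBEAT_TTL = 5   # short TTL while the job is running (seconds) - assumed value
FINAL_TTL = 3600    # full expiry restored once the job finishes - assumed value

class FakeRedis:
    """Minimal stand-in that records EXPIRE calls; stands in for a real client."""
    def __init__(self):
        self.ttls = {}

    def expire(self, key, ttl):
        self.ttls[key] = ttl

def run_with_heartbeat(redis, key, job, interval=1.0):
    """Shrink the job key's TTL, refresh it on every heartbeat, then restore it."""
    stop = threading.Event()

    def beat():
        # Keep resetting the short TTL until the job is done. If this worker
        # dies, the resets stop and the key expires after HEARTBEAT_TTL.
        while not stop.is_set():
            redis.expire(key, HEARTBEAT_TTL)
            stop.wait(interval)

    redis.expire(key, HEARTBEAT_TTL)  # shrink expiry before the job starts
    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        return job()
    finally:
        stop.set()
        t.join()
        redis.expire(key, FINAL_TTL)  # job finished: restore the full expiry

r = FakeRedis()
result = run_with_heartbeat(r, "rjq:job:42", lambda: sum(range(10)), interval=0.1)
print(result, r.ttls["rjq:job:42"])  # 45 3600
```

The upside of this approach is that it needs no new state in Redis: the key's own TTL doubles as the liveness signal.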
The bad news is that if a worker does die unexpectedly, the key in Redis is simply lost when its TTL expires, rather than remaining and being marked as LOST. I'm not sure there's a way around that without adding a more concrete 'worker' abstraction (which might be worth doing, but would increase complexity).