Right now, if a worker process dies (e.g. due to a network partition, host failure, or OOM error), we have to wait until the overall job timeout is hit before the job is marked as failed/lost. This is a problem for long-running jobs with large timeouts, because it can mean waiting hours for a job to be restarted.
Adding proper worker heartbeats is a standard way to solve this (e.g. in rq, faktory, and others), but that's tricky in rjq because 'workers' aren't a first-class concept - they're not registered or monitored. In the absence of that, I've implemented a simple version: while a job is running, its key in Redis gets a much shorter expiry time, which is reset on every heartbeat. Once the job completes, fails, or times out, the full expiry time is restored as before.
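To make the mechanism concrete, here's a rough sketch of the pattern (hypothetical names throughout - `run_with_heartbeat`, the TTL constants, and the key format aren't rjq's actual API). A tiny in-memory stand-in replaces a real Redis client so the flow is runnable without a server; a real worker would call `EXPIRE` via a Redis client instead.

```python
import threading

HEARTBEAT_TTL = 5   # short TTL while the job is running (seconds) - assumed value
FINAL_TTL = 3600    # full expiry restored once the job finishes - assumed value

class FakeRedis:
    """Minimal stand-in that records EXPIRE calls; stands in for a real client."""
    def __init__(self):
        self.ttls = {}

    def expire(self, key, ttl):
        self.ttls[key] = ttl

def run_with_heartbeat(redis, key, job, interval=1.0):
    """Shrink the job key's TTL, refresh it on every heartbeat, then restore it."""
    stop = threading.Event()

    def beat():
        # Keep resetting the short TTL until the job is done. If this worker
        # dies, the resets stop and the key expires after HEARTBEAT_TTL.
        while not stop.is_set():
            redis.expire(key, HEARTBEAT_TTL)
            stop.wait(interval)

    redis.expire(key, HEARTBEAT_TTL)  # shrink expiry before the job starts
    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        return job()
    finally:
        stop.set()
        t.join()
        redis.expire(key, FINAL_TTL)  # job finished: restore the full expiry

r = FakeRedis()
result = run_with_heartbeat(r, "rjq:job:42", lambda: sum(range(10)), interval=0.1)
print(result, r.ttls["rjq:job:42"])  # 45 3600
```

The upside of this approach is that it needs no new state in Redis: the key's own TTL doubles as the liveness signal.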
The bad news is that if a worker does die unexpectedly, the key in Redis is simply lost when its TTL expires, rather than remaining and being marked as LOST. I'm not sure there's a way around that without adding a more concrete 'worker' abstraction (which might be worth doing, but would increase complexity).