In short: circusd version 0.17.1 becomes unresponsive when an application's configuration defines an invalid cmd and also sets max_retry to -1.
What does happen
When starting circusd with the above situation present:
circusd starts outputting warnings at an alarmingly high rate: circus[1] [WARNING] error in 'app3': [Errno 2] No such file or directory: '/srv/mine/app3-broken': '/srv/mine/app3-broken'
circusd becomes unresponsive to queries from circusctl
other configured applications that were not yet started do not get started
What I'd like to happen instead
circusd stays responsive to queries from circusctl
circusd keeps trying to start my app with the watcher, but with a small configurable delay, so it doesn't get overloaded.
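To illustrate the behaviour requested above, here is a minimal Python sketch of a start loop that waits between failed attempts. It is not circus code: the function name spawn_with_delay and the parameters retry_delay and sleep are hypothetical, and only the meaning of max_retry = -1 (retry indefinitely) is taken from this issue. In an event-loop based daemon the wait would have to be non-blocking so that control requests can still be served in between attempts.

```python
import subprocess
import time


def spawn_with_delay(cmd, retry_delay=1.0, max_retry=-1, sleep=time.sleep):
    """Try to start cmd; after a failed spawn, wait retry_delay seconds and retry.

    max_retry=-1 means "retry indefinitely", as described in the issue.
    Returns the Popen object on success, or None once max_retry is exhausted.
    All names here are illustrative assumptions, not part of the circus API.
    """
    attempts = 0
    while max_retry < 0 or attempts < max_retry:
        try:
            return subprocess.Popen(cmd)
        except OSError as exc:  # e.g. [Errno 2] No such file or directory
            attempts += 1
            print(f"error starting {cmd[0]!r}: {exc}; retrying in {retry_delay}s")
            # The pause is what keeps a broken cmd from producing a tight loop.
            sleep(retry_delay)
    return None
```

With a broken cmd and max_retry=-1 this loop still never returns, but it produces at most one warning per retry_delay seconds instead of flooding the log.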
What about setting max_retry to something other than -1?
A possible workaround is to set max_retry to (say) 5. This stops circusd from becoming unresponsive, but it also means that a process with a correct cmd is no longer restarted once it has stopped 5 times in a row for other reasons.
Reproducing the issue
I have created a proof of concept for the issue so it can be easily reproduced. Instructions on running it are in the README. I hope this helps.
Apart from the general question of “Should circus be ‘taken hostage’ by faulty commands?”: Would the Flapping Plugin solve your issue?
Thank you for your reply.
In my proof of concept I have already made use of the flapping plugin, but this doesn't solve the issue of circusd becoming unavailable. The config related to the flapping plugin looks like this:
[plugin:flapping]
use = circus.plugins.flapping.Flapping
# the number of times a process can restart, within window seconds, before we consider it flapping (default: 2)
attempts = 2
# the time window in seconds to test for flapping. If the process restarts more than attempts times within this time window, we consider it a flapping process. (default: 1)
window = 60
# the number of times we attempt to start a process that has been flapping, before we abandon and stop the whole watcher. (default: 5) Set to -1 to disable max_retry and retry indefinitely.
max_retry = -1
# time in seconds to wait until we try to start again a process that has been flapping. (default: 7)
retry_in = 7
If there is something I can change in this configuration to prevent circusd from becoming unresponsive, please tell me :)
I understand your "should not be taken hostage" feeling, but for us it does happen. For example, we have lots of Java applications, and sometimes we accidentally deploy an application to a server that doesn't yet have the Java runtime it references, which then causes mayhem for all circusd-managed processes on that server. Such mistakes are hard to prevent, and they are usually repaired easily and quickly, but in the meantime none of the other circusd-managed services are managed any more (and circusd requires a restart, not a reload).