Circusd unresponsive when cmd is wrong and max_retry = -1 #1157

Open
jeroenhe opened this issue Mar 10, 2021 · 2 comments

Comments

@jeroenhe

Problem description

In short: circusd version 0.17.1 becomes unresponsive when a cmd is wrongly defined in an application's configuration and max_retry is also set to -1.

What happens
When starting circusd with the above situation present:

  • circusd starts outputting warnings at an alarmingly high rate: circus[1] [WARNING] error in 'app3': [Errno 2] No such file or directory: '/srv/mine/app3-broken': '/srv/mine/app3-broken'
  • circusd becomes unresponsive to queries from circusctl
  • other configured applications that were not yet started won't get started
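
For context on the warning above, here is a minimal Python sketch of why a wrong cmd can fail so quickly. It is not circus code; `try_spawn` is a hypothetical helper, and only the path is taken from the log line. Starting an executable that does not exist raises an error with errno 2 immediately, so nothing slows down a loop that simply retries.

```python
import errno
import subprocess


def try_spawn(cmd):
    """Try to start cmd; return (process, None) on success or (None, error) on failure."""
    try:
        return subprocess.Popen(cmd), None
    except OSError as exc:
        # A missing executable surfaces here as FileNotFoundError (errno 2, ENOENT).
        return None, exc


# Path taken from the warning above; it is assumed not to exist on this machine.
proc, err = try_spawn(["/srv/mine/app3-broken"])
if err is not None:
    print(f"spawn failed: errno={err.errno} ({errno.errorcode.get(err.errno)})")
```

Because the failure returns almost instantly, retrying without any pause can keep a single-threaded daemon busy, which would match the symptoms listed above.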

What I'd like to happen instead

  • circusd stays responsive to queries from circusctl
  • circusd keeps trying to start my app with the watcher, but with a small configurable delay, so it doesn't get overloaded.
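
To illustrate the behaviour I am asking for, here is a rough sketch in plain asyncio. It is not how circus is implemented, and `start_with_retry`, `retry_delay`, and `flaky_spawn` are names made up for this example. The point is that the pause between attempts hands control back to the event loop, so other work (such as answering circusctl) could continue while a broken command is being retried.

```python
import asyncio


async def start_with_retry(spawn, retry_delay=1.0, max_retry=-1):
    """Call spawn() until it succeeds, pausing retry_delay seconds after each failure.

    max_retry = -1 means retry forever. Otherwise the last OSError is
    re-raised once max_retry attempts have failed.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return spawn()
        except OSError:
            if max_retry != -1 and attempt >= max_retry:
                raise
            # Yield to the event loop so other tasks keep running during the pause.
            await asyncio.sleep(retry_delay)


# Hypothetical stand-in for a command that is broken at first and fixed later.
calls = []


def flaky_spawn():
    calls.append(1)
    if len(calls) < 3:
        raise FileNotFoundError(2, "No such file or directory")
    return "started"


result = asyncio.run(start_with_retry(flaky_spawn, retry_delay=0.01))
print(result, "after", len(calls), "attempts")
```

With max_retry = -1 the loop never gives up, but each failed attempt costs at least retry_delay seconds instead of spinning.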

What about setting max_retry to something other than -1?
A possible workaround is to set max_retry to, say, 5. That stops circusd from becoming unresponsive, but it also means that a process with a correct cmd is no longer restarted once it has stopped 5 times in a row for other reasons.

Reproducing the issue
I have created a proof of concept for the issue so it can be easily reproduced. Instructions on running it are in the README. I hope this helps.

@MFlossmann
Contributor

Apart from the general question of “Should circus be ‘taken hostage’ by faulty commands?”: Would the Flapping Plugin solve your issue?

@jeroenhe
Author

> Apart from the general question of “Should circus be ‘taken hostage’ by faulty commands?”: Would the Flapping Plugin solve your issue?

Thank you for your reply.

In my proof of concept I have already made use of the flapping plugin, but this doesn't solve the issue of circusd becoming unavailable. The config related to the flapping plugin looks like this:

[plugin:flapping]
use = circus.plugins.flapping.Flapping
# the number of times a process can restart, within window seconds, before we consider it flapping (default: 2)
attempts = 2
# the time window in seconds to test for flapping. If the process restarts more than attempts times within this time window, we consider it a flapping process. (default: 1)
window = 60
# the number of times we attempt to start a process that has been flapping, before we abandon and stop the whole watcher. (default: 5) Set to -1 to disable max_retry and retry indefinitely.
max_retry = -1
# time in seconds to wait until we try to start again a process that has been flapping. (default: 7)
retry_in = 7
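
As I read the comments on attempts and window, the rule could be sketched like this in Python. This follows the wording of the comments only and is not taken from the plugin's source; `is_flapping` and `restart_times` are names invented for the example.

```python
def is_flapping(restart_times, attempts=2, window=60.0):
    """Return True if more than `attempts` restarts fall within any `window`-second span.

    restart_times is a list of restart timestamps in seconds.
    """
    times = sorted(restart_times)
    # attempts + 1 consecutive restarts fitting inside the window counts as flapping.
    for i in range(len(times) - attempts):
        if times[i + attempts] - times[i] <= window:
            return True
    return False


print(is_flapping([0, 10, 20]))   # three restarts within 20 seconds
print(is_flapping([0, 50, 100]))  # three restarts spread over 100 seconds
```

With attempts = 2 and window = 60, a cmd that fails instantly would be classed as flapping almost immediately under this reading.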

If there is something I can change in this configuration to prevent circusd from becoming unresponsive, please tell me :)

I understand your "should not be taken hostage" feeling, but for us it does happen. For example, we have lots of Java applications, and sometimes we accidentally deploy an application to a server that doesn't yet have the Java runtime it references, which then causes mayhem for all circusd-managed processes on that server. Such mistakes are hard to prevent, and they are usually repaired easily and quickly, but on top of that, none of the other circusd-managed services are managed any more (and circusd requires a restart, not a reload).
