Ability to add/enable collections of optional monitors #53

Open
mffiedler opened this issue May 14, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@mffiedler
Collaborator

mffiedler commented May 14, 2020

This might be stretching the original intent of Cerberus, but I see a trend. As we add checks, there are three patterns we could follow: a) make the new check a default and always run it, b) give the new check its own option in the config, or c) introduce the idea of collections of optional checks, or maybe just one optional collection of "verbose health checks" for simplicity.

There are a lot of detailed things that could be monitored on a cluster; whether Cerberus should monitor them is open for discussion (issue #42). As new checks are added, the monitor loop time grows at least linearly with the number of monitored namespaces, and faster when pod checks are included (PR #52).

For discussion: should we identify a core set of critical checks and provide some mechanism for optional/verbose checks without adding a config flag for every one of them?
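
To make option c) concrete, here is a minimal sketch of how checks could be grouped into collections so that one setting enables a whole group. This is not existing Cerberus code: the `Check` class, the `collection` field, and the collection names `"core"` and `"verbose"` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Check:
    """A single health check (hypothetical structure, not a Cerberus type)."""
    name: str
    run: Callable[[], bool]      # returns True when the component is healthy
    collection: str = "core"     # assumed collection names: "core", "verbose"


def select_checks(checks: Iterable[Check], enabled: Set[str]) -> List[Check]:
    """Keep only the checks whose collection is enabled."""
    return [c for c in checks if c.collection in enabled]


def run_iteration(checks: Iterable[Check], enabled: Set[str]) -> Dict[str, bool]:
    """Run every enabled check once and return its result by name."""
    return {c.name: c.run() for c in select_checks(checks, enabled)}
```

With something like this, a single config entry listing the enabled collections would replace one flag per check: `run_iteration(checks, {"core"})` for the default loop, `run_iteration(checks, {"core", "verbose"})` when the verbose checks are wanted.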

/cc: @paigerube14 @chaitanyaenr @yashashreesuresh

@chaitanyaenr
Collaborator

chaitanyaenr commented May 14, 2020

Adding a single verbose checks option sounds good, but I think those checks should just be part of logs/warnings instead of being considered for setting the go/no-go signal, IMHO. For example, checking whether the master nodes are marked as unschedulable should just be logged as info, as the user might intentionally mark them as schedulable depending on the need. Any check that is taken into account for setting the go/no-go signal should be exposed as an option, so the user can disable it when there's a known problem they are fine with ignoring.

As we add more checks, the monitor time is going to increase, especially on a large-scale cluster, as @mffiedler mentioned. We might want to look at making Cerberus checks concurrent (#23).
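
As a rough illustration of what concurrent checks could look like, the sketch below runs independent checks in a thread pool using Python's standard `concurrent.futures`. It assumes each check is a zero-argument callable returning a boolean; `run_checks_concurrently` and its parameters are hypothetical names, not part of Cerberus.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks_concurrently(
    checks: Dict[str, Callable[[], bool]], max_workers: int = 4
) -> Dict[str, bool]:
    """Run each check in a worker thread and collect results by name.

    Threads suit checks that mostly wait on API calls (I/O bound);
    an exception raised inside a check propagates from result().
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the checks overlap in time.
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}
```

With enough workers, the loop time then tends toward the slowest single check rather than the sum of all checks, which is the scaling concern raised above.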

Thoughts?

@paigerube14
Collaborator

I agree with Naga Ravi: I think the verbose checks should just log information about the current state of the cluster. That should be enough for the user to verify the specific conditions they care about, or to narrow down what went wrong.
I think it might be nice to have all the options that feed the go/no-go signal set in the config file, so it is clear that these are the checks the user really cares about. Any extra verbose checks could then be passed in through command-line options. Thoughts?

I definitely think that as we add more checks and options, we are going to need the Cerberus checks to be concurrent.

@yashashreesuresh
Contributor

yashashreesuresh commented May 19, 2020

I would go with the idea of adding one optional collection of "verbose health checks". Since there are a lot of detailed things that could be monitored on a cluster, all the checks that are not taken into account for setting the go/no-go signal can be placed under "verbose health checks" by default, and the user can select the checks they need. For example, it is redundant to check whether the master nodes are marked as unschedulable in every iteration. Some things needn't be monitored at all, and others needn't be monitored in every iteration, since each check adds to the monitor loop time.
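
One way to avoid running every check in every iteration is a per-check interval. The sketch below is only an illustration of that idea: `due_checks`, the `intervals` mapping, and the check names are made up for the example, not existing Cerberus options.

```python
from typing import Dict, List


def due_checks(iteration: int, intervals: Dict[str, int]) -> List[str]:
    """Return the names of the checks that should run in this iteration.

    intervals maps a check name to N, meaning "run every N iterations"
    (N >= 1). Iteration numbering is assumed to start at 0.
    """
    return [name for name, every in intervals.items() if iteration % every == 0]


# Hypothetical settings: node status every loop, master schedulability every 10th.
intervals = {"node_status": 1, "master_schedulable": 10}
for iteration in (0, 1, 10):
    print(iteration, due_checks(iteration, intervals))
```

Cheap, critical checks would keep an interval of 1, while rarely changing state such as master schedulability could be sampled much less often.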

@chaitanyaenr
Collaborator

I think we are all in agreement, as per the discussion on Slack. The idea is to add a way for Cerberus to run user-provided checks ("bring your own checks") and to either consider or ignore them when setting the go/no-go signal, based on the user's requirements. This should accommodate the verbose/optional checks as well, provided the output of the checks is in a format Cerberus can understand.
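
A minimal sketch of the "bring your own checks" idea, under stated assumptions: each user check is a callable returning a dict with `healthy` and `message` keys (this output format is an assumption, the thread does not define one), and a hypothetical `gating` flag decides whether the result affects the go/no-go signal or is only logged.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, List

logger = logging.getLogger("byo_checks")


@dataclass
class UserCheck:
    """A user-provided check (hypothetical structure, not a Cerberus type)."""
    name: str
    run: Callable[[], Dict]   # assumed output: {"healthy": bool, "message": str}
    gating: bool = False      # True: result counts toward the go/no-go signal


def evaluate(core_ok: bool, user_checks: List[UserCheck]) -> bool:
    """Combine the core result with user checks into one go/no-go signal."""
    signal = core_ok
    for check in user_checks:
        result = check.run()
        # A missing "healthy" key is treated as a failure.
        healthy = bool(result.get("healthy", False))
        if check.gating:
            signal = signal and healthy
        elif not healthy:
            # Non-gating checks are informational only.
            logger.info("%s: %s", check.name, result.get("message", ""))
    return signal
```

A failing non-gating check leaves the signal untouched and only produces a log line, which matches the earlier suggestion that verbose checks should inform rather than decide.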
