bugfix: refactor alerts to accomodate for single-node clusters #1010
base: master
Conversation
For the sake of brevity, let:

- `Q` = `kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}` (allocatable), and
- `QQ` = `namespace_cpu:kube_pod_container_resource_requests:sum{}` (requested).

Both quota alerts relevant here exist in the form:

```
sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0
and
(sum(Q) by (cluster) - max(Q) by (cluster)) > 0
```

which, in the case of a single-node cluster (where `sum(Q) by (cluster)` = `max(Q) by (cluster)`), reduces to `sum(QQ) by (cluster) > 0`, i.e., the alert will fire if *any* resource requests exist.

To address this, drop the `max(Q) by (cluster)` buffer assumed for non-SNO clusters, reducing the SNO expression to:

```
sum(QQ) by (cluster) - sum(Q) by (cluster) > 0
```

(total requested - total allocatable > 0 triggers the alert), since there is only a single node, so a buffer of that sort does not make sense.

Signed-off-by: Pranshu Srivastava <[email protected]>
Force-pushed from 5b96fb5 to 5cd53d6.
```
and
(sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
```
Suggested change:

```diff
-sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
+0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
```
Since a `max(Q)` buffer is not applicable in SNO, how about a numeric buffer of 5% (or more)? That should help alert before things go out of budget.
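As a concrete illustration of the suggested 5% headroom (the node size below is hypothetical):

```promql
# On a hypothetical single node with 4 allocatable CPUs, the suggested form
#   sum(QQ) by (cluster) - 0.95 * sum(Q) by (cluster) > 0
# fires once total requests exceed 0.95 * 4 = 3.8 CPUs, i.e. slightly
# before allocatable capacity is fully committed, rather than exactly at it.
sum(namespace_cpu:kube_pod_container_resource_requests:sum) by (cluster)
  - 0.95 * sum(kube_node_status_allocatable{resource="cpu"}) by (cluster) > 0
```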
```
@@ -34,18 +34,34 @@
      } +
      if $._config.showMultiCluster then {
        expr: |||
          sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
          (count(kube_node_info) == 1
```
If `showMultiCluster` is `true`, that implies the `cluster` label is available, so the check here should probably use the `cluster` label (so that each cluster is checked on whether it has a single node). Additionally, I suggest a de-dupe for multiple KSM instances using `max`, like so:
Suggested change:

```diff
-(count(kube_node_info) == 1
+(count by (cluster) (max by (cluster, node) (kube_node_info)) == 1
```
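To illustrate why the `max` wrapper de-dupes (assuming two kube-state-metrics replicas scraping the same cluster, distinguished only by their `instance` label):

```promql
# Two KSM replicas both export the same node:
#   kube_node_info{cluster="c1", node="n1", instance="ksm-0"}
#   kube_node_info{cluster="c1", node="n1", instance="ksm-1"}
# A bare count(kube_node_info) sees 2 series and misses the SNO case.
# Collapsing to one series per (cluster, node) first:
count by (cluster) (max by (cluster, node) (kube_node_info)) == 1
# now yields 1 for a genuine single-node cluster even with duplicate KSMs.
```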
Wouldn't `(count(kube_node_info) == 1 and ...` mess up the returned value (e.g. it would always return 1)? Given the complexity of the expression, I'd advocate for some unit tests in the first place to assert the current rule.
I'll add some unit tests around this, but I'm not sure why this will always return 1?
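A `promtool` unit test along these lines could pin down the behavior (a sketch only; the rule file name, series values, and alert labels are assumptions, and the rendered rule file would need to be referenced):

```yaml
# tests.yaml — run with: promtool test rules tests.yaml
rule_files:
  - alerts.yaml  # hypothetical: the rendered kubernetes-mixin alert rules

evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Single-node cluster: 2 allocatable CPUs, 3 CPUs requested.
      - series: 'kube_node_info{cluster="c1",node="n1",job="kube-state-metrics"}'
        values: '1x60'
      - series: 'kube_node_status_allocatable{cluster="c1",node="n1",job="kube-state-metrics",resource="cpu"}'
        values: '2x60'
      - series: 'namespace_cpu:kube_pod_container_resource_requests:sum{cluster="c1",namespace="default"}'
        values: '3x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: KubeCPUOvercommit
        exp_alerts:
          - exp_labels:
              cluster: c1
              severity: warning  # assumption: matches the mixin's configured severity
```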
`foo == 1 and bar` will always return 1 (the right-hand side is only used for label matching).
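A minimal demonstration (both sides of `and` here carry empty label sets, so they match):

```promql
# The comparison filters but keeps the LHS sample value; `and` then only
# gates on label matching, so the result value still comes from the LHS.
vector(1) == 1 and vector(5)
# -> {} 1   (not 5)
```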
```
(count by (%(clusterLabel)s) (max by (%(clusterLabel)s, node) (kube_node_info)) == 1 and
sum by (%(clusterLabel)s) (namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) -
sum by (%(clusterLabel)s) (kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) > 0)
```
Apologies if I'm missing something here, but this seems to have boolean expressions on both sides, similar to, say, `vector(1) == 1 and vector(2) - vector(1) > 0` (a scalar-like vector on the RHS of `and` instead of a regular instant vector, so no label matching)?
If this were the aforementioned case, I would've preferred the `bool` operator to avoid the default filtering behavior, but it seemed to suffice without that.
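For reference, the `bool` modifier turns a filtering comparison into a 0/1-valued sample:

```promql
# Without bool: on a 3-node cluster the comparison drops the sample
# entirely, yielding an empty result.
count(kube_node_info) == 1
# With bool: a sample is always returned, valued 0 or 1.
count(kube_node_info) == bool 1   # 1 on SNO, 0 otherwise
```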