feat (monitoring): [alerts] enable new recommended experience for aks clusters #435
Pod level alert: at least one Job instance did not complete successfully for the last 6 hours.
Pod level alert: The average CPU usage per container exceeds 95% for the last 5 minutes.
…erAverageMemoryHigh Pod level alert: The average memory usage per container exceeds 95% for the last 5 minutes
Pod level alert: One or more pods is in a failed state for the last 5 minutes
The platform level alert "Node CPU percentage" replaces this
no replacement available
Node level alert: A node has been unreachable for the last 15 minutes
The platform level alert "Node memory working set percentage is greater than 100%" replaces this
…edCount Cluster level alert: One or more containers within pods have been killed due to out-of-memory (OOM) events for the last 5 minutes
Pod level alert: The average usage of Persistent Volumes (PVs) on pod exceeds 80% for the last 15 minutes
Pod level alert: The percentage of pods in a ready state falls below 80% for any deployment or daemonset in the Kubernetes cluster for the last 5 minutes
…rRestart Pod level alert: One or more containers within pods in the Kubernetes cluster have been restarted at least once within the last hour
@ferantivero thanks for this! I can see a ton of work here - thanks so much. The changes all make sense.
We should ensure we link to the Recommended alert rules for Kubernetes clusters in the reference architecture too, to make sure it's clear where these came from.
#sign-off Please let's consider merging this once we have landed the desired changes at the RA level.
cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml
LGTM!
Just one suggestion, take or leave.
#sign-off
Going to assume that you'll pass on my suggestion by the sign off signal. If you do change your mind on that, maybe you can add it (back in and commented out) in the other PR you've got. I'll merge now.
Thanks for merging this @chad. Please let's take a look at this comment: #435 (comment)
Understood, then for sure my suggestion was terrible. Carry on :)
WHY?
We wanted to switch from the legacy/retired Container insights metric alert rules to the new recommended Prometheus metric alert rules.
WHAT Changed?
delete alertable metrics configuration settings from ConfigMap
replace legacy Container CPU % CI alert w/ KubeContainerAverageCPUHigh
replace legacy Container working set memory % CI alert w/ KubeContain…
replace legacy Failed Pod counts CI alert w/ KubePodFailedState
replace legacy Node NotReady status CI alert w/ KubeNodeUnreachable
replace legacy OOM Killed Containers CI alert w/ KubeContainerOOMKill…
replace legacy Persistent Volume Usage % CI alert w/ KubePVUsageHigh
replace legacy Pods ready % CI alert w/ KubePodReadyStateLow
replace legacy Restarting container count CI alert w/ KubePodContaine…
add extra recommended Prometheus Pod level metric alert rules
add extra recommended Prometheus Node level metric alert rules
add extra recommended Prometheus Cluster level metric alert rules
disable legacy Node CPU % CI alert
disable legacy Node Disk Usage % CI alert
disable legacy Node working set memory % CI alert
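For readers unfamiliar with Prometheus-style alert rules, here is a rough sketch of what a rule like KubePodFailedState ("one or more pods is in a failed state for the last 5 minutes") could look like in upstream Prometheus rule-file syntax. This is not the definition shipped in this PR. The group name, the expression, the severity label, and the annotation text are all assumptions for illustration; only the alert name and the 5 minute window come from the change list above. The metric `kube_pod_status_phase` is the one exposed by kube-state-metrics.

```yaml
# Illustrative only: upstream Prometheus rule-file format, not the rule group deployed by this PR.
groups:
  - name: pod-level-example            # hypothetical group name
    rules:
      - alert: KubePodFailedState      # alert name taken from the change list
        # Assumed expression: fire for any pod that kube-state-metrics reports in the Failed phase.
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Failed"}) > 0
        # The condition must hold continuously for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a failed state for 5 minutes."
```

The `for` clause is what turns a point-in-time check into the "for the last 5 minutes" wording used in the alert descriptions.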
Test
Alerts
closes: #319599