Commit
feat (monitoring): [alerts] enable new recommended experience for aks clusters (#435)

* enable new recommended alert rules
* replace legacy Completed job count CI alert w/ KubeJobStale
  Pod level alert: at least one Job instance did not complete successfully for the last 6 hours.
* replace legacy Container CPU % CI alert w/ KubeContainerAverageCPUHigh
  Pod level alert: The average CPU usage per container exceeds 95% for the last 5 minutes.
* replace legacy Container working set memory % CI alert w/ KubeContainerAverageMemoryHigh
  Pod level alert: The average memory usage per container exceeds 95% for the last 5 minutes.
* replace legacy Failed Pod counts CI alert w/ KubePodFailedState
  Pod level alert: One or more pods is in a failed state for the last 5 minutes.
* disable legacy Node CPU % CI alert
  Platform level alert "Node cpu percentage" is replacing this.
* disable legacy Node Disk Usage % CI alert
  No replacement available.
* replace legacy Node NotReady status CI alert w/ KubeNodeUnreachable
  Node level alert: A node has been unreachable for the last 15 minutes.
* disable legacy Node working set memory % CI alert
  Platform level alert "Node memory working set percentage is greater than 100%" is replacing this.
* replace legacy OOM Killed Containers CI alert w/ KubeContainerOOMKilledCount
  Cluster level alert: One or more containers within pods have been killed due to out-of-memory (OOM) events for the last 5 minutes.
* replace legacy Persistent Volume Usage % CI alert w/ KubePVUsageHigh
  Pod level alert: The average usage of Persistent Volumes (PVs) on pod exceeds 80% for the last 15 minutes.
* replace legacy Pods ready % CI alert w/ KubePodReadyStateLow
  Pod level alert: The percentage of pods in a ready state falls below 80% for any deployment or daemonset in the Kubernetes cluster for the last 5 minutes.
* replace legacy Restarting container count CI alert w/ KubePodContainerRestart
  Pod level alert: One or more containers within pods in the Kubernetes cluster have been restarted at least once within the last hour.
* add extra recommended Prometheus Pod level metric alert rules
* add extra recommended Prometheus Node level metric alert rules
* add extra recommended Prometheus Cluster level metric alert rules
* remove unused module
* comment out configmap CI alert configuration
* Address PR Feedback: remove legacy alerting configuration instead of commenting out