Grafana Agent components unhealthy because of k8s API server timeout during pod startup #7053
Labels
bug, needs-attention
What's wrong?
We are running Grafana Agent StatefulSet pods with Horizontal Pod Autoscaling (HPA) enabled in our Kubernetes cluster.
When a new Grafana Agent pod is launched during a scale-up, we sometimes see the components prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes become unhealthy because of an API server timeout during pod initialization. After that, the pod stays in the Running state with these three components reported as unhealthy, and it never retries the connection to the API server.
Here is how it looks in the Grafana Agent UI:
On checking, we did not find any issue with the Kubernetes API server itself, and the other Grafana Agent pods in the same StatefulSet were running with all of their components (prometheus.operator.servicemonitors, prometheus.operator.podmonitors, prometheus.operator.probes, and others) healthy. Only the newly launched pod started by the HPA showed these components as unhealthy from the moment it started.
The issue occurs intermittently rather than on every scale-up, and it is reported by the following Prometheus expression:
sum (agent_component_controller_running_components{health_type!="healthy"}) > 0
What is expected:
When prometheus.operator.servicemonitors, prometheus.operator.podmonitors, or prometheus.operator.probes hit an API server timeout during initialization, Grafana Agent should keep retrying the connection to the API server so that the components can recover to a healthy state on their own, without needing a manual restart or deletion of the pod.
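To illustrate what "keep retrying" could look like, here is a minimal Go sketch of a retry-with-backoff loop around the RESTMapper creation that fails in the logs below. This is not the agent's actual crd-manager code; the helper name buildRESTMapperWithRetry and the backoff values are made up for this example, and it only uses standard client-go/apimachinery helpers.

    package main

    import (
    	"log"
    	"time"

    	"k8s.io/apimachinery/pkg/api/meta"
    	"k8s.io/apimachinery/pkg/util/wait"
    	"k8s.io/client-go/discovery"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/restmapper"
    )

    // buildRESTMapperWithRetry (hypothetical) retries the discovery calls that the
    // logs show failing ("could not create RESTMapper from config") instead of
    // giving up after the first i/o timeout.
    func buildRESTMapperWithRetry(cfg *rest.Config) (meta.RESTMapper, error) {
    	var mapper meta.RESTMapper

    	backoff := wait.Backoff{
    		Duration: 2 * time.Second, // wait 2s before the first retry
    		Factor:   2.0,             // double the wait on each attempt
    		Jitter:   0.1,
    		Steps:    8, // give up only after several attempts
    	}

    	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
    		dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    		if err != nil {
    			return false, err // bad config, not worth retrying
    		}
    		groups, err := restmapper.GetAPIGroupResources(dc)
    		if err != nil {
    			log.Printf("API server not reachable yet, will retry: %v", err)
    			return false, nil // transient error (e.g. dial tcp i/o timeout): retry
    		}
    		mapper = restmapper.NewDiscoveryRESTMapper(groups)
    		return true, nil // success, stop retrying
    	})
    	return mapper, err
    }

    func main() {
    	cfg, err := rest.InClusterConfig() // same in-cluster config the agent logs mention
    	if err != nil {
    		log.Fatal(err)
    	}
    	if _, err := buildRESTMapperWithRetry(cfg); err != nil {
    		log.Fatalf("still unable to reach the Kubernetes API: %v", err)
    	}
    	log.Println("RESTMapper created; the component could now report healthy")
    }

The point of the sketch is simply that a transient "dial tcp ... i/o timeout" during pod startup is treated as retryable, rather than leaving the component permanently unhealthy.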
Steps to reproduce
Grafana Agent Helm chart version 0.31.0, app version v0.39.0.
The Helm values are included in the Configuration section below.
Environment:
Infrastructure: Kubernetes
Deployment tool: Helm
System information
No response
Software version
Grafana Agent v0.39.0
Configuration
Helm values.yaml:
nameOverride: grafana-agent
crds:
  create: false
image:
  tag: v0.39.0
service:
  enabled: true
controller:
  type: 'statefulset'
  replicas: 4
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 50
    minReplicas: 4
    maxReplicas: 20
agent:
  resources:
    requests:
      cpu: "4"
      memory: "20Gi"
    limits:
      cpu: "4"
      memory: "20Gi"
  mode: 'flow'
  clustering:
    enabled: true
  configMap:
    content: |
      prometheus.remote_write "mimir" {
        endpoint {
          url = "https://mimir-url.abcxyz/api/v1/push"
          headers = {
            "X-Scope-OrgID" = "tenantid",
          }
        }
      }

      /*
      Service Monitors
      */
      prometheus.operator.servicemonitors "discover_servicemonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        selector {
          match_expression {
            key = "app.kubernetes.io/part-of"
            operator = "NotIn"
            values = ["prometheus-operator"]
          }
          match_expression {
            key = "app.kubernetes.io/instance"
            operator = "NotIn"
            values = ["prom-op"]
          }
        }
        clustering {
          enabled = true
        }
      }

      /*
      Pod Monitors
      */
      prometheus.operator.podmonitors "discover_podmonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

      /*
      Probes
      */
      prometheus.operator.probes "discover_probes" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }
Logs
ts=2024-09-23T05:04:54.362676449Z level=info msg="now listening for http traffic" service=http addr=0.0.0.0:80
ts=2024-09-23T05:04:54.362152043Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:04:54.361663857Z level=info msg="scheduling loaded components and services"
ts=2024-09-23T05:04:54.362133503Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:04:54.362076197Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:04:54.361499105Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=eaf937c20f85f3ce18dd408efb23c4ae duration=22.012674ms
ts=2024-09-23T05:04:54.361405421Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-09-23T05:05:24.363526504Z level=error msg="error running crd manager" component=prometheus.operator.podmonitors.discover_podmonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363546239Z level=info msg="scrape manager stopped" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:05:24.363568264Z level=info msg="scrape manager stopped" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:05:24.363491071Z level=error msg="error running crd manager" component=prometheus.operator.probes.discover_probes err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363597696Z level=info msg="scrape manager stopped" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:05:24.363558031Z level=error msg="error running crd manager" component=prometheus.operator.servicemonitors.discover_servicemonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:44.36759843Z level=info msg="peers changed" new_peers=grafana-agent-5
ts=2024-09-23T05:05:44.367431093Z level=info msg="starting cluster node" peers="" advertise_addr=10.123.123.30:80