
Grafana Agent components unhealthy because of k8s API server timeout during pod startup #7053

Open
ishaanmanaktalia opened this issue Oct 9, 2024 · 1 comment
Labels
bug (Something isn't working) · needs-attention (An issue or PR has been sitting around and needs attention.)

Comments

@ishaanmanaktalia

What's wrong?

We are running Grafana Agent StatefulSet pods with Horizontal Pod Autoscaling (HPA) enabled in our Kubernetes cluster.
When a new grafana-agent pod is launched during horizontal autoscaling, we sometimes see the components prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes become unhealthy because of an API server timeout during pod initialisation. After that, the grafana-agent pod stays in the Running state with these three components unhealthy and never retries the connection to the API server.

Here is how it looks in the Grafana Agent UI:
[Screenshots: Grafana Agent UI showing prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes as unhealthy]

On checking, we did not notice any issue with the k8s API server itself, and the other grafana-agent pods in the same StatefulSet were running with all of their components (prometheus.operator.servicemonitors, prometheus.operator.podmonitors, prometheus.operator.probes, and others) healthy. Only the newly launched pod, started by the HPA for the grafana-agent StatefulSet, kept showing these components as unhealthy from the moment it started.

The issue occurs intermittently, not every time, but it is reported by the Prometheus metric expression:
sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
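For context, a Prometheus alerting rule built on this expression could look like the sketch below; the rule name, the for: duration, and the labels are illustrative placeholders, not our exact setup:

groups:
  - name: grafana-agent
    rules:
      - alert: GrafanaAgentComponentsUnhealthy
        # fires when any Grafana Agent component reports a non-healthy state
        expr: sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "One or more Grafana Agent components have been unhealthy for more than 10 minutes."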

What is expected:
When the prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes components hit an API server timeout during initialization, grafana-agent should keep retrying the connection so the components can become healthy again on their own, without needing a manual restart/deletion of the pod.

Steps to reproduce

Grafana Agent Helm chart version 0.31.0, app version v0.39.0.
Helm values are included in the Configuration section below.

Environment:

Infrastructure: Kubernetes
Deployment tool: Helm

System information

No response

Software version

Grafana Agent v0.39.0

Configuration

Helm values.yaml:
nameOverride: grafana-agent
crds:
  create: false
image:
  tag: v0.39.0
service:
  enabled: true
controller:
  type: 'statefulset'
  replicas: 4
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 50
    minReplicas: 4
    maxReplicas: 20
agent:
  resources:
    requests:
      cpu: "4"
      memory: "20Gi"
    limits:
      cpu: "4"
      memory: "20Gi"
  mode: 'flow'
  clustering:
    enabled: true
  configMap:
    content: |
      prometheus.remote_write "mimir" {
        endpoint {
          url = "https://mimir-url.abcxyz/api/v1/push"
          headers = {
            "X-Scope-OrgID" = "tenantid",
          }
        }
      }

      /* Service Monitors */
      prometheus.operator.servicemonitors "discover_servicemonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        selector {
          match_expression {
            key      = "app.kubernetes.io/part-of"
            operator = "NotIn"
            values   = ["prometheus-operator"]
          }
          match_expression {
            key      = "app.kubernetes.io/instance"
            operator = "NotIn"
            values   = ["prom-op"]
          }
        }
        clustering {
          enabled = true
        }
      }

      /* Pod Monitors */
      prometheus.operator.podmonitors "discover_podmonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

      /* Probes */
      prometheus.operator.probes "discover_probes" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

Logs

ts=2024-09-23T05:04:54.362676449Z level=info msg="now listening for http traffic" service=http addr=0.0.0.0:80
ts=2024-09-23T05:04:54.362152043Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:04:54.361663857Z level=info msg="scheduling loaded components and services"
ts=2024-09-23T05:04:54.362133503Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:04:54.362076197Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:04:54.361499105Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=eaf937c20f85f3ce18dd408efb23c4ae duration=22.012674ms
ts=2024-09-23T05:04:54.361405421Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-09-23T05:05:24.363526504Z level=error msg="error running crd manager" component=prometheus.operator.podmonitors.discover_podmonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363546239Z level=info msg="scrape manager stopped" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:05:24.363568264Z level=info msg="scrape manager stopped" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:05:24.363491071Z level=error msg="error running crd manager" component=prometheus.operator.probes.discover_probes err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363597696Z level=info msg="scrape manager stopped" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:05:24.363558031Z level=error msg="error running crd manager" component=prometheus.operator.servicemonitors.discover_servicemonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:44.36759843Z level=info msg="peers changed" new_peers=grafana-agent-5
ts=2024-09-23T05:05:44.367431093Z level=info msg="starting cluster node" peers="" advertise_addr=10.123.123.30:80

@ishaanmanaktalia added the bug (Something isn't working) label on Oct 9, 2024
github-actions bot (Contributor) commented Nov 9, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@github-actions bot added the needs-attention (An issue or PR has been sitting around and needs attention.) label on Nov 9, 2024