DCGM exporter crashes when installed by helm3 #180
Comments
Is it because DCGM exporter does not support GeForce cards, as #141 said? I'm still confused, though, since my DCGM version is 2.3.1 and @dualvtable said that these errors are prevented after v2.1.2.
I am facing the same problem
As a workaround, I've disabled the liveness and readiness probes. The pod is no longer terminated or restarted, and Prometheus can now scrape the metrics. Perhaps, in the gpu-monitoring-tools/pkg/server.go Line 100 in 75e0a11
You can also override the livenessProbe: Set it to:
The issue is that the livenessProbe is set to 5 seconds, which is not enough time for the process to start.
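The override values from the comment above are not preserved in this thread. As a rough sketch only, assuming the exporter serves its health check at /health on port 9400 and that the pod spec accepts a standard Kubernetes probe definition (both are assumptions; check the chart's templates for your version), a more forgiving liveness probe might look like this:

```yaml
# Hypothetical override: the values are illustrative,
# not the ones from the original comment.
livenessProbe:
  httpGet:
    path: /health            # assumed health endpoint
    port: 9400               # assumed metrics port
  initialDelaySeconds: 30    # wait before the first probe so the exporter can start
  periodSeconds: 10          # probe interval after the initial delay
  failureThreshold: 3        # consecutive failures before the container is restarted
```

Whether this can be set through Helm values or requires editing the DaemonSet directly depends on the chart version in use.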
Thanks, it works fine now.
Hi all,
I followed the instructions in this guide to install the dcgm exporter in the prometheus framework. However, the dcgm exporter crashes.
Running kubectl logs dcgm-exporter-1619697251-pzb8q shows the following:
time="2021-04-29T12:35:59Z" level=info msg="Starting dcgm-exporter"
time="2021-04-29T12:35:59Z" level=info msg="DCGM successfully initialized!"
time="2021-04-29T12:35:59Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-29T12:35:59Z" level=info msg="Starting webserver"
time="2021-04-29T12:35:59Z" level=info msg="Pipeline starting"
Running kubectl describe pod dcgm-exporter-1619697251-pzb8q shows the following:
Warning Unhealthy 58m (x5 over 59m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 29m (x43 over 59m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
My Kubernetes version is 1.21.0, and the Prometheus chart is kube-prometheus-stack-15.2.3. The DCGM exporter version is dcgm-exporter-2.3.1. I have two GeForce 1080 Ti cards in my machine.
I don't know what exactly causes this failure. I've tried the suggestions from many posts, but none of them solved my problem. This problem is quite urgent for me since it's part of my undergraduate thesis, so any help will be greatly appreciated. Thanks in advance.