
DCGM exporter crashes when installed by helm3 #180

Open
jiangxiaosheng opened this issue Apr 29, 2021 · 5 comments

@jiangxiaosheng

Hi all,
I followed the instructions in this guide to install the DCGM exporter alongside the Prometheus stack. However, the dcgm-exporter pod crashes.
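For reference, the install steps from that guide are essentially these (repo URLs as in the dcgm-exporter README; the numeric release-name suffixes come from --generate-name, so yours will differ):

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update
$ helm install prometheus-community/kube-prometheus-stack --generate-name
$ helm install gpu-helm-charts/dcgm-exporter --generate-name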
Running kubectl logs dcgm-exporter-1619697251-pzb8q shows:

time="2021-04-29T12:35:59Z" level=info msg="Starting dcgm-exporter"
time="2021-04-29T12:35:59Z" level=info msg="DCGM successfully initialized!"
time="2021-04-29T12:35:59Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-29T12:35:59Z" level=info msg="Starting webserver"
time="2021-04-29T12:35:59Z" level=info msg="Pipeline starting"

Running kubectl describe pod dcgm-exporter-1619697251-pzb8q shows:

Warning Unhealthy 58m (x5 over 59m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 29m (x43 over 59m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
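(To confirm that the 503 comes from the exporter's own /health endpoint rather than anything in between, the pod can be port-forwarded and probed by hand; 9400 is the chart's default metrics port:)

$ kubectl port-forward pod/dcgm-exporter-1619697251-pzb8q 9400:9400
# then, from another shell:
$ curl -i http://localhost:9400/health     # returns 503 while no metrics have been collected
$ curl -s http://localhost:9400/metrics | head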

My Kubernetes version is 1.21.0, the Prometheus chart is kube-prometheus-stack-15.2.3, and the dcgm-exporter version is 2.3.1. I have two GeForce GTX 1080 Ti cards in my machine.

I don't know what exactly causes this failure, and I've tried the suggestions from several related posts, but none of them solved the problem. This issue is quite urgent for me since it's part of my undergraduate thesis, so any help will be greatly appreciated. Thanks in advance.

@jiangxiaosheng
Author

Is it because the DCGM exporter does not support GeForce cards, as #141 says? But I'm still confused, since my dcgm-exporter version is 2.3.1 and @dualvtable said that these errors are prevented after v2.1.2.

@fabito

fabito commented May 11, 2021

I am facing the same problem. Here is the output of dcgmi discovery -l:

4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:19:00.0                                         |
|        | Device UUID: GPU-fc67b07b-d44e-d387-2623-cdecf349ef9b                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:1A:00.0                                         |
|        | Device UUID: GPU-5ad20c11-6164-b709-27a4-e75eed635b49                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:67:00.0                                         |
|        | Device UUID: GPU-63caf029-1325-754e-0361-e30160b0432f                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:68:00.0                                         |
|        | Device UUID: GPU-71e4aeec-8cc9-db32-782a-87ef0d274db1                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

and nvidia-smi:

Tue May 11 17:36:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Quadro R...  On   | 00000000:19:00.0 Off |                  Off |
| 34%   28C    P8     8W / 230W |   4077MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Quadro R...  On   | 00000000:1A:00.0 Off |                  Off |
| 34%   30C    P8    16W / 230W |   1173MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Quadro R...  On   | 00000000:67:00.0 Off |                  Off |
| 34%   30C    P8     7W / 230W |   1822MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Quadro R...  On   | 00000000:68:00.0 Off |                  Off |
| 33%   32C    P8    14W / 230W |   1830MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1680644      C   /opt/conda/bin/python            1231MiB |
|    0   N/A  N/A   2798996      C   tritonserver                     2836MiB |
|    1   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   2797762      C   /usr/bin/python3                 1165MiB |
|    2   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   2798996      C   tritonserver                     1812MiB |
|    3   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  9MiB |
|    3   N/A  N/A      1543      G   /usr/bin/gnome-shell                3MiB |
|    3   N/A  N/A   2798996      C   tritonserver                     1810MiB |
+-----------------------------------------------------------------------------+

@fabito

fabito commented May 11, 2021

As a workaround I've disabled both probes (liveness and readiness). The pod is no longer terminated/restarted, and Prometheus can now scrape the metrics.
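One way to do this by hand, as a sketch (note it edits the live DaemonSet, so a later helm upgrade will put the probes back):

$ kubectl edit daemonset.apps/dcgm-exporter
# in the editor, delete the livenessProbe: and readinessProbe: blocks
# under spec.template.spec.containers[0], then save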

Perhaps, in the /health endpoint, updateMetrics() should be invoked (at least once) before getMetrics()?

if s.getMetrics() == "" {
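A minimal sketch of that idea, assuming a handler shaped like the quoted check (the MetricsServer type, the metrics field, and the 503 response below are inferred stand-ins, not the exporter's verified source):

package main

import "net/http"

// MetricsServer is a stand-in for the exporter's server type; in the real
// code, updateMetrics/getMetrics would be backed by the DCGM pipeline.
type MetricsServer struct{ metrics string }

func (s *MetricsServer) getMetrics() string { return s.metrics }

func (s *MetricsServer) updateMetrics() {
	// In the real exporter this would run one collection pass; it is a
	// placeholder here so the sketch compiles.
}

// Health shows the proposed ordering: attempt a collection pass before
// failing the probe, so the first probe after startup doesn't race the
// pipeline and return 503.
func (s *MetricsServer) Health(w http.ResponseWriter, r *http.Request) {
	if s.getMetrics() == "" {
		s.updateMetrics()
	}
	if s.getMetrics() == "" {
		http.Error(w, "no metrics collected yet", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	s := &MetricsServer{}
	http.HandleFunc("/health", s.Health)
	http.ListenAndServe(":9400", nil)
}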

@Sanhajio

You can also override the livenessProbe:
$ kubectl edit daemonset.apps/dcgm-exporter

Set it to:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9400
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

The issue is that the livenessProbe's initialDelaySeconds is set to 5 seconds, which is not enough time for the process to start.
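If you prefer a non-interactive change, the same override can be applied as a JSON patch (this assumes the exporter is the first container in the pod spec and that your DaemonSet has the same name):

$ kubectl patch daemonset dcgm-exporter --type=json -p='[
    {"op": "replace",
     "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
     "value": 60}
  ]'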

@nuckydong

> You can also override the livenessProbe:
> $ kubectl edit daemonset.apps/dcgm-exporter
> […]
> The issue is that the livenessProbe's initialDelaySeconds is set to 5 seconds, which is not enough time for the process to start.

Thanks, it's working fine now.
