
DCGM exporter crashes when installed by helm3 #180

Open
jiangxiaosheng opened this issue Apr 29, 2021 · 5 comments

@jiangxiaosheng

Hi all,
I followed the instructions in this guide to install the DCGM exporter alongside the Prometheus stack. However, the dcgm-exporter pod crashes.
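For reference, the install steps from that guide are essentially these (repo URLs as in the dcgm-exporter README; the numeric release-name suffixes come from --generate-name, so yours will differ):

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update
$ helm install prometheus-community/kube-prometheus-stack --generate-name
$ helm install gpu-helm-charts/dcgm-exporter --generate-name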
Running kubectl logs dcgm-exporter-1619697251-pzb8q shows:

time="2021-04-29T12:35:59Z" level=info msg="Starting dcgm-exporter"
time="2021-04-29T12:35:59Z" level=info msg="DCGM successfully initialized!"
time="2021-04-29T12:35:59Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-29T12:35:59Z" level=info msg="Starting webserver"
time="2021-04-29T12:35:59Z" level=info msg="Pipeline starting"

Running kubectl describe pod dcgm-exporter-1619697251-pzb8q shows:

Warning Unhealthy 58m (x5 over 59m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 29m (x43 over 59m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
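(To confirm that the 503 comes from the exporter's own /health endpoint rather than anything in between, the pod can be port-forwarded and probed by hand; 9400 is the chart's default metrics port:)

$ kubectl port-forward pod/dcgm-exporter-1619697251-pzb8q 9400:9400
# then, from another shell:
$ curl -i http://localhost:9400/health     # returns 503 while no metrics have been collected
$ curl -s http://localhost:9400/metrics | head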

My Kubernetes version is 1.21.0, the Prometheus chart is kube-prometheus-stack-15.2.3, and the dcgm-exporter version is 2.3.1. I have two GeForce GTX 1080 Ti cards in my machine.

I don't know what exactly causes this failure, and I've tried the suggestions from several related posts, but none of them solved the problem. This issue is quite urgent for me since it's part of my undergraduate thesis, so any help will be greatly appreciated. Thanks in advance.

@jiangxiaosheng
Author

Is it because the DCGM exporter does not support GeForce cards, as #141 says? But I'm still confused, since my dcgm-exporter version is 2.3.1 and @dualvtable said that these errors are prevented after v2.1.2.

@fabito

fabito commented May 11, 2021

I am facing the same problem. Here is the output of dcgmi discovery -l:

4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:19:00.0                                         |
|        | Device UUID: GPU-fc67b07b-d44e-d387-2623-cdecf349ef9b                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:1A:00.0                                         |
|        | Device UUID: GPU-5ad20c11-6164-b709-27a4-e75eed635b49                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:67:00.0                                         |
|        | Device UUID: GPU-63caf029-1325-754e-0361-e30160b0432f                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:68:00.0                                         |
|        | Device UUID: GPU-71e4aeec-8cc9-db32-782a-87ef0d274db1                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

and nvidia-smi:

Tue May 11 17:36:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Quadro R...  On   | 00000000:19:00.0 Off |                  Off |
| 34%   28C    P8     8W / 230W |   4077MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Quadro R...  On   | 00000000:1A:00.0 Off |                  Off |
| 34%   30C    P8    16W / 230W |   1173MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Quadro R...  On   | 00000000:67:00.0 Off |                  Off |
| 34%   30C    P8     7W / 230W |   1822MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Quadro R...  On   | 00000000:68:00.0 Off |                  Off |
| 33%   32C    P8    14W / 230W |   1830MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1680644      C   /opt/conda/bin/python            1231MiB |
|    0   N/A  N/A   2798996      C   tritonserver                     2836MiB |
|    1   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   2797762      C   /usr/bin/python3                 1165MiB |
|    2   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   2798996      C   tritonserver                     1812MiB |
|    3   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  9MiB |
|    3   N/A  N/A      1543      G   /usr/bin/gnome-shell                3MiB |
|    3   N/A  N/A   2798996      C   tritonserver                     1810MiB |
+-----------------------------------------------------------------------------+

@fabito

fabito commented May 11, 2021

As a workaround I've disabled both probes (liveness and readiness). The pod is no longer terminated/restarted, and Prometheus can now scrape the metrics.
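One way to do this by hand, as a sketch (note it edits the live DaemonSet, so a later helm upgrade will put the probes back):

$ kubectl edit daemonset.apps/dcgm-exporter
# in the editor, delete the livenessProbe: and readinessProbe: blocks
# under spec.template.spec.containers[0], then save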

Perhaps, in the /health endpoint, updateMetrics() should be invoked (at least once) before getMetrics()?

if s.getMetrics() == "" {
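A minimal sketch of that idea, assuming a handler shaped like the quoted check (the MetricsServer type, the metrics field, and the 503 response below are inferred stand-ins, not the exporter's verified source):

package main

import "net/http"

// MetricsServer is a stand-in for the exporter's server type; in the real
// code, updateMetrics/getMetrics would be backed by the DCGM pipeline.
type MetricsServer struct{ metrics string }

func (s *MetricsServer) getMetrics() string { return s.metrics }

func (s *MetricsServer) updateMetrics() {
	// In the real exporter this would run one collection pass; it is a
	// placeholder here so the sketch compiles.
}

// Health shows the proposed ordering: attempt a collection pass before
// failing the probe, so the first probe after startup doesn't race the
// pipeline and return 503.
func (s *MetricsServer) Health(w http.ResponseWriter, r *http.Request) {
	if s.getMetrics() == "" {
		s.updateMetrics()
	}
	if s.getMetrics() == "" {
		http.Error(w, "no metrics collected yet", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	s := &MetricsServer{}
	http.HandleFunc("/health", s.Health)
	http.ListenAndServe(":9400", nil)
}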

@Sanhajio

You can also override the livenessProbe:
$ kubectl edit daemonset.apps/dcgm-exporter

Set it to:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9400
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

The issue is that the livenessProbe's initialDelaySeconds is set to 5 seconds, which is not enough time for the process to start.
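If you prefer a non-interactive change, the same override can be applied as a JSON patch (this assumes the exporter is the first container in the pod spec and that your DaemonSet has the same name):

$ kubectl patch daemonset dcgm-exporter --type=json -p='[
    {"op": "replace",
     "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
     "value": 60}
  ]'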

@nuckydong

> You can also override the livenessProbe:
> $ kubectl edit daemonset.apps/dcgm-exporter
> […]
> The issue is that the livenessProbe's initialDelaySeconds is set to 5 seconds, which is not enough time for the process to start.

Thanks, it's working fine now.
