Occasional metric loss and hangs in DCGM Exporter #38

zlseu-edu opened this issue Jun 1, 2023 · 3 comments
zlseu-edu commented Jun 1, 2023

I have encountered a problem with DCGM Exporter where metrics occasionally go missing or the exporter hangs. The issue does not occur consistently; it happens intermittently, which makes monitoring and data analysis difficult.

Environment Information

  • DCGM Exporter version: 3.1.7-3.1.4
  • Operating system: Ubuntu 20.04
  • GPU model: NVIDIA A100-PCIE-80GB
  • Other relevant environment information: dcgm-exporter runs as a DaemonSet in Kubernetes.

Expected Behavior

I expected DCGM Exporter to consistently collect and export metric data according to its configuration, without occasional metric loss or hangs.

Actual Behavior

All GPU metrics suddenly hang.
[screenshot: GPU metrics hang]

dcgm-exporter lost the GPU 2 utilization metric.
[screenshot: DCGM metrics lost GPU 2]

There were no unusual logs from dcgm-exporter and no kernel issues at that point.
[screenshot: dcgm-exporter pod log]

nvidia-smi still shows correct statistics.

After restarting the dcgm-exporter pod, everything works fine again.

Guess
After reading some of the dcgm-exporter code, which calls go-dcgm to fetch GPU metrics, I suspect something is going wrong in that go-dcgm code path.
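
For reference, a minimal polling loop in the style of the go-dcgm samples looks roughly like the sketch below. This is not the exporter's actual code path, and exact signatures may differ between go-dcgm versions; it is only meant to illustrate where a blocking call into DCGM could stall every metric at once:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func main() {
	// Start an embedded hostengine. The Init/cleanup pattern follows the
	// go-dcgm samples; signatures may differ between library versions.
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatalf("dcgm init failed: %v", err)
	}
	defer cleanup()

	gpus, err := dcgm.GetSupportedDevices()
	if err != nil {
		log.Fatalf("listing GPUs failed: %v", err)
	}

	// Poll every GPU once per interval. If a call into DCGM blocks here,
	// every exported metric stalls at once, which matches the symptom.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, gpu := range gpus {
			status, err := dcgm.GetDeviceStatus(gpu)
			if err != nil {
				log.Printf("gpu %d: %v", gpu, err)
				continue
			}
			fmt.Printf("gpu %d: %+v\n", gpu, status)
		}
	}
}
```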

Please investigate this issue and provide support and guidance. Thank you!


bhperry commented Jun 21, 2023

I noted this as well, and found that my dcgm pod was being repeatedly killed by the liveness probe. When I removed that, it started getting OOM killed instead.

Mind boggling to me that it uses over 128MiB steady state (on my cluster, at least). Not worth that much overhead just to get GPU usage metrics.

zlseu-edu (Author) commented

> I noted this as well, and found that my dcgm pod was being repeatedly killed by the liveness probe. When I removed that, it started getting OOM killed instead.
>
> Mind boggling to me that it uses over 128MiB steady state (on my cluster, at least). Not worth that much overhead just to get GPU usage metrics.

It uses over 300MiB on my cluster. As a workaround, the Ops system on my cluster restarts the dcgm pod after GPU metrics have been hanging for 5 minutes.
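
A minimal standalone sketch of that kind of watchdog could look like the following. The endpoint URL (assuming the exporter's default port 9400), the DCGM_FI_DEV_GPU_UTIL staleness check, and the exit-based restart are placeholders, not our actual Ops tooling:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"strings"
	"time"
)

// metricsURL assumes the exporter's default port 9400 on the local node;
// adjust for your deployment.
const metricsURL = "http://localhost:9400/metrics"

// scrape fetches the metrics page and reports whether a per-GPU utilization
// sample (DCGM_FI_DEV_GPU_UTIL) is still being exported.
func scrape() (bool, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(metricsURL)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.Contains(string(body), "DCGM_FI_DEV_GPU_UTIL"), nil
}

func main() {
	stale := 0
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		ok, err := scrape()
		if err == nil && ok {
			stale = 0
			continue
		}
		stale++
		log.Printf("metrics missing or unreachable for %d minute(s): %v", stale, err)
		if stale >= 5 {
			// Hand off to the orchestration layer: exiting non-zero lets a
			// supervisor (or a kubectl-based job) recreate the dcgm pod.
			// The actual restart mechanism is environment-specific.
			os.Exit(1)
		}
	}
}
```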


bhperry commented Jun 27, 2023

> > I noted this as well, and found that my dcgm pod was being repeatedly killed by the liveness probe. When I removed that, it started getting OOM killed instead.
> >
> > Mind boggling to me that it uses over 128MiB steady state (on my cluster, at least). Not worth that much overhead just to get GPU usage metrics.
>
> It uses over 300MiB on my cluster. As a workaround, the Ops system on my cluster restarts the dcgm pod after GPU metrics have been hanging for 5 minutes.

Yikes. I believe it. Saw my usage steadily climbing the whole time it was up. We use dedicated nodes at my work (i.e. scheduled pods take up essentially the whole node) so sacrificing that much RAM for metrics is out of the question, even if it didn't require constant restarts.
