Error setting up dcgm with startHostEngine mode from a golang based container #66

haardm · 2024-05-10T21:22:57Z

I am building a Go-based monitoring agent. The Docker image that packages it also installs DCGM, and the Go application initializes the DCGM client in StartHostengine mode.

The agent image runs in a Kubernetes pod deployed as a DaemonSet. Inside the pod, I get the following error:
error connecting to nv-hostengine: Host engine connection invalid/disconnected

Earlier, I ran the DCGM image nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubuntu22.04 in a separate container on the node and connected to it in Standalone mode, which worked fine.

1. I was able to run the agent in Embedded mode and eliminate the separate DCGM server container. But this broke my ability to ssh into the EC2 instance and run dcgmi test --inject commands to test error scenarios.

Is there a way to run dcgmi test against Embedded mode that could work for my setup? I also tried ssh'ing into the monitoring-agent Kubernetes pod and running it there, but that fails with the error below.

sh-4.2$ dcgmi test --inject --gpuid 0 -f 202 -v 99999
Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

For reference, in this setup dcgm.Init(dcgm.Embedded) itself returns no errors.
2. I switched to dcgm.Init(dcgm.StartHostengine), since that is the mode that starts nv-hostengine. I hoped it would let me drop the server container and still test with dcgmi. However, initialization currently fails with:
Error connecting to nv-hostengine: Host engine connection invalid/disconnected


nikkon-dev · 2024-05-11T06:15:23Z

@haardm

The embedded host engine runs inside the address space of your Go process and does not expose any endpoint to connect to. The dcgmi tool connects to a standalone nv-hostengine, which listens for connections on either a Unix domain socket or a TCP port. There is no way for dcgmi to connect to your embedded host engine.

It is important to note that only one host engine instance (embedded or standalone) should be used with a GPU on a given node. If two host engine instances address the same GPU, we cannot guarantee stable operation, whether running on bare metal or within a container.

haardm mentioned this issue May 10, 2024

Error setting up dcgm with startHostEngine mode from a golang based container NVIDIA/DCGM#168

Closed
