I am building a Go-based monitoring agent. The Docker image for it also installs DCGM, and the Go application initializes the DCGM client in StartHostEngine mode.
The agent image runs in a Kubernetes pod as a DaemonSet. Inside the pod, I get the error below:
error connecting to nv-hostengine: Host engine connection invalid/disconnected
Earlier, I ran a separate container on the node with the DCGM image nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubuntu22.04 and connected to it in Standalone mode, which worked fine.
1. I was able to run the agent in Embedded mode and eliminate the separate DCGM server container. But this broke my ability to SSH into the EC2 instance and run dcgmi test --inject commands to test error scenarios.
Is there a way to run dcgmi test against Embedded mode that would work for my setup? I also tried SSHing into the monitoring-agent Kubernetes pod and running it there, but that does not work either and I get the error below.
sh-4.2$ dcgmi test --inject --gpuid 0 -f 202 -v 99999
Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.
Just FYI, in this setup I do not get any errors from dcgm.Init(dcgm.Embedded).
2. I switched to dcgm.Init(dcgm.StartHostEngine), since StartHostEngine is the mode that starts nv-hostengine. I hoped it would both eliminate the server container and let me test using dcgmi. But currently I am facing init errors:
Error connecting to nv-hostengine: Host engine connection invalid/disconnected
The embedded host engine runs in the address space of your Go process and does not expose any endpoints to connect to. The dcgmi tool connects to the standalone nv-hostengine, which listens for connections on either a Unix domain socket or a TCP port. There is no way for dcgmi to connect to your embedded host engine.
It is important to note that only one instance of a host engine (embedded or standalone) should be used with a GPU on a given node. If two instances of the host engine address the same GPU, we cannot guarantee stable operation, whether running on bare metal or within a container.