Error setting up dcgm with startHostEngine mode from a golang based container #66

Open
haardm opened this issue May 10, 2024 · 1 comment

haardm commented May 10, 2024

I am building a Go-based monitoring agent. The Docker image for it also installs DCGM, and my Go application initializes the DCGM client in StartHostengine mode.

The agent image runs in a Kubernetes pod deployed as a DaemonSet. Inside the pod, I get the error below:
error connecting to nv-hostengine: Host engine connection invalid/disconnected

Earlier, I ran a separate container on the node with the nvidia-dcgm image nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubuntu22.04 and connected to it in Standalone mode. That worked fine.

I was also able to run the agent in Embedded mode, which removed the need for the separate DCGM server container. But that broke my ability to ssh into the EC2 instance and run dcgmi test --inject commands to test error scenarios.

  1. Is there a way to run dcgmi test in Embedded mode that would work for my setup? I also tried ssh'ing into the monitoring-agent Kubernetes pod, but that does not work and I get the error below:
sh-4.2$ dcgmi test --inject --gpuid 0 -f 202 -v 99999
Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

For reference, dcgm.Init(dcgm.Embedded) returns no errors in this setup.

  2. I switched to dcgm.Init(dcgm.StartHostengine), since that is the mode that starts nv-hostengine, and I hoped it would let me drop the server container while still being able to test with dcgmi. But I am currently getting init errors:
Error connecting to nv-hostengine: Host engine connection invalid/disconnected

@nikkon-dev (Collaborator)

@haardm

The embedded host engine runs in the address space of your Go process and does not expose any endpoint to connect to. The dcgmi tool connects to a standalone nv-hostengine, which listens for connections on either a Unix domain socket or a TCP port. There is no way for dcgmi to connect to your embedded host engine.

It is important to note that only one instance of a host engine (embedded or standalone) should be used with a GPU on a given node. If two host engine instances address the same GPU, we cannot guarantee stable operation, whether running on bare metal or within a container.
