Daemonset pods fail with: "nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown" #511
Comments
Did you manually edit the docker config? It doesn't look right with the toolkit container running. It should have pointed to
Also, please verify that the file
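For reference, when the toolkit container manages Docker, /etc/docker/daemon.json usually ends up looking roughly like the sketch below; the exact runtime binary path is an assumption based on the default toolkit install directory, not output from this cluster.

    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }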
@shivamerla No, I haven't manually edited the docker config. Here is my
It doesn't look like it. I can try making both those edits.
I made the changes you suggested. I set the default docker runtime to
I then added this symlink. Thank you for the help!
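For anyone following along, the runtime change described here is typically the default-runtime setting in /etc/docker/daemon.json followed by a Docker restart; the comment does not include the actual symlink, so that part is omitted. A minimal sketch:

    # set "default-runtime": "nvidia" in /etc/docker/daemon.json, then:
    sudo systemctl restart docker
    # verify the default runtime took effect
    docker info | grep -i 'default runtime'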
@shivamerla thanks for the instructions. We were able to use this and successfully get the pods of the nvidia-gpu-operator set up. However, these configurations are set up 'incorrectly' by the nvidia-gpu-operator itself, and we do not have a way to automate fixing this configuration with the workaround above. What is your recommendation for people who want to automate this? (We are running Kubernetes clusters and want to use the nvidia-gpu-operator to set up the nodes for AI/ML workloads. One difference is that we run containerd instead of docker as the container runtime for k8s.)
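One way to automate this on containerd-based clusters is to pass the toolkit's containerd settings through the operator's Helm values, so the toolkit container writes the runtime entry itself. A sketch assuming the chart's toolkit.env interface; the file and socket paths below are placeholders and need to match your containerd install (microk8s uses its own snap paths):

    toolkit:
      env:
      - name: CONTAINERD_CONFIG
        value: /etc/containerd/config.toml        # adjust for your distro / microk8s
      - name: CONTAINERD_SOCKET
        value: /run/containerd/containerd.sock    # adjust for your distro / microk8s
      - name: CONTAINERD_RUNTIME_CLASS
        value: nvidia
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "true"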
@shivamerla It works fine after changing to /, but how can we make it permanent? When I try to reinstall I run into the same issue again. Please let me know where we need to make changes in the gpu-operator helm template to fix it permanently.
@lokanthak will debug this further. So after a few iterations, the toolkit config is set incorrectly, i.e.
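For context, the toolkit runtime config referred to here is a TOML file under the toolkit install directory. The snippet below is only an illustration of the stanza in question; the path and values are assumptions rather than output from this cluster.

    # e.g. /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml (path assumed)
    [nvidia-container-cli]
      # expected to be "/run/nvidia/driver" when the driver container manages the driver,
      # and "/" when the driver is preinstalled on the host
      root = "/run/nvidia/driver"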
Thanks @shivamerla. Yes, this issue occurs when we install the NVIDIA GPU drivers on the GPU node and then try to use the gpu-operator. The issue was fixed after we uninstalled those packages on the GPU worker node. As a workaround, we can restart the toolkit container when we run into this situation.
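A hedged sketch of that workaround (the namespace and label are assumptions and may differ between operator versions):

    # delete the toolkit pod so the daemonset recreates it and rewrites the runtime config
    kubectl -n gpu-operator delete pod -l app=nvidia-container-toolkit-daemonset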
1. Quick Debug Checklist
Are i2c_core and ipmi_msghandler loaded on the nodes? No, these are not installed.
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes. I can provide the full output if desired.
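For anyone working through the same checklist, the module question can be answered on each node with plain lsmod/modprobe, using the module names given above:

    # check whether the kernel modules are loaded
    lsmod | grep -E 'i2c_core|ipmi_msghandler'
    # load them if missing
    sudo modprobe i2c_core
    sudo modprobe ipmi_msghandler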
1. Issue or feature description
I have a two node microk8s cluster where each node has an RTX 2060 in it. On one node, everything resolves fine and all the daemonset containers stand up properly. On the other node, all pods except the gpu-operator-node-feature-discovery-worker fail to initialize with the message:
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
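That error means the NVIDIA container CLI on the node cannot find libnvidia-ml.so.1. A quick way to check where (or whether) the library is visible; the driver-container library path below is an assumption:

    # is the library in the host's linker cache?
    ldconfig -p | grep libnvidia-ml
    # is it present under the driver container's root? (path assumed)
    ls -l /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1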
The gpu-operator-node-feature-discovery-worker runs, but has this error repeatedly in its logs:
On the host OS, I am able to run nvidia-smi successfully. I am also able to run the container runtime test:
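The exact test command used is not shown; a common form of the container runtime test is something like the following (the image tag is only an example):

    docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi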
2. Steps to reproduce the issue
I installed the GPU operator through the microk8s command line utility, a la microk8s enable gpu. It installed version 22.9.0. I tried installing 22.9.2 but the offending node has the same issue. I rolled back because the currently functional node stopped working with the error Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.
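That last error means containerd on the functional node lost its "nvidia" runtime registration. For reference, the registration the toolkit normally writes into the containerd config looks roughly like the excerpt below; the binary path is an assumption based on the default toolkit install directory.

    # containerd config.toml excerpt (illustrative)
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"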
3. Information to attach (optional if deemed irrelevant)
GPU Operator Pods Status
gpu-feature-discovery-zbctb description
nvidia-device-plugin-daemonset-9j9tq description
nvidia-operator-validator-cx6fh description
I can add the last two if it helps, but they look the same to me.
cat /etc/docker/daemon.json
Docker runtime configuration: docker info | grep runtime (a one-liner for dumping the configured runtime paths is sketched after this list)
Runtimes: io.containerd.runtime.v1.linux nvidia runc io.containerd.runc.v2
NVIDIA shared directory:
ls -la /run/nvidia
ls -la /usr/local/nvidia/toolkit
ls -la /run/nvidia/driver
journalctl -u kubelet > kubelet.logs
There is nothing there.
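The configured runtimes and their binary paths mentioned above can also be dumped in one line (assuming a reasonably recent Docker CLI that supports --format on info):

    docker info --format '{{json .Runtimes}}'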
I appreciate any help anyone can give!