
Daemonset pods fail with: "nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown" #625

Open
lokanthak opened this issue Dec 1, 2023 · 3 comments

Comments

@lokanthak

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version: Ubuntu 20.04
  • Kernel Version: 5.15.0-73-generic
  • Container Runtime Type/Version: Docker
  • K8s Flavor/Version: RKE, v25.0.9
  • GPU Operator Version: v23.9.0 and v23.3.2

2. Issue or feature description

A few gpu-operator daemonsets are stuck in CrashLoopBackOff with the following error.
I was able to fix this temporarily by editing /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml as discussed in #511, but I want to fix it at the operator level: we are deploying on multiple nodes and this manual procedure is not feasible.

CrashLoopBackOff (back-off 5m0s restarting failed container=toolkit-validation pod=nvidia-operator-validator-sblpz_gpu-operator(6a789f5a-daf1-4576-8faf-aa8bbda36a3d)) | Last state: Terminated with 127: ContainerCannotRun (failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
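
For reference, the workaround from #511 amounts to pointing the nvidia-container-cli driver root at the host filesystem instead of the operator-managed path. A minimal sketch of the edit, assuming the default toolkit install location (the exact keys can differ between toolkit versions):

# /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
[nvidia-container-cli]
  # was: root = "/run/nvidia/driver"   (operator-managed driver root)
  root = "/"                           # driver libraries installed directly on the host

After the edit the affected pods have to be restarted, which is why doing this by hand on every node does not scale.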

3. Steps to reproduce the issue

This is an intermittent issue that occurs while installing and reinstalling as part of cluster rebuilds; we run into it roughly 3 times out of 10 attempts.

4. Information to attach (optional if deemed irrelevant)

#kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-xb9mw 0/1 Init:0/1 0 98m
gpu-operator-1701415793-node-feature-discovery-gc-79bf6849hvm46 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-master-67fbjjs54 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-797p5 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-g4v6p 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-lrc29 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-x8bsr 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-z5qxz 1/1 Running 0 98m
gpu-operator-6d6965d759-fcd47 1/1 Running 0 98m
nvidia-container-toolkit-daemonset-prmxs 1/1 Running 0 98m
nvidia-dcgm-exporter-psxbt 0/1 Init:0/1 0 98m
nvidia-device-plugin-daemonset-5pj75 0/1 Init:0/1 0 98m
nvidia-operator-validator-sblpz 0/1 Init:CrashLoopBackOff 23 (5m2s ago) 98m

% kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 99m
gpu-operator-1701415793-node-feature-discovery-worker 5 5 5 5 5 99m
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 99m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 99m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 99m
nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 99m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 99m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 99m

#kubectl describe pod -n gpu-operator nvidia-operator-validator-sblpz
Name: nvidia-operator-validator-sblpz
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: k8s-w1.123.example.com/10.0.0.5
Start Time: Fri, 01 Dec 2023 12:59:59 +0530
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=5b66558484
helm.sh/chart=gpu-operator-v23.9.0
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 5d56001f5f05fbc263b2dc893e335392f9e7a70f916012203038c997fb9d91f1
cni.projectcalico.org/podIP: 192.168.1.13/32
cni.projectcalico.org/podIPs: 192.168.1.13/32
Status: Pending
IP: 192.168.1.13
IPs:
IP: 192.168.1.13
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: docker://394a02f1579eb7b60f939aea1085c338be11f537780ccfa28243adeedea09c33
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 01 Dec 2023 13:00:05 +0530
Finished: Fri, 01 Dec 2023 13:00:08 +0530
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hkcdm (ro)
toolkit-validation:
Container ID: docker://e0185d7848e691d98c5b37839b5de967b4c4074d1e42cd160d4bdd64a7c94894
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Exit Code: 127
Started: Fri, 01 Dec 2023 14:38:08 +0530
Finished: Fri, 01 Dec 2023 14:38:08 +0530
Ready: False
Restart Count: 24
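
To narrow this down on the affected node, the following checks are useful (paths are assumptions based on a default toolkit install; adjust for your environment):

# Which driver root is the toolkit configured to use?
grep -A 10 '\[nvidia-container-cli\]' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# Is libnvidia-ml.so.1 visible under that root?
ls /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1   # operator-managed driver (assumed layout)
ldconfig -p | grep libnvidia-ml.so.1                               # host-installed driver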

@lokanthak
Author

I think we have identified the issue. When you plan to use the gpu-operator in Kubernetes to run GPU workloads, choose either a manual NVIDIA driver installation on the host or the operator-managed driver, not both. If we install the NVIDIA drivers on the GPU node and then deploy the GPU Operator on top of them, we run into this situation: the toolkit daemonset configures /run/nvidia/driver as the driver root on the host instead of /, which can leave the daemonsets in an error state.
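
If the driver is already installed on the host, the operator should be told not to manage it. A minimal sketch using the documented Helm value (release and namespace names are just examples; verify the flag against the chart version you deploy):

# Install the GPU Operator against a host-installed driver
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false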

@uptownben

I am having the same issue. I wanted to get clarification from @lokanthak on the previous answer: if using the gpu-operator, we should NOT install the nvidia driver on the host?

@ZYWNB666

I ran into the same problem and wanted to get clarification from @lokanthak on the previous answer. If we use the gpu-operator, should we not install the nvidia driver on the host?

sure
