The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
OS/Version: Ubuntu 20.04
Kernel Version: 5.15.0-73-generic
Container Runtime Type/Version: Docker
K8s Flavor/Version: RKE, v25.0.9
GPU Operator Version: v23.9.0 and v23.3.2
2. Issue or feature description
A few of the gpu-operator daemonsets are stuck in CrashLoopBackOff with the following error:
CrashLoopBackOff (back-off 5m0s restarting failed container=toolkit-validation pod=nvidia-operator-validator-sblpz_gpu-operator(6a789f5a-daf1-4576-8faf-aa8bbda36a3d)) | Last state: Terminated with 127: ContainerCannotRun (failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
We are able to work around this temporarily by editing /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml as discussed in #511, but I want to fix it at the operator level: we deploy across multiple nodes, so this manual procedure is not feasible.
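For reference, a minimal sketch of the kind of manual edit meant above, assuming the toolkit config points nvidia-container-cli at /run/nvidia/driver while the driver libraries actually live on the host root; the exact keys, and the workaround discussed in #511, may differ:

# Hypothetical node-level workaround sketch (not the operator-level fix we want)
CONFIG=/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# Inspect which driver root nvidia-container-cli is configured to use
grep -A6 '\[nvidia-container-cli\]' "$CONFIG"

# Point the CLI (and its ldconfig, if set) at the host root instead of /run/nvidia/driver
sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' "$CONFIG"
sudo sed -i 's|@/run/nvidia/driver/sbin/ldconfig.real|@/sbin/ldconfig.real|' "$CONFIG"

After such an edit the affected pods still have to be recreated by hand on every node, which is exactly why a per-node manual procedure does not scale for us.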
3. Steps to reproduce the issue
This is an intermittent issue that occurs while installing and reinstalling as part of cluster rebuilds; we run into it about 3 times out of 10 attempts.
4. Information to attach (optional if deemed irrelevant)
I think we have identified the issue. When you plan to use the gpu-operator in K8s for running GPU workloads, you should either follow the manual NVIDIA driver installation or let the gpu-operator manage the driver, not both. If we install the NVIDIA drivers on the GPU node and then run the gpu-operator on top of it, we end up in this situation: the toolkit daemonset configures /run/nvidia/driver as the driver root instead of the host's /, which can leave the daemonsets in an error state.
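To illustrate the two modes described above, here is a sketch assuming the standard gpu-operator Helm chart and its driver.enabled value (adjust chart names and values for your install method):

# Option A: let the operator manage the driver (no NVIDIA driver installed on the host)
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

# Option B: driver pre-installed on the host; tell the operator not to deploy its driver daemonset,
# so the toolkit is configured against the host driver instead of /run/nvidia/driver
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set driver.enabled=false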
I am having the same issue. I wanted to get clarification from @lokanthak on the previous answer: if using the gpu-operator, we should NOT install the NVIDIA driver on the host?
#kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-xb9mw 0/1 Init:0/1 0 98m
gpu-operator-1701415793-node-feature-discovery-gc-79bf6849hvm46 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-master-67fbjjs54 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-797p5 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-g4v6p 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-lrc29 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-x8bsr 1/1 Running 0 98m
gpu-operator-1701415793-node-feature-discovery-worker-z5qxz 1/1 Running 0 98m
gpu-operator-6d6965d759-fcd47 1/1 Running 0 98m
nvidia-container-toolkit-daemonset-prmxs 1/1 Running 0 98m
nvidia-dcgm-exporter-psxbt 0/1 Init:0/1 0 98m
nvidia-device-plugin-daemonset-5pj75 0/1 Init:0/1 0 98m
nvidia-operator-validator-sblpz 0/1 Init:CrashLoopBackOff 23 (5m2s ago) 98m
% kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 99m
gpu-operator-1701415793-node-feature-discovery-worker 5 5 5 5 5 99m
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 99m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 99m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 99m
nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 99m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 99m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 99m
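Note that nvidia-driver-daemonset shows DESIRED 0 above. Which of these daemonsets get scheduled is driven by the nvidia.com/gpu.deploy.* node labels, which can be checked with something like this sketch (node name taken from the describe output below):

# List the operator's gpu.deploy.* state labels on the GPU node
kubectl get node k8s-w1.123.example.com -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep gpu.deploy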
#kubectl describe pod -n gpu-operator nvidia-operator-validator-sblpz
Name: nvidia-operator-validator-sblpz
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: k8s-w1.123.example.com/10.0.0.5
Start Time: Fri, 01 Dec 2023 12:59:59 +0530
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=5b66558484
helm.sh/chart=gpu-operator-v23.9.0
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 5d56001f5f05fbc263b2dc893e335392f9e7a70f916012203038c997fb9d91f1
cni.projectcalico.org/podIP: 192.168.1.13/32
cni.projectcalico.org/podIPs: 192.168.1.13/32
Status: Pending
IP: 192.168.1.13
IPs:
IP: 192.168.1.13
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: docker://394a02f1579eb7b60f939aea1085c338be11f537780ccfa28243adeedea09c33
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 01 Dec 2023 13:00:05 +0530
Finished: Fri, 01 Dec 2023 13:00:08 +0530
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hkcdm (ro)
toolkit-validation:
Container ID: docker://e0185d7848e691d98c5b37839b5de967b4c4074d1e42cd160d4bdd64a7c94894
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: docker-pullable://nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Exit Code: 127
Started: Fri, 01 Dec 2023 14:38:08 +0530
Finished: Fri, 01 Dec 2023 14:38:08 +0530
Ready: False
Restart Count: 24
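Since the toolkit-validation init container fails to load libnvidia-ml.so.1 while driver-validation completes successfully, a quick node-level check can show where the library really is versus where the toolkit config points nvidia-container-cli (a sketch; library paths assume a typical Ubuntu host with a host-installed driver):

# Is there a containerized driver root at all?
ls /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 2>/dev/null || echo "no driver under /run/nvidia/driver"

# Is the library visible via the host's ldconfig cache (host-installed driver)?
ldconfig -p | grep libnvidia-ml.so.1

# Which root/ldconfig does the toolkit config hand to nvidia-container-cli?
grep -A6 '\[nvidia-container-cli\]' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml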