Daemonset pods fail with: "nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown" #511

Closed
4 of 6 tasks
ianblitz opened this issue Mar 30, 2023 · 7 comments

Comments

@ianblitz

ianblitz commented Mar 30, 2023

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? No, I'm running on 2 Ubuntu 22.04 nodes.
  • Are you running Kubernetes v1.13+? Yes, both nodes are running 1.26.3
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker version 20.10.21
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? No, these are not installed
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces) Yes. I can provide the full output if desired

1. Issue or feature description

I have a two-node microk8s cluster where each node has an RTX 2060 in it. On one node, everything resolves fine and all the daemonset containers stand up properly. On the other node, all pods except the gpu-operator-node-feature-discovery-worker fail to initialize with the message:
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

The gpu-operator-node-feature-discovery-worker runs, but has this error repeatedly in its logs:

I0330 23:06:39.997158 1 nfd-worker.go:484] feature discovery completed
E0330 23:07:40.031447 1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/wlp0s20f3/speed: invalid argument

On the host OS, I am able to run nvidia-smi successfully:

Thu Mar 30 23:13:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8     2W /  N/A |      1MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I am also able to run the container runtime test:

$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi
Thu Mar 30 23:14:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8     2W /  N/A |      1MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. Steps to reproduce the issue

I installed the GPU operator through the microk8s command line utility (microk8s enable gpu). It installed version 22.9.0. I tried installing 22.9.2, but the offending node had the same issue. I rolled back because the currently functional node stopped working with the error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.

3. Information to attach (optional if deemed irrelevant)

GPU Operator Pods Status

$ kubectl get pods --namespace gpu-operator-resources
NAME                                                          READY   STATUS                  RESTARTS        AGE
gpu-operator-node-feature-discovery-worker-tdbt4              1/1     Running                 0               64m
gpu-operator-node-feature-discovery-worker-hlch9              1/1     Running                 0               64m
gpu-operator-567cf74d9d-pgd9z                                 1/1     Running                 0               64m
gpu-operator-node-feature-discovery-master-79bc547944-px2tk   1/1     Running                 0               64m
nvidia-container-toolkit-daemonset-kc5sw                      1/1     Running                 0               63m
nvidia-device-plugin-daemonset-2s922                          1/1     Running                 0               63m
gpu-feature-discovery-m5clr                                   1/1     Running                 0               63m
nvidia-cuda-validator-hwjgh                                   0/1     Completed               0               63m
nvidia-dcgm-exporter-qc5p2                                    1/1     Running                 0               63m
nvidia-device-plugin-validator-gk824                          0/1     Completed               0               63m
nvidia-operator-validator-r5l4n                               1/1     Running                 0               63m
gpu-feature-discovery-zbctb                                   0/1     Init:CrashLoopBackOff   9 (4m37s ago)   25m
nvidia-device-plugin-daemonset-9j9tq                          0/1     Init:CrashLoopBackOff   17 (2m7s ago)   63m
nvidia-operator-validator-cx6fh                               0/1     Init:CrashLoopBackOff   17 (104s ago)   63m
nvidia-dcgm-exporter-xg6f4                                    0/1     Init:CrashLoopBackOff   17 (93s ago)    63m
nvidia-container-toolkit-daemonset-jfb84                      0/1     Init:CrashLoopBackOff   17 (85s ago)    63m

gpu-feature-discovery-zbctb description

$ kubectl describe pod gpu-feature-discovery-zbctb --namespace gpu-operator-resources
Name:                 gpu-feature-discovery-zbctb
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-gpu-feature-discovery
Node:                 nuc01/192.168.1.200
Start Time:           Thu, 30 Mar 2023 17:50:42 -0500
Labels:               app=gpu-feature-discovery
                      app.kubernetes.io/part-of=nvidia-gpu
                      controller-revision-hash=6f64589b4d
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: a510e81ae9a937bde44846d3cd77653702f7f39a09458e3fbd9ae5ff28839edf
                      cni.projectcalico.org/podIP: 10.1.207.114/32
                      cni.projectcalico.org/podIPs: 10.1.207.114/32
Status:               Pending
IP:                   10.1.207.114
IPs:
  IP:           10.1.207.114
Controlled By:  DaemonSet/gpu-feature-discovery
Init Containers:
  toolkit-validation:
    Container ID:  containerd://f67ee38974228452ce610dfddbc66ed206ee16f3ffb33d7ad1ff12c28a2de6a4
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    StartError
      Message:   failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 18:00:00 -0600
      Finished:     Thu, 30 Mar 2023 18:16:58 -0500
    Ready:          False
    Restart Count:  10
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tn45c (ro)
Containers:
  gpu-feature-discovery:
    Container ID:
    Image:          nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      GFD_SLEEP_INTERVAL:          60s
      GFD_FAIL_ON_INIT_ERROR:      true
      GFD_MIG_STRATEGY:            single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /etc/kubernetes/node-feature-discovery/features.d from output-dir (rw)
      /sys/class/dmi/id/product_name from dmi-product-name (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tn45c (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  output-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/features.d
    HostPathType:
  dmi-product-name:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/class/dmi/id/product_name
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-tn45c:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.gpu-feature-discovery=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  3m52s (x118 over 28m)  kubelet  Back-off restarting failed container toolkit-validation in pod gpu-feature-discovery-zbctb_gpu-operator-resources(8cc741d1-82cb-4942-a8dc-0fe1ca82e65f)

nvidia-device-plugin-daemonset-9j9tq description

$ kubectl describe pod nvidia-device-plugin-daemonset-9j9tq --namespace gpu-operator-resources
Name:                 nvidia-device-plugin-daemonset-9j9tq
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-device-plugin
Node:                 nuc01/192.168.1.200
Start Time:           Thu, 30 Mar 2023 17:12:43 -0500
Labels:               app=nvidia-device-plugin-daemonset
                      controller-revision-hash=864db779c5
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 727e53cf0951725b92b55d37a0d5e8f6240b6c066627a3127ef43ba59be19376
                      cni.projectcalico.org/podIP: 10.1.207.111/32
                      cni.projectcalico.org/podIPs: 10.1.207.111/32
Status:               Pending
IP:                   10.1.207.111
IPs:
  IP:           10.1.207.111
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  containerd://4481396e4f10c98fc0a7e72386a8cacfb838bdd94f64214d0495e8450a169f55
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    StartError
      Message:   failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 18:00:00 -0600
      Finished:     Thu, 30 Mar 2023 18:19:26 -0500
    Ready:          False
    Restart Count:  18
    Environment:    <none>
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qsbr (ro)
Containers:
  nvidia-device-plugin:
    Container ID:
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      PASS_DEVICE_SPECS:           true
      FAIL_ON_INIT_ERROR:          true
      DEVICE_LIST_STRATEGY:        envvar
      DEVICE_ID_STRATEGY:          uuid
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
      MIG_STRATEGY:                single
      NVIDIA_MIG_MONITOR_DEVICES:  all
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qsbr (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  Directory
  kube-api-access-4qsbr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.device-plugin=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  2m52s (x301 over 68m)  kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-device-plugin-daemonset-9j9tq_gpu-operator-resources(65fd83fc-a5b7-4377-a6a3-742b70db1a79)

nvidia-operator-validator-cx6fh description

$ kubectl describe pod nvidia-operator-validator-cx6fh --namespace gpu-operator-resources
Name:                 nvidia-operator-validator-cx6fh
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 nuc01/192.168.1.200
Start Time:           Thu, 30 Mar 2023 17:12:43 -0500
Labels:               app=nvidia-operator-validator
                    app.kubernetes.io/part-of=gpu-operator
                    controller-revision-hash=64ddcb6bf9
                    pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 534c5eb67fd9c1f84e63e6069cdf4ac33461bbda8a610c85deab28985e6e6ba9
                    cni.projectcalico.org/podIP: 10.1.207.72/32
                    cni.projectcalico.org/podIPs: 10.1.207.72/32
Status:               Pending
IP:                   10.1.207.72
IPs:
IP:           10.1.207.72
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
  Container ID:  containerd://97965f016bb4ca3d1a1d2b358a0b3e57874756af5fae42ff043b0d9d1d1c7db8
  Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
  Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:6fe4200960b2b49d6dac1c91e596f61dacb6b3dcff878c84eb74c5136fedd5b6
  Port:          <none>
  Host Port:     <none>
  Command:
    sh
    -c
  Args:
    nvidia-validator
  State:       Waiting
    Reason:    CrashLoopBackOff
  Last State:  Terminated
    Reason:    StartError
    Message:   failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
    Exit Code:    128
    Started:      Wed, 31 Dec 1969 18:00:00 -0600
    Finished:     Thu, 30 Mar 2023 18:19:45 -0500
  Ready:          False
  Restart Count:  18
  Environment:
    WITH_WAIT:  true
    COMPONENT:  driver
  Mounts:
    /host from host-root (ro)
    /run/nvidia/driver from driver-install-path (rw)
    /run/nvidia/validations from run-nvidia-validations (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cv9gg (ro)
toolkit-validation:
  Container ID:
  Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
  Image ID:
  Port:          <none>
  Host Port:     <none>
  Command:
    sh
    -c
  Args:
    nvidia-validator
  State:          Waiting
    Reason:       PodInitializing
  Ready:          False
  Restart Count:  0
  Environment:
    WITH_WAIT:  false
    COMPONENT:  toolkit
  Mounts:
    /run/nvidia/validations from run-nvidia-validations (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cv9gg (ro)
cuda-validation:
  Container ID:
  Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
  Image ID:
  Port:          <none>
  Host Port:     <none>
  Command:
    sh
    -c
  Args:
    nvidia-validator
  State:          Waiting
    Reason:       PodInitializing
  Ready:          False
  Restart Count:  0
  Environment:
    WITH_WAIT:                    false
    COMPONENT:                    cuda
    NODE_NAME:                     (v1:spec.nodeName)
    OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
    VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    VALIDATOR_RUNTIME_CLASS:      nvidia
  Mounts:
    /run/nvidia/validations from run-nvidia-validations (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cv9gg (ro)
plugin-validation:
  Container ID:
  Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
  Image ID:
  Port:          <none>
  Host Port:     <none>
  Command:
    sh
    -c
  Args:
    nvidia-validator
  State:          Waiting
    Reason:       PodInitializing
  Ready:          False
  Restart Count:  0
  Environment:
    COMPONENT:                    plugin
    WITH_WAIT:                    false
    WITH_WORKLOAD:                true
    MIG_STRATEGY:                 single
    NODE_NAME:                     (v1:spec.nodeName)
    OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
    VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
    VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
    VALIDATOR_RUNTIME_CLASS:      nvidia
  Mounts:
    /run/nvidia/validations from run-nvidia-validations (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cv9gg (ro)
Containers:
nvidia-operator-validator:
  Container ID:
  Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
  Image ID:
  Port:          <none>
  Host Port:     <none>
  Command:
    sh
    -c
  Args:
    echo all validations are successful; sleep infinity
  State:          Waiting
    Reason:       PodInitializing
  Ready:          False
  Restart Count:  0
  Environment:    <none>
  Mounts:
    /run/nvidia/validations from run-nvidia-validations (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cv9gg (ro)
Conditions:
Type              Status
Initialized       False
Ready             False
ContainersReady   False
PodScheduled      True
Volumes:
run-nvidia-validations:
  Type:          HostPath (bare host directory volume)
  Path:          /run/nvidia/validations
  HostPathType:  DirectoryOrCreate
driver-install-path:
  Type:          HostPath (bare host directory volume)
  Path:          /run/nvidia/driver
  HostPathType:
host-root:
  Type:          HostPath (bare host directory volume)
  Path:          /
  HostPathType:
kube-api-access-cv9gg:
  Type:                    Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds:  3607
  ConfigMapName:           kube-root-ca.crt
  ConfigMapOptional:       <nil>
  DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                           node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                           node.kubernetes.io/not-ready:NoExecute op=Exists
                           node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                           node.kubernetes.io/unreachable:NoExecute op=Exists
                           node.kubernetes.io/unschedulable:NoSchedule op=Exists
                           nvidia.com/gpu:NoSchedule op=Exists
Events:
Type     Reason   Age                    From     Message
----     ------   ----                   ----     -------
Warning  BackOff  4m38s (x305 over 69m)  kubelet  Back-off restarting failed container driver-validation in pod nvidia-operator-validator-cx6fh_gpu-operator-resources(f0ef17bd-502d-4c52-b9bc-24620a3075a4)

I can add the last two if it helps, but they look the same to me.

  • [+] Docker configuration file: cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
  • [+] Docker runtime configuration: docker info | grep runtime
    Runtimes: io.containerd.runtime.v1.linux nvidia runc io.containerd.runc.v2

  • NVIDIA shared directory: ls -la /run/nvidia

$ ls -la /run/nvidia
total 0
drwxr-xr-x  4 root root   80 Mar 29 21:19 .
drwxr-xr-x 35 root root 1180 Mar 30 23:23 ..
drwxr-xr-x  2 root root   40 Mar 29 21:19 driver
drwxr-xr-x  2 root root   40 Mar 29 21:19 validations
  • [+] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
ls -la /usr/local/nvidia/toolkit
total 12920
drwxr-xr-x 3 root root    4096 Feb 22 03:17 .
drwxr-xr-x 3 root root    4096 Feb 22 03:17 ..
drwxr-xr-x 3 root root    4096 Feb 22 03:17 .config
lrwxrwxrwx 1 root root      32 Feb 22 03:17 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rw-r--r-- 1 root root 2959384 Feb 22 03:17 libnvidia-container-go.so.1.11.0
lrwxrwxrwx 1 root root      29 Feb 22 03:17 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root  195856 Feb 22 03:17 libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root     154 Feb 22 03:17 nvidia-container-cli
-rwxr-xr-x 1 root root   47472 Feb 22 03:17 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Feb 22 03:17 nvidia-container-runtime
-rwxr-xr-x 1 root root     350 Feb 22 03:17 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root     203 Feb 22 03:17 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2142088 Feb 22 03:17 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 3771792 Feb 22 03:17 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 4079040 Feb 22 03:17 nvidia-container-runtime.real
lrwxrwxrwx 1 root root      29 Feb 22 03:17 nvidia-container-toolkit -> nvidia-container-runtime-hook
  • [+] NVIDIA driver directory: ls -la /run/nvidia/driver
ls -la /run/nvidia/driver
total 0
drwxr-xr-x 2 root root 40 Mar 29 21:19 .
drwxr-xr-x 4 root root 80 Mar 29 21:19 ..
  • [+] kubelet logs journalctl -u kubelet > kubelet.logs
    There is nothing there.
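(Possibly just because microk8s bundles the kubelet into its own kubelite snap daemon rather than running a standalone kubelet unit; this is a guess, but something like the following should capture the equivalent logs:)

# Assumption: on microk8s the kubelet runs inside the kubelite snap service,
# so its logs live under that unit instead of "kubelet"
$ journalctl -u snap.microk8s.daemon-kubelite > kubelet.logs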

I appreciate help anyone can give!

@shivamerla
Contributor

Did you manually edit the docker config? It doesn't look right with the toolkit container running: it should point to /usr/local/nvidia/toolkit/nvidia-container-runtime instead, and also have default-runtime set to nvidia, as below.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Also, please verify that the file /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml has root parameter set to / in this case.
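For example, something along these lines should show both settings (paths as in the toolkit install above; the grep patterns are just a suggestion):

$ grep -E 'default-runtime|nvidia-container-runtime' /etc/docker/daemon.json
$ grep 'root =' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml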

@ianblitz
Author

@shivamerla No, I haven't manually edited the docker config.

Here is my /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml :

$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

It doesn't look like root currently points to /.

I can try making both those edits.

@ianblitz
Author

ianblitz commented Apr 3, 2023

I made the changes you suggested. I set the default docker runtime to nvidia, changed the nvidia runtime path to "/usr/local/nvidia/toolkit/nvidia-container-runtime", and set the container runtime config.toml root value to /. After making these changes and restarting the node, the error message I was getting changed to:

nvidia-container-cli.real: ldcache error: open failed: /run/nvidia/driver/sbin/ldconfig.real: no such file or directory: unknown

I then added this symlink: ln -s /sbin /run/nvidia/driver/sbin, and all the containers stood up properly!
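For anyone else who lands here, the full workaround on the broken node was roughly the following (reconstructed from memory; the sed is just one way to make the config.toml edit):

# 1. Edit /etc/docker/daemon.json by hand: set "default-runtime": "nvidia" and
#    point the nvidia runtime's "path" at /usr/local/nvidia/toolkit/nvidia-container-runtime
#    (see the snippet shivamerla posted above)

# 2. Point nvidia-container-cli at the host driver root instead of /run/nvidia/driver
$ sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' \
    /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# 3. Work around the ldcache error about /run/nvidia/driver/sbin/ldconfig.real
$ sudo ln -s /sbin /run/nvidia/driver/sbin

# 4. Restart Docker (or reboot the node) so the new runtime config takes effect
$ sudo systemctl restart docker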

Thank you for the help!

@ianblitz ianblitz closed this as completed Apr 3, 2023
@arunmk

arunmk commented Aug 2, 2023

@shivamerla thanks for the instructions. We were able to use this and successfully get the pods of the nvidia-gpu-operator set up. However, these configurations are set up 'incorrectly' by the nvidia-gpu-operator itself, and we have no way to automate the manual workaround above. What is your recommendation for people who want to automate this?

(We are running Kubernetes clusters and want to use the nvidia-gpu-operator to set up the nodes for AI/ML workloads. One difference is that we run containerd instead of Docker as the container runtime for k8s.)

@lokanthak

lokanthak commented Nov 30, 2023

@shivamerla It works fine after changing root to /, but how can we make it permanent? When I reinstall, I run into the same issue again. Please let me know where we need to make changes in the gpu-operator Helm template to fix it permanently.

@shivamerla
Contributor

@lokanthak I will debug this further. So after a few iterations, the toolkit config is set incorrectly, i.e. root is set to /run/nvidia/driver instead of /? And this happens when the driver is pre-installed on the node rather than installed by the driver container?

@lokanthak

Problem's enabling GPU - Workaround included canonical/microk8s-core-addons#251

Thanks @shivamerla. Yes, this issue occurs when we install the NVIDIA GPU drivers on the GPU node and then try to use the gpu-operator; it was fixed after we uninstalled those packages on the GPU worker node. As a workaround, we can restart the toolkit container when we run into this situation (see the command sketch below).
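For reference, restarting the toolkit daemonset can be done with something like this (namespace as used earlier in this thread; adjust to your install):

$ kubectl -n gpu-operator-resources rollout restart daemonset/nvidia-container-toolkit-daemonset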
