A100: The GPU operator will not install the mig-manager #651

lsyLearn · 2024-01-09T13:03:12Z

1. Quick Debug Information

OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
Kernel Version: 5.15.0-60-generic
Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.24.15
GPU Operator Version: v23.6.0
GPU: A100 PCIe 40GB

2. Issue or feature description

With the driver and NFD installed in advance, I tried to install gpu-operator in an A100 environment. The installation failed, and no mig-manager pod was deployed.
GPU:
root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
The gpu-operator pod info:

root@master1:~# kubectl get pod -n gpu-operator
NAME                                       READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-kht9c                1/1     Running                 1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7              1/1     Running                 1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5   1/1     Running                 1 (36m ago)      57m
nvidia-cuda-validator-tgqqk                0/1     Init:CrashLoopBackOff   1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                 1/1     Running                 1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7       0/1     CrashLoopBackOff        17 (4m47s ago)   57m
nvidia-operator-validator-7969z            0/1     Init:2/4                6 (2m59s ago)    57m

There is no nvidia-mig-manager pod.
The logs of the failing device plugin pod are as follows:

root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789       1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553       1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
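
The error above says that device 0 reports migEnabled=true but has an invalid MIG configuration. One state that could plausibly produce this (an assumption, not something confirmed in this thread) is a GPU with MIG mode enabled on which no GPU instances have been created. A minimal sketch for checking the MIG mode on the host follows; `report_mig_modes` is a hypothetical helper name, while `mig.mode.current` is a standard `nvidia-smi` query field.

```shell
# Hypothetical helper: read "index, mig-mode" CSV lines on stdin and
# print one line per GPU, flagging GPUs that have MIG mode enabled.
report_mig_modes() {
  while IFS=', ' read -r idx mode; do
    case "$mode" in
      Enabled) echo "GPU $idx: MIG mode is Enabled (verify that GPU instances exist)" ;;
      *)       echo "GPU $idx: MIG mode is $mode" ;;
    esac
  done
}

# Only query the driver when nvidia-smi is actually present on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | report_mig_modes
fi
```

If a GPU is reported as Enabled, `nvidia-smi mig -i 0 -lgi` lists the GPU instances on device 0, which shows whether any MIG layout has actually been applied.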

3. Steps to reproduce the issue

Install the k8s cluster.
Install NFD:

root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY   STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs   1/1     Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1     Running   11 (49m ago)   43d

Install gpu driver: Driver Version: 535.129.03
Install the operator: helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false
Check the gpu-operator pod.

cdesiniotis · 2024-01-25T21:43:51Z

Closing as this appears to be a duplicate of #652

cdesiniotis added the duplicate label Jan 25, 2024

cdesiniotis closed this as completed Jan 25, 2024
