A100: The GPU operator will not install the mig-manager #651

Closed
lsyLearn opened this issue Jan 9, 2024 · 1 comment

lsyLearn commented Jan 9, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-60-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.24.15
  • GPU Operator Version: v23.6.0
  • GPU: A100 PCIe 40GB

2. Issue or feature description

With the driver and NFD installed in advance, I tried to install gpu-operator on an A100 node. The installation failed, and no mig-manager pod was deployed.
GPU:
root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
The gpu-operator pod info:

root@master1:~# kubectl get pod -n gpu-operator
NAME                                       READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-kht9c                1/1     Running                 1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7              1/1     Running                 1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5   1/1     Running                 1 (36m ago)      57m
nvidia-cuda-validator-tgqqk                0/1     Init:CrashLoopBackOff   1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                 1/1     Running                 1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7       0/1     CrashLoopBackOff        17 (4m47s ago)   57m
nvidia-operator-validator-7969z            0/1     Init:2/4                6 (2m59s ago)    57m

There is no nvidia-mig-manager pod.
The logs of the failing device plugin pod are as follows:

root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789       1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553       1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
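
A possible first check for the error above, offered as a sketch rather than a confirmed diagnosis: the message suggests that GPU 0 has MIG mode enabled but its MIG devices are not in a state the device plugin accepts. That reading is an assumption; nothing in this thread confirms it. The helper name `mig_enabled_gpus` is hypothetical. It only filters the CSV that `nvidia-smi --query-gpu` prints, so it can be tried on saved output as well as on the node.

```shell
# Hypothetical helper: print the index of every GPU whose current MIG mode
# is "Enabled". Expects rows of the form "index, current, pending" on stdin,
# i.e. the output of:
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader
mig_enabled_gpus() {
  awk -F', *' '$2 == "Enabled" { print $1 }'
}

# Usage on the GPU node (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
#     --format=csv,noheader | mig_enabled_gpus
#
# For each index printed, list the GPU instances that exist on that GPU:
#   nvidia-smi mig -lgi -i <index>
```

If a GPU is reported as MIG-enabled and the second command lists no GPU instances, that would be consistent with the assumption above, but it does not explain why the mig-manager pod is missing.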

3. Steps to reproduce the issue

  • Install a Kubernetes cluster.
  • Install NFD:
root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY   STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs   1/1     Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1     Running   11 (49m ago)   43d
  • Install the GPU driver (nvidia-smi reports Driver Version: 535.129.03).
  • Install the operator: helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false
  • Check the gpu-operator pods.
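
The last step can be made repeatable with a small filter over the `kubectl get pod` table. This is a minimal sketch: the function names `unhealthy_pods` and `has_pod` are hypothetical, and the column layout (NAME, READY, STATUS) is assumed to match the default output shown earlier in this report.

```shell
# Hypothetical helper: read a `kubectl get pod` table on stdin and print
# "NAME STATUS" for each pod that is neither Completed nor fully ready and Running.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Completed" {
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1, $3
  }'
}

# Hypothetical helper: succeed if any pod name in the table starts with the prefix $1.
has_pod() {
  awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 { found = 1 }
                      END { exit !found }'
}

# Usage against a live cluster:
#   kubectl get pod -n gpu-operator | unhealthy_pods
#   kubectl get pod -n gpu-operator | has_pod nvidia-mig-manager || echo "mig-manager missing"
```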

cdesiniotis (Contributor) commented

Closing as this appears to be a duplicate of #652
