Unexpected GPU Allocation with NVIDIA_VISIBLE_DEVICES in Kubernetes #951
Comments
I have a similar issue to this one. I was able to restrict how many GPUs are shown by enabling CDI. Additionally, I had to set the optional settings, and I also had to remove 'default_runtime_name = "nvidia"' from the container runtime config. However, this did not fix all the issues I am facing: it sometimes schedules the same GPU to multiple pods, even though a GPU should only be allocated to a single container.
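For reference, enabling CDI in the GPU Operator is typically done through its ClusterPolicy (or the corresponding Helm values). A minimal sketch, assuming a GPU Operator release with CDI support; the field names should be checked against the ClusterPolicy CRD of the version actually deployed:

```yaml
# Sketch: enabling CDI in the GPU Operator's ClusterPolicy.
# Assumes a GPU Operator release with CDI support; verify field names
# against the ClusterPolicy CRD of the deployed version.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  cdi:
    enabled: true   # generate CDI specifications for the node's GPUs
    default: true   # use CDI as the default mechanism for injecting devices
```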
Thanks for your solution!
The usage of the NVIDIA_VISIBLE_DEVICES envvar in the pod spec is not recommended as it completely bypasses the Kubernetes device plugin API. @qiangyupei can you check what the value of the NVIDIA_VISIBLE_DEVICES environment variable is in the container when additional / unexpected GPU devices are visible?
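For contrast, the device-plugin path referred to here requests GPUs through the nvidia.com/gpu extended resource rather than setting the environment variable by hand. A minimal sketch; the pod name and container image are illustrative placeholders, not taken from this issue:

```yaml
# Sketch of the device-plugin path: request one GPU via the nvidia.com/gpu
# extended resource and let the plugin select and isolate the device.
# Pod name and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # the plugin exposes only the GPU it allocates
```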
I think the issue is with using 'nvidia.com' resource requests in this combination. If you only want to use NVIDIA_VISIBLE_DEVICES, then removing the nvidia.com requests should remove the additional GPUs, as it did for me, since I want to handle devices through the env var. However, this has the side effect that all unprivileged pods can see the PCI bus of the GPUs.
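A minimal sketch of the env-var-only approach described above, assuming the nvidia runtime is reachable through a RuntimeClass named nvidia (as created by the GPU Operator); the pod name and image are illustrative, and "3" is just an example GPU index:

```yaml
# Sketch of the env-var-only approach: no nvidia.com/gpu resource request;
# the GPU is selected via NVIDIA_VISIBLE_DEVICES instead.
# Pod name and image are illustrative; runtimeClassName may be unnecessary
# if the nvidia runtime is the cluster's default runtime.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-pod
spec:
  runtimeClassName: nvidia
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "3"   # GPU index to expose; bypasses the device plugin API
```

Because this bypasses the device plugin, the scheduler has no record of which GPUs are in use, so nothing prevents two pods from being pointed at the same device.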
Hi, at the moment I don't have access to the A100 servers. However, if I recall correctly, I did check the value of the NVIDIA_VISIBLE_DEVICES environment variable inside the container, and it matched exactly what I had set in the deployment YAML file.
1. Quick Debug Information
2. Issue description
The Kubernetes cluster has two worker nodes, each containing four A100 GPUs. During pod deployment, I use the NVIDIA_VISIBLE_DEVICES environment variable to specify which GPU to use (e.g., "3"), following the instructions in the link. However, when I run the
kubectl exec -it [pod_name] -- nvidia-smi
command, it sometimes shows only the specified GPU, but at other times it displays an additional GPU alongside the specified one. The following picture illustrates the result. This causes some trouble for me, and I'm wondering if there might be an issue. I deploy the GPU Operator with the following command:
All the GPU-operator pods are running well: