
Unexpected GPU Allocation with NVIDIA_VISIBLE_DEVICES in Kubernetes #951

qiangyupei opened this issue Aug 25, 2024 · 5 comments

@qiangyupei

1. Quick Debug Information

  • Kubernetes Version: v1.28
  • GPU Operator Version: v24.6.1

2. Issue description

The Kubernetes cluster has two worker nodes, each with four A100 GPUs. During pod deployment, I use the NVIDIA_VISIBLE_DEVICES environment variable to specify which GPU to use (e.g., "3"), following the instructions in the link. However, when I run the kubectl exec -it [pod_name] -- nvidia-smi command, it sometimes shows only the specified GPU, but at other times it displays an additional GPU alongside the specified one. The following picture illustrates the result. This causes some trouble for me, and I'm wondering if there might be an issue.
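For reference, a minimal sketch of the kind of pod spec described above, assuming the env-var approach from the linked instructions (the pod name and image are illustrative, not taken from the report):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-test                # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "3"                    # expose only the GPU with index 3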

[screenshot: nvidia-smi output inside the pod showing an additional GPU alongside the specified one]

I deploy GPU Operator with the following command:

helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set mig.strategy=mixed \
    -f gpu-operator-values.yaml \
    --set dcgmExporter.config.name=custom-dcgm-metrics

All the GPU Operator pods are running:

[screenshot: kubectl output showing all GPU Operator pods in the Running state]
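(The pod listing in that screenshot would typically come from something like the following, with the namespace matching the helm install above.)

kubectl get pods -n gpu-operator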

@astranero

astranero commented Sep 30, 2024

I have a similar issue to this one. I was able to restrict how many GPUs are visible by enabling CDI. In addition, I had to apply the following optional setting:

(Optional) Set the default container runtime mode to CDI by modifying the cluster policy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'

I also had to remove 'default_runtime_name = "nvidia"' from the container runtime configuration.
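As a sketch of what that removal looks like, assuming containerd as the container runtime (section names vary between containerd versions, and this excerpt is not taken from the reporter's node):

# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd]
  # default_runtime_name = "nvidia"   <- line removed so that runc stays the default runtime

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"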

However, this did not fix all the issues I am facing. It still seems to schedule the same GPU to multiple pods (a GPU should only be allocated to one container at a time).
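For comparison, the device-plugin way to get an exclusively allocated GPU is to request it as an extended resource instead of setting NVIDIA_VISIBLE_DEVICES; a minimal container fragment (a sketch, not necessarily what was used here):

    resources:
      limits:
        nvidia.com/gpu: 1    # the device plugin picks a free GPU and allocates it exclusively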

@qiangyupei
Author

Thanks for your solution!
However, when I enable CDI, the NVIDIA_VISIBLE_DEVICES environment variable no longer seems to have any effect, and the scheduler randomly selects a GPU for the pod.
In my use case, I want to allocate a pod to a specific MIG partition, and I do not understand why an additional GPU is visible to the pod.

@cdesiniotis
Contributor

Using the NVIDIA_VISIBLE_DEVICES environment variable in the pod spec is not recommended, as it completely bypasses the Kubernetes device plugin API.

@qiangyupei can you check what the value of the NVIDIA_VISIBLE_DEVICES environment variable is in the container when additional / unexpected GPU devices are visible?
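For example, the value can be read from the running container with (the pod name is a placeholder):

kubectl exec -it <pod_name> -- env | grep NVIDIA_VISIBLE_DEVICES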

@astranero

astranero commented Oct 9, 2024 via email

@qiangyupei
Author

Hi, at the moment I don't have access to the A100 servers. However, if I recall correctly, I did check the value of the NVIDIA_VISIBLE_DEVICES environment variable inside the container, and it matched exactly what I had set in the deployment YAML file.
