
Unexpected GPU Allocation with NVIDIA_VISIBLE_DEVICES in Kubernetes #951

qiangyupei opened this issue Aug 25, 2024 · 5 comments

@qiangyupei

1. Quick Debug Information

  • Kubernetes Version: v1.28
  • GPU Operator Version: v24.6.1

2. Issue description

The Kubernetes cluster has two worker nodes, each with four A100 GPUs. During pod deployment, I use the NVIDIA_VISIBLE_DEVICES environment variable to specify which GPU to use (e.g., "3"), following the instructions in the link. However, when I run the kubectl exec -it [pod_name] -- nvidia-smi command, it sometimes shows only the specified GPU, but at other times it displays an additional GPU alongside the specified one. The following picture illustrates the result. This causes some trouble for me, and I'm wondering if there might be an issue.
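For reference, a minimal sketch of the kind of pod spec described above, assuming the env-var approach from the linked instructions (the pod name and image are illustrative, not taken from the report):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-test                # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "3"                    # expose only the GPU with index 3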

[screenshot: nvidia-smi output inside the pod showing an additional GPU alongside the specified one]

I deploy GPU Operator with the following command:

helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set mig.strategy=mixed \
    -f gpu-operator-values.yaml \
    --set dcgmExporter.config.name=custom-dcgm-metrics

All the GPU Operator pods are running:

[screenshot: kubectl output showing all GPU Operator pods in the Running state]
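(The pod listing in that screenshot would typically come from something like the following, with the namespace matching the helm install above.)

kubectl get pods -n gpu-operator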

@astranero

astranero commented Sep 30, 2024

I have a similar issue to this one. I was able to restrict how many GPUs are visible by enabling CDI. In addition, I had to apply the following optional setting:

(Optional) Set the default container runtime mode to CDI by modifying the cluster policy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'

I also had to remove 'default_runtime_name = "nvidia"' from the container runtime configuration.
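As a sketch of what that removal looks like, assuming containerd as the container runtime (section names vary between containerd versions, and this excerpt is not taken from the reporter's node):

# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd]
  # default_runtime_name = "nvidia"   <- line removed so that runc stays the default runtime

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"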

However, this did not fix all the issues I am facing. It still seems to schedule the same GPU to multiple pods (a GPU should only be allocated to one container at a time).
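For comparison, the device-plugin way to get an exclusively allocated GPU is to request it as an extended resource instead of setting NVIDIA_VISIBLE_DEVICES; a minimal container fragment (a sketch, not necessarily what was used here):

    resources:
      limits:
        nvidia.com/gpu: 1    # the device plugin picks a free GPU and allocates it exclusively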

@qiangyupei
Author

Thanks for your solution!
However, when I enable CDI, the NVIDIA_VISIBLE_DEVICES environment variable no longer seems to have any effect, and the scheduler randomly selects a GPU for the pod.
In my use case, I want to allocate a pod to a specific MIG partition, and I do not understand why an additional GPU is visible to the pod.

@cdesiniotis
Contributor

Using the NVIDIA_VISIBLE_DEVICES environment variable in the pod spec is not recommended, as it completely bypasses the Kubernetes device plugin API.

@qiangyupei can you check what the value of the NVIDIA_VISIBLE_DEVICES environment variable is in the container when additional / unexpected GPU devices are visible?
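For example, the value can be read from the running container with (the pod name is a placeholder):

kubectl exec -it <pod_name> -- env | grep NVIDIA_VISIBLE_DEVICES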

@astranero

astranero commented Oct 9, 2024 via email

@qiangyupei
Author

Hi, at the moment I don't have access to the A100 servers. However, if I recall correctly, I did check the value of the NVIDIA_VISIBLE_DEVICES environment variable inside the container, and it matched exactly what I had set in the deployment YAML file.
