Specifying Specific GPU Models for Pods in Nodes with Multiple GPU Types #656
Comments
@anencore94 there is unfortunately no supported way of accomplishing this today with the device plugin API. Dynamic Resource Allocation, a new API for requesting and allocating resources in Kubernetes, would allow us to naturally support such configurations, but it is currently an alpha feature.
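To give a sense of the direction that comment points at, here is a rough, illustrative sketch of a DRA-style request. It assumes a cluster with the DRA feature gates enabled and an NVIDIA DRA driver publishing a `gpu.nvidia.com` device class; the API version, attribute key, and product string below are assumptions, not a stable or confirmed API:

```yaml
# Illustrative sketch only: DRA is still evolving, so the exact group/version,
# device class name, and attribute names are assumptions and vary by release.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-a100
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com     # assumed: class published by an NVIDIA DRA driver
      selectors:
      - cel:
          # product string is a placeholder; use the value reported by your driver
          expression: "device.attributes['gpu.nvidia.com'].productName == 'NVIDIA-A100-SXM4-40GB'"
```

A pod would then reference such a claim through its `resourceClaims` field rather than `resources.limits`, which is what makes per-model selection expressible in this API.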
@cdesiniotis Thanks for sharing :). You mean implementing this feature on top of the Dynamic Resource Allocation API will take quite a long time, I guess.
I was able to pick the GPU by specifying the NVIDIA_VISIBLE_DEVICES environment variable, where the value is the zero-indexed number of my GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-openai
  namespace: training
spec:
  runtimeClassName: nvidia
  containers:
  - name: vllm-openai
    image: "vllm/vllm-openai:latest"
    args: ["--model", "Qwen/Qwen1.5-14B-Chat"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES   # added: pins the container to GPU index 0
      value: "0"
    resources:
      limits:
        nvidia.com/gpu: 1
```

These other vars may also work, but I have not tested them: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html
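As a side note on that approach: the container toolkit documentation linked above also accepts GPU UUIDs in NVIDIA_VISIBLE_DEVICES, which is less ambiguous than an index if device ordering changes. A minimal variant of the env block, with a placeholder UUID:

```yaml
# Variant sketch: select by GPU UUID instead of index. The UUID below is a
# placeholder; obtain the real one with `nvidia-smi -L` on the node.
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
```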
@laszlocph Thanks for your case! However, I'd like to control it in a k8s-native way. 🥲
I do this via a nodeSelector: I can use the gpu.product label as the selector to ensure the pod lands on the intended GPU type.
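A minimal sketch of what that looks like, assuming GPU Feature Discovery (as deployed by the GPU Operator) has labeled the nodes with `nvidia.com/gpu.product`; the product string is an example value and should be taken from `kubectl get nodes --show-labels`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-openai-a100
spec:
  nodeSelector:
    # Example value: use the exact label value reported by your nodes.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
  - name: vllm-openai
    image: "vllm/vllm-openai:latest"
    resources:
      limits:
        nvidia.com/gpu: 1
```

Note that this steers the pod onto a node with the desired GPU type; on a node that mixes GPU models it does not by itself control which of that node's GPUs the device plugin allocates.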
2. Issue or feature description
I am currently working with a Kubernetes cluster where some nodes are equipped with multiple types of NVIDIA GPUs. For example, Node A has one A100 GPU and one V100 GPU. In such a setup, I am looking for a way to specify which GPU model should be allocated when a user creates a GPU-allocated pod.
From my understanding, in such cases we would typically request a GPU in our pod specifications using resources.limits with nvidia.com/gpu: 1. However, this approach doesn't seem to provide a way to distinguish between different GPU models. Is there a feature or method within the NVIDIA GPU Operator or Kubernetes ecosystem that allows for such specific GPU model selection during pod creation? If not, are there any best practices or recommended approaches to ensure a pod is scheduled with a specific type of GPU when multiple models are present on the same node?
Thank you for your time and assistance.