Add gpu uuids to node labels #1015
Comments
@xiongzubiao could you describe how you would want to use these labels? In general, the labels are intended to allow selection of specific nodes through node selectors or affinity. Do you have a use case that requires you to match nodes by UUID?
It is mainly for metering and diagnosis purposes. We'd like to monitor the usage and health status of each GPU. Having the UUIDs in node labels would help us search the data in Prometheus. We don't have a use case for selecting a particular GPU right now. I guess that could be useful when a node has multiple GPUs that are not exactly the same model?
@elezar Would you be interested if I submitted a PR? I figured out that it is not that difficult to expose the UUIDs by leveraging existing functions. The label would look like:
In my case, we need to schedule a pod onto a specific GPU. We keep a map that records all GPUs (by UUID) and assigns each GPU to a different job. We run the pod with the env below to make sure it uses the specified GPU.
However, if the pod is scheduled to a host that does not own the specified GPU, the pod will fail to run and return an error as follows:
Therefore, we must record the mapping between node names and GPU UUIDs. When running a pod, both the node name label and the GPU UUID environment variable have to be specified. It would be even better if the GPU UUID could be read directly from a node label.
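The bookkeeping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the environment variable is assumed to be `NVIDIA_VISIBLE_DEVICES` (the comment does not show which one is actually used), the node is pinned through the standard `kubernetes.io/hostname` label, and the inventory contents, UUID strings, image name, and helper names are all hypothetical.

```python
# Assumption: the NVIDIA container runtime reads this variable; the original
# comment does not show which variable is actually set.
GPU_ENV_VAR = "NVIDIA_VISIBLE_DEVICES"


def find_node_for_gpu(inventory, gpu_uuid):
    """Return the name of the node that owns gpu_uuid.

    inventory is a hypothetical, manually maintained mapping of
    node name -> list of GPU UUIDs on that node.
    """
    for node_name, uuids in inventory.items():
        if gpu_uuid in uuids:
            return node_name
    raise KeyError(f"no node in the inventory owns {gpu_uuid}")


def build_pod_manifest(name, image, inventory, gpu_uuid):
    """Build a Pod manifest pinned to the node that owns gpu_uuid."""
    node_name = find_node_for_gpu(inventory, gpu_uuid)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumes the hostname label value equals the node name,
            # which is the common case but not guaranteed.
            "nodeSelector": {"kubernetes.io/hostname": node_name},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "env": [{"name": GPU_ENV_VAR, "value": gpu_uuid}],
                }
            ],
        },
    }
```

The point of the sketch is that the inventory has to be kept in sync by hand; if the UUIDs were available as node labels, the lookup in `find_node_for_gpu` could be replaced by a query against the node objects.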
@xiongzubiao
@shan100github Thanks, I am aware of it. In my case I need to know the device UUIDs without querying the DCGM exporter or Prometheus. It would be best if they came from a node label, since the UUID is a property of the node.
I know it is easy to get the GPU UUIDs with nvidia-smi. It would be nice if gpu-feature-discovery exposed them as node labels, so that one doesn't need to SSH into the node.
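For reference, the manual lookup mentioned above can be scripted. This is a minimal sketch, assuming `nvidia-smi` is on the PATH of the node where it runs; the helper names `parse_gpu_uuids` and `query_gpu_uuids` are made up for illustration and are not part of gpu-feature-discovery.

```python
import subprocess


def parse_gpu_uuids(csv_text):
    """Parse 'index, uuid' lines of nvidia-smi CSV output into {index: uuid}."""
    uuids = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        index, uuid = (field.strip() for field in line.split(",", 1))
        uuids[int(index)] = uuid
    return uuids


def query_gpu_uuids():
    """Run nvidia-smi on the local node and return {gpu_index: uuid}."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_uuids(result.stdout)
```

Calling `query_gpu_uuids()` still requires a shell on the node (or a privileged pod), which is exactly the step a node label would make unnecessary.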