GPU product node label supports only one product type #1659
Comments
Hi @brgavino. The GPU plugin is supposed to be run on homogeneous nodes (only one type of GPU), as both cards appear as the same resource (gpu.intel.com/i915). Does your use case demand having different types of GPUs running on the same node?
It is common to have two types of GPUs mixed in usage when a node or a cluster is supposed to support different workloads. GAS should have a way to differentiate Flex 170 and Flex 140, as one may perform better or be more cost-efficient than the other. While the formal solution may take a long time, can we have a quick workaround?
If you are referring to GPU Aware Scheduling with GAS, its README states that it expects the cluster to be homogeneous.
One way that we have considered handling this is to present a stand-in, such as requesting
"For not-so-quick solutions, we could name the resource by product type." |
I'm not sure I understand. A resource is what is requested by the pod, and what the devices are associated with, whereas labels are associated only with nodes (not devices) and are used by the k8s scheduler for limiting the set of eligible nodes for running a given pod. Or did you mean annotating the pod for GAS?
Your thought process is valid, but that's not how GAS works. As it assumes the nodes to be homogeneous, the "memory.max" value is divided between the GPUs. For a node with Flex 170 and Flex 140 cards, it would consider each card to have ~9.3GB of memory: (16+6+6)/3. Since this problem seems to be recurring, with either multiple dGPUs or iGPU + dGPU, I have planned to implement an alternative resource naming method, where the user would be able to provide a "pci device id" to "resource name" mapping. Any GPUs on the mapping would get renamed, and the ones not on the mapping would be registered via the default name. It will have downsides: GAS would not support it, Pod specs would need adaptation, and maybe something else as well. Opinions are welcome.
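To make the proposal above concrete, here is a minimal sketch of what such a mapping could look like if it were delivered as a ConfigMap. This is purely illustrative: the plugin does not define this format today, and the ConfigMap name, namespace, keys, and layout are all assumptions.

```yaml
# Hypothetical sketch only -- the plugin has no such config format yet.
# It shows the idea of a "PCI device id" -> "resource name" mapping that
# keeps the gpu.intel.com namespace and renames only the i915 part.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-resource-name-mapping   # assumed name
  namespace: kube-system            # assumed namespace
data:
  mapping.yaml: |
    # PCI device id -> resource name suffix
    "56c0": "i915-flex170"   # Flex 170 -> gpu.intel.com/i915-flex170
    "56c1": "i915-flex140"   # Flex 140 -> gpu.intel.com/i915-flex140
    # devices not listed keep the default gpu.intel.com/i915 name
```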
Sounds good to me. While GAS could use the same PCI ID <-> resource name mapping (if it's specified e.g. in a shared/secured configMap) to know about the devices, adding support for per-node devices with different amounts of resources could be a very large effort. Probably better to handle that with the DRA driver. @uniemimu ?
Correct me if I got this wrong: the problem seems to be a limitation in the naming of Flex GPUs in NFD. It cannot tell apart two (or more) types of GPUs in one server node, so GAS cannot schedule the pod to use a specific type of GPU. Adding a resource name mapping seems a little bit of overkill in this case -- how many resources would be needed?
It's not about the labels. Let's say we had the labels correct in the sense you'd like them to be, and a node has two GPUs, Flex 140 and Flex 170. The pod spec would declare nodeAffinity to the desired label, but that only selects the node, not which GPU on that node the pod gets. GAS doesn't help either, because it only pre-selects the GPU to use by analyzing the extended resources (millicores, memory.max). GAS doesn't know if card0 is Flex 140 or Flex 170. Though, it would be possible to add that kind of support by extending the node labels etc.
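As an illustration of the point above (a sketch, not a spec taken from this issue): even if the pod pins itself to a node that carries a Flex 170 label, the container still requests the generic gpu.intel.com/i915 resource, so the kubelet is free to hand it either card on a mixed node. The image name is a placeholder.

```yaml
# Illustrative sketch: nodeAffinity selects the node, not the card.
# The generic gpu.intel.com/i915 request can be satisfied by either the
# Flex 140 or the Flex 170 on a mixed node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu.intel.com/device-id.0380-56c0.present   # Flex 170 present on node
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: my-gpu-image:latest   # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1    # generic resource, not product-specific
```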
It does look like the DRA driver would be a better fit in the long term. Based on the discussion here, it seems that DRA is the preferred path forward, and the device plugins for GPU are legacy and may not get such features / may not be worth implementing them in?
Right; dividing the memory.max up equally can't help steer workloads on heterogeneous hardware configs (current implementation), and there's no way to ask for device affinity at the pod level - or to mark cards to avoid per node via node selectors, taints, etc. So even if the labels were extended, GAS would need to support the label parsing logic to select the right resources based on the pod labels, for example. The high-level story here, though, is that "As a GPU workload, I need to pick the type of GPU I land on". Other questions on specific resource requests seem handled by DRA.
I believe DRA is the way forward, but it is still alpha in K8s 1.29 and requires a feature flag to work, so device plugins will still be around for quite some time. As your request is not alone, I am leaning towards adding the necessary logic to allow users to rename the GPU resources based on the GPU type. It might hit our next release (~April) depending on how much time I can steal from my other tasks.
Hi @tkatila, thanks for your explanations! Now I get the picture. To summarize, the affinity preference (gpu_type) at the pod level becomes useless after the pod is scheduled to a server node, because the GPU plugin or GAS cannot differentiate the gpu_type of each GPU device in the server. Regarding your solution -- a "pci device id" to "resource name" mapping -- do you mean something like a pci_device_id to "gpu_type" mapping?
Yes. The plugin would read a device's PCI ID and then, based on it, apply the default or a custom resource name to it. For example, one could rename an integrated TGL GPU as "gpu.intel.com/i915-tgl" or a Flex 170 (56c0) as "gpu.intel.com/i915-flex170". The editable part would be the "i915" postfix; we need to keep the namespace as it is. I think it would also make sense to provide example mappings for some of the cases.
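For illustration, a pod targeting the renamed resource might look like the sketch below. Note that gpu.intel.com/i915-flex170 is only the name proposed above, not a resource the plugin registers today, and the image is a placeholder.

```yaml
# Sketch under the proposed renaming scheme -- not a currently
# registered resource name.
apiVersion: v1
kind: Pod
metadata:
  name: flex170-workload
spec:
  containers:
  - name: app
    image: my-gpu-image:latest   # placeholder image
    resources:
      limits:
        gpu.intel.com/i915-flex170: 1   # proposed product-specific resource
```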
Yes. The proposal makes sense. Thanks!
Another couple of items that we may want to look at (if these should be tracked as feature requests elsewhere, please let me know):
AFAIK that's not going to change. These extended resources are intended to be used with GAS, and it is supposed to only work with homogeneous clusters and it will most likely stay like that.
Yep. That's because the NFD rule counts devices based on their PCI class (0380), and the name of the label is taken from the "first" PCI device on the list. The good thing is that it's dynamic, so it works most of the time and for all devices. If we wanted counts per GPU type, we'd have to add per-GPU (device id) rules. It's probably OK to add that for 0380 class devices as there is a limited number of them. For 0300 the list would be too long.
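As a sketch of what a per-device-id rule could look like: the rule below does not exist in platform-labeling-rules.yaml, and the rule name and label key are assumptions that loosely follow the label scheme suggested in the issue description.

```yaml
# Hypothetical NodeFeatureRule: adds a per-product presence label by
# matching a specific PCI device id instead of only the 0380 class.
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-flex170-rule   # assumed name
spec:
  rules:
  - name: "intel gpu flex170 presence"
    labels:
      "gpu.intel.com/product.Flex_170.present": "true"   # assumed label key
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["8086"]}
        class: {op: In, value: ["0380"]}
        device: {op: In, value: ["56c0"]}
```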
Describe the bug

gpu.intel.com/product only supports one type, such as 'Flex_140' or 'Flex_170'; in the case when both types of cards are installed, only 'Flex_140' will be written as the label value (due to rule order in https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/nfd/overlays/node-feature-rules/platform-labeling-rules.yaml). This is important when selecting for differing pod preferences, i.e. a pod may have better performance on Flex 170 than on Flex 140, so an affinity for scheduling onto nodes with Flex 170 is preferred. Other logic may handle the actual consumption of that resource.

To Reproduce

Steps to reproduce the behavior:
- Apply platform-labeling-rules.yaml on a node with both Flex 140 and Flex 170 cards installed
- gpu.intel.com/product=Flex_140 is in labels

Expected behavior
The values for gpu.intel.com/product should follow the other labelled values, such as gpu.intel.com/device-id.0380-56c0.present=true or gpu.intel.com/device-id.0380-56c1.present=true. At present, pods may select on these device-id labels, which are not transparent with respect to Intel product naming. Suggest something like gpu.intel.com/product.Flex_140.present=true etc., which could indicate the presence of any commonly understood Intel GPU product name.
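As a sketch of how a workload could consume the suggested label (the label key follows the scheme proposed above and does not exist today; the image is a placeholder):

```yaml
# Hypothetical: selects nodes that carry the proposed per-product label.
apiVersion: v1
kind: Pod
metadata:
  name: prefer-flex170
spec:
  nodeSelector:
    gpu.intel.com/product.Flex_170.present: "true"   # proposed label
  containers:
  - name: app
    image: my-gpu-image:latest   # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1
```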
System (please complete the following information):

Additional context
n/a