GPU resources aren't made available after updating to newest intel-basekit packages #1940
Comments
Hi @Serverfrog and thanks for a detailed issue report! The logs seem to indicate the plugin is working correctly, with no errors that would cause things to fail. Installing user space libraries on the host shouldn't cause things to fail in containers. I assume things were working before you upgraded the packages? I would also make sure the Pod doesn't set a nodeSelector etc. that forces it onto a node without the resources. @eero-t any ideas? FYI, unless you also run GAS (GPU Aware Scheduling) in your cluster, there's not much benefit in enabling the resource manager in the GPU plugin.
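To illustrate the nodeSelector point, here is a minimal sketch (pod name, image and command are placeholders, not taken from this issue) of a Pod that requests one shared GPU slot and leaves node selection entirely to the scheduler:

```yaml
# Hypothetical example: a Pod requesting one gpu.intel.com/i915 slot.
# No nodeSelector/affinity is set, so it should only stay Pending if no
# node actually advertises the resource as allocatable.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                         # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: workload
      image: docker.io/library/busybox:latest  # placeholder image
      command: ["ls", "/dev/dri"]
      resources:
        limits:
          gpu.intel.com/i915: 1
```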
@Serverfrog Could you paste here:
(Most people are on holidays this week, so a definitive answer may have to wait until next week.)
@tkatila GAS was enabled by me as a test, after it was already not working, in case that was the issue. For example, it worked briefly for one pod using i915_monitoring, but only until that one pod was killed / the node restarted. @eero-t Sure! I'm thinking you mean the pod(s) that won't get the GPU, right?
The monitoring resource bypasses other GPU-related constraints. It's intended for monitoring all GPUs, not for using them.
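For clarity, a rough sketch (names and image are placeholders) of how a monitoring-only Pod would request that resource; it is distinct from the regular i915 resource used by workloads:

```yaml
# Illustrative only: a Pod requesting the per-node monitoring resource,
# which exposes all GPUs of the node to e.g. a metrics exporter.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-exporter                           # placeholder name
spec:
  containers:
    - name: exporter
      image: docker.io/library/busybox:latest  # placeholder for a real exporter image
      command: ["sleep", "infinity"]
      resources:
        limits:
          gpu.intel.com/i915_monitoring: 1
```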
The whole point of device plugins is NOT needing this (as it basically breaks security and is therefore disallowed in many clusters). It has an impact only when the container is successfully scheduled and actually running on the node, i.e. it's not related to this.
Note: neither GAS nor the GPU plugin supports heterogeneous GPU nodes [1], i.e. nodes with multiple types of GPUs. That's why the GPU labeler has labeled the node as having 2x Haswell iGPUs, although it actually has an iGPU & a dGPU. That does not explain this problem, but it would be better to disable the iGPU to avoid jobs intended for the dGPU ending up on the slow iGPU, which lacks a lot of dGPU features.
[1] The Intel DRA GPU driver supports such configs and does not need GAS, but you would need k8s v1.32 to use it, and its resource requests are a bit more complex to use: https://github.com/intel/intel-resource-drivers-for-kubernetes/
Ok, there should be enough GPU & millicore resources available, and you're not requesting GPU memory, so it should be fine... I guess that node is not e.g. tainted?
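For reference, this is the generic Kubernetes shape of a taint that would keep Pods without a matching toleration off a node regardless of its advertised GPU capacity (the key/value here are made up):

```yaml
# If the node spec contains something like this, Pods lacking a matching
# toleration will not be scheduled there, even with free i915 resources.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                # placeholder node name
spec:
  taints:
    - key: dedicated              # made-up taint key/value
      value: gpu
      effect: NoSchedule
```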
Tuomas, what's this? @Serverfrog does GPU scheduling work if you:
Weird... everything seems fine from the resource point of view. Can you drop GAS so that it doesn't interfere with the scheduling decisions? Make sure to remove the scheduler config part as well.
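As a rough sketch of what that removal refers to (this assumes GAS was wired in as a standard kube-scheduler HTTP extender; the exact endpoint and fields in your setup may differ), the extender entry to drop from the KubeSchedulerConfiguration would look roughly like this:

```yaml
# Assumed shape only: removing an extender entry like this (and restarting
# kube-scheduler) takes GAS out of the scheduling path.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "https://gas-service.default.svc/scheduler"   # illustrative endpoint
    filterVerb: filter
    enableHTTPS: true
    managedResources:
      - name: gpu.intel.com/i915
        ignoredByScheduler: false
```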
Probably this: https://github.com/onedr0p/intel-gpu-exporter
Exactly, and that's why I changed exactly that pod (it wasn't working before I tried i915_monitoring) to use the monitoring resource.
Yeah, I know. But I wanted to try whether I could get it to work with this as a kind of workaround.

I just disabled the iGPU (I hate old BIOSes, as you can only set the "Primary GPU", not "Disable iGPU"... if there is a primary, there can also be a secondary, and if my dGPU is the primary, then I would think the iGPU would not be disabled but act as the secondary one). I also removed the labels and annotations related to GAS, the resourceManager part and all millicores, and removed privileged mode, as those tests did not work.

It seems that disabling the iGPU kind of worked. It makes sense, especially since the labels said there were 2x the Haswell iGPU, which is blatantly wrong. That also explains why, in privileged mode, workloads were always on the iGPU (though it could also be that I couldn't configure ffmpeg through that interface correctly to use card0 instead of card1... why it would ever prefer card1 over card0).

But i915_monitoring still throws the same error. I think I will revert back to the normal resource, but later I'd like to use xpumanager to export the stats that way, and as far as I have read, for that I should/could use the i915_monitoring resource.

Edit: I think it was most likely the iGPU, as it was card0 and renderD128 before the reboot, and after disabling it card0 and renderD128 point at a different device, which most likely confused things.
As the GPU plugin officially only supports one type of GPU per node, the labeling rules do not work with multiple types of GPUs. For example, in the count label:
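The concrete label example seems to have been lost in the quote above, so purely as a hypothetical illustration (label keys and values below are made up, not read from the affected node): with two different GPU models on one node, a count-style label can only describe one of them, e.g.:

```yaml
# Hypothetical node labels; real label keys and device IDs will differ.
metadata:
  labels:
    gpu.intel.com/device-id.0300-0416.count: "2"   # made-up PCI class/device id
    gpu.intel.com/millicores: "2000"               # made-up shared-capacity value
```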
@Serverfrog, to summarize: your workload now works with the i915 resource?
Legacy media APIs are kind of stupid compared to compute & 3D APIs, see: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#issues-with-media-workloads-on-multi-gpu-setups
There's only a single monitoring resource per node. Make sure that no other pod is already consuming it.
The GPU plugin should not be confused by that, as it matches the device files as they are on the host. But your media application could be confused; see the above link for a helper script.
Exactly. Only one would be used, for the GPU exporter.
I can't really attest whether it was really the case (assuming the application honors the configuration, which it should) that either
And to double check, if you enable the monitoring and deploy a Pod with the i915_monitoring resource, the Pod won't get scheduled due to missing resources (=i915_monitoring)?
That's how the device plugin works. Cards on the host are mounted into the container without modifications: card1 -> card1, renderD128 -> renderD128, etc.
Describe the bug
After I updated to the newest intel-basekit packages available on Debian (2025.0.1-45), it is currently not possible for me to schedule pods onto GPUs because:
Allocate failed due to requested number of devices unavailable for gpu.intel.com/i915. Requested: 1, Available: 0, which is unexpected
System (please complete the following information):
Additional context
The Node itself has, in its status, capacity and allocatable numbers for gpu.intel.com/i915 matching the sharedDevNum I configured.
The intel-gpu-plugin Pod also sets them.
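To illustrate the two points above (numbers are placeholders, not the actual values from this node): with sharedDevNum configured, the node status should show matching capacity and allocatable counts, and the error quoted earlier means the scheduler sees the allocatable count as exhausted:

```yaml
# Illustrative node status fragment for a single-GPU node with sharedDevNum: 10.
# "Requested: 1, Available: 0" in the error means allocatable slots are all
# taken (or no longer reported), even though capacity may still be shown.
status:
  capacity:
    gpu.intel.com/i915: "10"
  allocatable:
    gpu.intel.com/i915: "10"
```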
Here is a log output with log level 5: