
GPU resources aren't made available after updating to newest intel-basekit packages #1940

Open
Serverfrog opened this issue Dec 23, 2024 · 11 comments


@Serverfrog

Serverfrog commented Dec 23, 2024

Describe the bug
After I updated to the newest intel-basekit packages available on Debian (2025.0.1-45), it is currently not possible for me to schedule pods to GPUs; they fail with Allocate failed due to requested number of devices unavailable for gpu.intel.com/i915. Requested: 1, Available: 0, which is unexpected.

System (please complete the following information):

  • OS version: Debian 12.8
  • Kernel version: Linux 6.8.12-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) x86_64 GNU/Linux
  • Device plugins version: intel/intel-gpu-plugin:0.31.1
  • Hardware info:
  • CPU: Intel(R) Core(TM) i7-4790
  • GPU: Intel Arc A770

Additional context
The node itself reports capacity and allocatable numbers for gpu.intel.com/i915 in its status, matching the sharedDevNum I configured.
The intel-gpu-plugin pod also sets them.
Here is the log output at log level 5:

I1223 03:27:06.949113       1 gpu_plugin.go:799] GPU device plugin started with none preferred allocation policy
I1223 03:27:06.949917       1 gpu_plugin_resource_manager.go:174] GPU device plugin resource manager enabled
I1223 03:27:06.950005       1 gpu_plugin_resource_manager.go:311] Requesting pods from kubelet (https://192.168.178.118:10250/pods)
W1223 03:27:06.959157       1 gpu_plugin_resource_manager.go:315] Failed to read pods from kubelet API: Get "https://192.168.178.118:10250/pods": tls: failed to verify certificate: x509: certificate signed by unknown authority
I1223 03:27:06.959191       1 gpu_plugin_resource_manager.go:180] Not using Kubelet API
I1223 03:27:06.959250       1 gpu_plugin.go:835] NFD feature file location: /etc/kubernetes/node-feature-discovery/features.d/intel-gpu-resources.txt
I1223 03:27:06.959282       1 gpu_plugin.go:518] GPU (i915/xe) resource share count = 120
I1223 03:27:06.959450       1 gpu_plugin.go:565] Not compatible device: card0-DP-1
I1223 03:27:06.959463       1 gpu_plugin.go:565] Not compatible device: card0-DP-2
I1223 03:27:06.959471       1 gpu_plugin.go:565] Not compatible device: card0-DP-3
I1223 03:27:06.959478       1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-2
I1223 03:27:06.959485       1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-3
I1223 03:27:06.959491       1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-4
I1223 03:27:06.959498       1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-5
I1223 03:27:06.959564       1 gpu_plugin.go:565] Not compatible device: card1-HDMI-A-1
I1223 03:27:06.959574       1 gpu_plugin.go:565] Not compatible device: card1-VGA-1
I1223 03:27:06.959579       1 gpu_plugin.go:565] Not compatible device: renderD128
I1223 03:27:06.959584       1 gpu_plugin.go:565] Not compatible device: renderD129
I1223 03:27:06.959583       1 labeler.go:480] Starting GPU labeler
I1223 03:27:06.959591       1 gpu_plugin.go:565] Not compatible device: version
I1223 03:27:06.959724       1 labeler.go:219] tile files found:[/sys/class/drm/card0/gt/gt0]
I1223 03:27:06.959792       1 gpu_plugin.go:636] Adding /dev/dri/card0 to GPU card0
I1223 03:27:06.959804       1 gpu_plugin.go:636] Adding /dev/dri/renderD129 to GPU card0
I1223 03:27:06.960346       1 gpu_plugin.go:726] For i915_monitoring/all, adding nodes: [{ContainerPath:/dev/dri/card0 HostPath:/dev/dri/card0 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {ContainerPath:/dev/dri/renderD129 HostPath:/dev/dri/renderD129 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}]
I1223 03:27:06.960483       1 labeler.go:219] tile files found:[/sys/class/drm/card1/gt/gt0]
I1223 03:27:06.960532       1 gpu_plugin.go:636] Adding /dev/dri/card1 to GPU card1
I1223 03:27:06.960546       1 gpu_plugin.go:636] Adding /dev/dri/renderD128 to GPU card1
I1223 03:27:06.960977       1 gpu_plugin.go:726] For i915_monitoring/all, adding nodes: [{ContainerPath:/dev/dri/card1 HostPath:/dev/dri/card1 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {ContainerPath:/dev/dri/renderD128 HostPath:/dev/dri/renderD128 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}]
I1223 03:27:07.047963       1 gpu_plugin.go:540] GPU scan update: 0->240 'i915' resources found
I1223 03:27:07.047976       1 gpu_plugin.go:540] GPU scan update: 0->1 'i915_monitoring' resources found
I1223 03:27:07.047993       1 labeler.go:495] Ext resources scanning
I1223 03:27:07.048130       1 labeler.go:122] Not compatible devicecard0-DP-1
I1223 03:27:07.048140       1 labeler.go:122] Not compatible devicecard0-DP-2
I1223 03:27:07.048146       1 labeler.go:122] Not compatible devicecard0-DP-3
I1223 03:27:07.048153       1 labeler.go:122] Not compatible devicecard0-HDMI-A-2
I1223 03:27:07.048159       1 labeler.go:122] Not compatible devicecard0-HDMI-A-3
I1223 03:27:07.048165       1 labeler.go:122] Not compatible devicecard0-HDMI-A-4
I1223 03:27:07.048171       1 labeler.go:122] Not compatible devicecard0-HDMI-A-5
I1223 03:27:07.048250       1 labeler.go:122] Not compatible devicecard1-HDMI-A-1
I1223 03:27:07.048259       1 labeler.go:122] Not compatible devicecard1-VGA-1
I1223 03:27:07.048264       1 labeler.go:122] Not compatible devicerenderD128
I1223 03:27:07.048270       1 labeler.go:122] Not compatible devicerenderD129
I1223 03:27:07.048276       1 labeler.go:122] Not compatible deviceversion
I1223 03:27:07.048442       1 labeler.go:219] tile files found:[/sys/class/drm/card0/gt/gt0]
W1223 03:27:07.048482       1 labeler.go:176] Can't read file: open /sys/class/drm/card0/lmem_total_bytes: no such file or directory
I1223 03:27:07.048693       1 labeler.go:219] tile files found:[/sys/class/drm/card1/gt/gt0]
W1223 03:27:07.048728       1 labeler.go:176] Can't read file: open /sys/class/drm/card1/lmem_total_bytes: no such file or directory
I1223 03:27:07.048797       1 labeler.go:505] Writing labels
I1223 03:27:07.048013       1 manager.go:115] Received dev updates:{map[i ..  (shortened due to maximum characters) ... ]}
I1223 03:27:08.148626       1 server.go:285] Start server for i915 at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915.sock
I1223 03:27:08.148633       1 server.go:285] Start server for i915_monitoring at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
I1223 03:27:08.159803       1 server.go:128] Started ListAndWatch fori915
I1223 03:27:08.159822       1 server.go:117] Sending to kubelet[]
I1223 03:27:08.159909       1 server.go:303] Device plugin for i915 registered
I1223 03:27:08.160059       1 server.go:117] Sending to kubelet[&Device{ID:card1-33,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-46,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-62,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-72,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-51,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-116,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-41,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-63,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-115,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-61,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-31,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-75,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-33,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-39,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-73,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-108,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-77,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-101,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-19,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-29,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-22,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-71,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-106,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-40,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-62,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-80,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-83,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-17,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-96,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-110,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-30,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-111,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-113,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-14,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-59,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-78,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-95,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-44,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-109,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-100,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-81,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-26,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-68,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-119,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-10,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-24,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-42,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-105,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-46,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-53,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-117,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-10,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-51,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-114,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-30,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-54,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-111,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-49,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-72,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-49,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-119,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-98,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-13,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-99,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-84,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-89,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-91,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-16,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-84,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-89,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-45,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-64,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-27,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-96,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-117,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-17,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-94,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-103,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-95,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-45,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card1-54,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-55,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-70,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-43,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-58,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-78,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-91,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-32,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-57,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-61,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-74,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-109,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-22,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-56,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-43,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-56,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-25,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-19,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-21,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-40,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-35,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-93,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-69,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-38,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-71,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-79,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-85,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-15,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-87,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-101,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-23,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-47,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-53,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-65,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-118,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-12,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-31,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-76,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-98,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-65,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-42,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-50,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-20,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-67,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-108,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-118,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-8,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-81,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-9,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-74,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-93,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-8,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-59,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-63,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-82,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-104,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-14,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-24,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-48,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-67,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-64,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-90,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-102,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-104,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-50,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-102,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-38,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-94,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-52,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-37,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-86,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-69,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-87,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-66,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-75,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-107,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-83,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-28,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-36,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-11,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-57,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-82,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-58,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-60,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-100,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-32,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-86,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-88,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-79,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-26,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-44,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-52,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-97,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-97,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-16,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-80,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-73,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-9,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-23,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-55,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-116,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-34,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-112,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-92,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-68,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-48,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-66,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-70,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-13,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-47,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-18,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-112,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-12,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-37,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-115,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-18,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-34,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-90,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-21,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-113,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card1-103,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-85,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-36,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-29,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-35,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-76,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-105,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-107,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-41,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-106,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-15,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-28,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-20,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-114,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-25,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-60,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-110,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-77,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-88,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-11,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-39,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-27,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-92,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-99,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},}]
I1223 03:27:08.248821       1 server.go:303] Device plugin for i915_monitoring registered
I1223 03:27:08.248987       1 server.go:128] Started ListAndWatch fori915_monitoring
I1223 03:27:08.248997       1 server.go:117] Sending to kubelet[]
I1223 03:27:08.249032       1 server.go:117] Sending to kubelet[&Device{ID:all,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},}]
I1223 03:27:11.949496       1 gpu_plugin.go:565] Not compatible device: card0-DP-1
@tkatila
Contributor

tkatila commented Dec 23, 2024

Hi @Serverfrog, and thanks for the detailed issue report! The logs seem to indicate the plugin is working correctly, with no errors that would cause things to fail.

Installing user-space libraries on the host shouldn't cause things to fail in containers, unless intel-basekit also installs some misbehaving kernel drivers. But in your case the plugin detects the GPUs correctly.

I assume things were working before you upgraded the packages? I would also make sure the Pod doesn't set a nodeSelector etc. that forces the Pod to a node without the resources.

@eero-t any ideas?

FYI, unless you also run GAS (GPU Aware Scheduling) in your cluster, there's not much benefit in enabling the resource manager in the GPU plugin.
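
For reference, the relevant plugin container args without the resource manager would look roughly like this (a sketch based on the plugin's command-line flags; if you deploy via the Helm chart or operator, the same options are exposed differently):

  containers:
    - name: intel-gpu-plugin
      image: intel/intel-gpu-plugin:0.31.1
      args:
        - "-shared-dev-num=120"   # keep your existing share count
        - "-enable-monitoring"    # only if you need the i915_monitoring resource
        # no "-resource-manager" flag, so fractional resource management stays off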

@eero-t
Contributor

eero-t commented Dec 23, 2024

@Serverfrog Could you paste here:

  • pod spec, at least the following sections: nodeName/nodeSelector, securityContext (both for pod & container), resources?
  • node k8s GPU info: kubectl describe node YOUR_NODE_NAME | grep gpu
  • node GPU files info: ls -l /dev/dri/ && head /sys/class/drm/card[0-9]/device/uevent

(Most people are on holidays this week, so a definitive answer may have to wait until next week.)

@Serverfrog
Author

Serverfrog commented Dec 23, 2024

@tkatila I enabled GAS as a test after things were already not working, to see if that might be the cause. For example, it briefly worked for one pod using i915_monitoring, but only until that pod was killed / the node was restarted.

@eero-t Sure! I assume you mean the pod(s) that won't get the GPU, right?


...
  nodeName: proxfrog2
  securityContext: {}
  containers:
    - name: app
    ...
      resources:
        limits:
          gpu.intel.com/i915: '1'
          gpu.intel.com/millicores: '10'
          memory: 1536Mi
        requests:
          cpu: 100m
          gpu.intel.com/i915: '1'
          gpu.intel.com/millicores: '10'
          memory: 512Mi
      securityContext:
        privileged: true

privileged: true was added afterwards to test whether that could be it, but that also didn't work.

❯ kubectl describe node proxfrog2 | grep gpu
                    gas-prefer-gpu=card0
                    gpu.intel.com/device-id.0300-0412.count=2
                    gpu.intel.com/device-id.0300-0412.present=true
                    gpu.intel.com/device-id.0300-56a0.present=true
                    gpu.intel.com/family=A_Series
                    intel.feature.node.kubernetes.io/gpu=true
                    nfd.node.kubernetes.io/extended-resources: gpu.intel.com/memory.max,gpu.intel.com/millicores,gpu.intel.com/tiles
                      gpu.intel.com/device-id.0300-0412.count,gpu.intel.com/device-id.0300-0412.present,gpu.intel.com/device-id.0300-56a0.present,gpu.intel.com/...
  gpu.intel.com/i915:             240
  gpu.intel.com/i915_monitoring:  1
  gpu.intel.com/memory.max:       0
  gpu.intel.com/millicores:       2k
  gpu.intel.com/tiles:            2
  gpu.intel.com/i915:             240
  gpu.intel.com/i915_monitoring:  1
  gpu.intel.com/memory.max:       0
  gpu.intel.com/millicores:       2k
  gpu.intel.com/tiles:            2
  kube-system                 intel-gpu-exporter-wwf72                                  100m (1%)     0 (0%)      100Mi (0%)       500Mi (1%)     6h31m
  kube-system                 intel-gpu-plugin-intel-gpu-plugin-wmkkd                   40m (0%)      100m (1%)   45Mi (0%)        90Mi (0%)      6h34m
  gpu.intel.com/i915             3                 3
  gpu.intel.com/i915_monitoring  1                 1
  gpu.intel.com/memory.max       0                 0
  gpu.intel.com/millicores       40                40
  gpu.intel.com/tiles            0                 0

 ⚡ root@proxfrog2  ~  ls -l /dev/dri/ && head /sys/class/drm/card[0-9]/device/uevent
total 0
drwxr-xr-x 2 root root        120 Dec 22 17:34 by-path
crw-rw---- 1 root video  226,   0 Dec 22 17:34 card0
crw-rw---- 1 root video  226,   1 Dec 22 17:34 card1
crw-rw---- 1 root render 226, 128 Dec 22 17:34 renderD128
crw-rw---- 1 root render 226, 129 Dec 22 17:34 renderD129
==> /sys/class/drm/card0/device/uevent <==
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:56A0
PCI_SUBSYS_ID=172F:3937
PCI_SLOT_NAME=0000:03:00.0
MODALIAS=pci:v00008086d000056A0sv0000172Fsd00003937bc03sc00i00

==> /sys/class/drm/card1/device/uevent <==
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:0412
PCI_SUBSYS_ID=1043:8534
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d00000412sv00001043sd00008534bc03sc00i00


⚡ root@proxfrog2  ~  tree /dev/dri
/dev/dri
├── by-path
│   ├── pci-0000:00:02.0-card -> ../card1
│   ├── pci-0000:00:02.0-render -> ../renderD128
│   ├── pci-0000:03:00.0-card -> ../card0
│   └── pci-0000:03:00.0-render -> ../renderD129
├── card0
├── card1
├── renderD128
└── renderD129

@eero-t
Contributor

eero-t commented Dec 23, 2024

@tkatila I enabled GAS as a test after things were already not working, to see if that might be the cause. For example, it briefly worked for one pod using i915_monitoring, but only until that pod was killed / the node was restarted.

Monitoring resource bypasses other GPU related constraints. It's intended for monitoring all GPUs, not for using them.

privileged: true was added afterwards to test whether that could be it, but that also didn't work.

The whole point of device plugins is NOT needing this (it basically breaks security and is therefore disallowed in many clusters). It has an effect only once the container is successfully scheduled and actually running on the node, i.e. it is not related to this.

                gpu.intel.com/device-id.0300-0412.count=2
                gpu.intel.com/device-id.0300-0412.present=true
                gpu.intel.com/device-id.0300-56a0.present=true

Note: neither GAS nor the GPU plugin supports heterogeneous GPU nodes [1], i.e. ones with multiple types of GPUs. That's why the GPU labeler has labeled the node as having 2x Haswell iGPUs, although it actually has an iGPU & a dGPU.

That does not explain this problem, but it would be better to disable the iGPU to avoid jobs intended for the dGPU ending up on the slow iGPU, which lacks a lot of dGPU features.

[1] The Intel DRA GPU driver supports such configs and does not need GAS, but you would need k8s v1.32 to use it, and its resource requests are a bit more complex: https://github.com/intel/intel-resource-drivers-for-kubernetes/
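
For a rough idea, a DRA request looks something like this (a sketch against the k8s v1.32 resource.k8s.io API; the device class name below is illustrative, check the driver's docs for the real one):

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.intel.com   # illustrative; use the class installed by the driver

The pod then references the claim instead of an extended resource:

spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: single-gpu
  containers:
    - name: app
      resources:
        claims:
          - name: gpu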

gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2
gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2

OK, there should be enough GPU & millicore resources available, and you're not requesting GPU memory, so it should be fine...

I guess that node is not e.g. tainted?
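
A quick way to check (an empty result means no taints):

  kubectl describe node proxfrog2 | grep -i taint
  # or:
  kubectl get node proxfrog2 -o jsonpath='{.spec.taints}'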

kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m

Tuomas, what's this?

@Serverfrog does GPU scheduling work if you:

  • drop GAS and disable resource management support from the plugin, and/or
  • disable the iGPU?

@tkatila
Contributor

tkatila commented Dec 23, 2024

Weird... everything seems fine from a resource point of view.

Can you drop GAS so that it doesn't interfere with the scheduling decisions? Make sure to remove the scheduler config part in /etc/kubernetes/manifests/kube-scheduler.yaml. Before removing GAS, you could check whether there's anything funny in GAS' logs.
edit: as noted by Eero, also remove resource management from the GPU plugin.
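
For orientation, the GAS part of the scheduler config is its extender entry; it looks roughly like this (a sketch, the exact field values depend on your GAS install):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "https://<gas-service-address>"   # GAS extender; remove this whole entry
    filterVerb: filter
    bindVerb: bind
    enableHTTPS: true

Removing the extenders entry (and restarting kube-scheduler) takes GAS out of the scheduling path.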

@tkatila
Contributor

tkatila commented Dec 23, 2024

kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m

Tuomas, what's this?

Probably this: https://github.com/onedr0p/intel-gpu-exporter

@Serverfrog
Author

Serverfrog commented Dec 23, 2024

Monitoring resource bypasses other GPU related constraints. It's intended for monitoring all GPUs, not for using them.

kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m
Tuomas, what's this?
Probably this: https://github.com/onedr0p/intel-gpu-exporter

Exactly, and that's why I switched exactly that pod to the monitoring resource (it wasn't working before I tried i915_monitoring).

privileged: true was added afterwards to test whether that could be it, but that also didn't work.

The whole point of device plugins is NOT needing this (it basically breaks security and is therefore disallowed in many clusters). It has an effect only once the container is successfully scheduled and actually running on the node, i.e. it is not related to this.

Yeah, I know. But I wanted to try whether I could get it to work that way, as a kind of workaround.

I just disabled the iGPU (I hate old BIOSes, as you can only set the "Primary GPU", not "Disable iGPU"... if there is a primary, there can also be a secondary, and if my dGPU is the primary, then I would think the iGPU would not be disabled but would become the secondary one...).

I also removed the labels and annotations related to GAS, the resourceManager part, and every millicores request. I also removed the privileged flag, as those tests didn't work.

It seems that disabling the iGPU kind of worked. It makes sense, especially since the labels said there were 2x the Haswell iGPU, which is blatantly wrong. That would also explain why, in privileged mode, workloads always ended up on the iGPU (though it could also be that I couldn't configure ffmpeg through that interface correctly to use card0 instead of card1... why ever it would prefer card1 over card0).

But i915_monitoring still throws the same error. I think I will revert that pod back to the normal resource, but later I wanted to use xpumanager to export the stats that way, and as far as I've read, for that I should/could use the i915_monitoring resource.

Edit: I think it was most likely the iGPU, as it was card0 and renderD128 before the reboot but card1 and renderD128 after it, which most likely confused things.

@tkatila
Contributor

tkatila commented Dec 30, 2024

As the GPU plugin officially only supports one type of GPU per node, the labeling rules do not work with multiple types of GPUs. For example, in the count label:

gpu.intel.com/device-id.0300-0412.count=2

The 0300-0412 part is taken from the first (I think) PCI device it processes. The rules do not create multiple entries, as that would multiply the number of rules needed (or require a custom labeling binary). So even though the labels indicate that there are two 0412 devices, it's actually the 0412 + the 56a0. The label name itself is just wrong.

@Serverfrog to summarize: your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?

@eero-t
Contributor

eero-t commented Dec 30, 2024

That would also explain why, in privileged mode, workloads always ended up on the iGPU (though it could also be that I couldn't configure ffmpeg through that interface correctly to use card0 instead of card1... why ever it would prefer card1 over card0).

Legacy media APIs are kind of stupid compared to compute & 3D APIs, see: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#issues-with-media-workloads-on-multi-gpu-setups

But i915_monitoring still throws the same error. I think I will revert that pod back to the normal resource, but later I wanted to use xpumanager to export the stats that way, and as far as I've read, for that I should/could use the i915_monitoring resource.

There's only a single monitoring resource per node. Make sure that no other pod is already consuming it.
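
One way to list the pods holding it, assuming jq is available (extended resources are always set in limits, so checking limits is enough):

  kubectl get pods --all-namespaces -o json \
    | jq -r '.items[]
        | select(any(.spec.containers[];
            .resources.limits["gpu.intel.com/i915_monitoring"] // false))
        | .metadata.namespace + "/" + .metadata.name'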

Edit: I think it was most likely the iGPU, as it was card0 and renderD128 before the reboot but card1 and renderD128 after it, which most likely confused things.

GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.

But your media application could be confused; see the above link for a helper script.

@Serverfrog
Author

@Serverfrog to summarize: your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?

exactly.

Only one would be used, for the GPU exporter.

GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.

I can't really attest whether it was really the case (i.e. whether the application honors the configuration, but it should) that the card and renderD devices both pointed at the iGPU, for whatever reason.
I was especially confused because I thought the device plugin would just mount the devices directly from the host, i.e. if I only wanted to use the host's card1 device, I would also only have a card1 inside the container, without the card0, even if it exists.

@tkatila
Contributor

tkatila commented Jan 6, 2025

@Serverfrog to summarize: your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?

exactly.

Only one would be used, for the GPU exporter.

And to double-check: if you enable monitoring and deploy a Pod with the i915_monitoring resource, the Pod won't get scheduled due to missing resources (=i915_monitoring)?
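
A minimal Pod like this should be enough to test it (a sketch; the image is just a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: monitoring-test
spec:
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "ls -l /dev/dri && sleep 3600"]
      resources:
        limits:
          gpu.intel.com/i915_monitoring: 1

kubectl describe pod monitoring-test then shows either the scheduling/allocation failure event or, if it runs, the container should see all of the node's GPU device files.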

GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.

I can't really attest whether it was really the case (i.e. whether the application honors the configuration, but it should) that the card and renderD devices both pointed at the iGPU, for whatever reason. I was especially confused because I thought the device plugin would just mount the devices directly from the host, i.e. if I only wanted to use the host's card1 device, I would also only have a card1 inside the container, without the card0, even if it exists.

That's how the device plugin works. Devices on the host are mounted into the container without modification: card1 -> card1, renderD128 -> renderD128, etc.
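
You can verify this from any running pod that requested gpu.intel.com/i915, e.g.:

  kubectl exec <your-pod> -- ls -l /dev/dri

The device names and major:minor numbers should match the host's ls -l /dev/dri output for the allocated card.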
