From eac635e43978eaa8a9a0a821281603e256bd7eae Mon Sep 17 00:00:00 2001
From: Tuomas Katila
Date: Fri, 16 Sep 2022 15:24:46 +0300
Subject: [PATCH 1/2] gpu: fix documentation links

Signed-off-by: Tuomas Katila
---
 cmd/gpu_nfdhook/README.md | 2 +-
 cmd/gpu_plugin/README.md  | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/cmd/gpu_nfdhook/README.md b/cmd/gpu_nfdhook/README.md
index 9cc558d85..2735f8470 100644
--- a/cmd/gpu_nfdhook/README.md
+++ b/cmd/gpu_nfdhook/README.md
@@ -40,7 +40,7 @@ Following labels are created by default. You may turn numeric labels into extend
 name | type | description|
 -----|------|------|
 |`gpu.intel.com/millicores`| number | node GPU count * 1000. Can be used as a finer grained shared execution fraction.
-|`gpu.intel.com/memory.max`| number | sum of detected [GPU memory amounts](#GPU-memory) in bytes OR environment variable value * GPU count
+|`gpu.intel.com/memory.max`| number | sum of detected [GPU memory amounts](#gpu-memory) in bytes OR environment variable value * GPU count
 |`gpu.intel.com/cards`| string | list of card names separated by '`.`'. The names match host `card*`-folders under `/sys/class/drm/`. Deprecated, use `gpu-numbers`.
 |`gpu.intel.com/gpu-numbers`| string | list of numbers separated by '`.`'. The numbers correspond to device file numbers for the primary nodes of given GPUs in kernel DRI subsystem, listed as `/dev/dri/card<num>` in devfs, and `/sys/class/drm/card<num>` in sysfs.
 |`gpu.intel.com/tiles`| number | sum of all detected GPU tiles in the system.
diff --git a/cmd/gpu_plugin/README.md b/cmd/gpu_plugin/README.md
index 255b168e8..9dbc42ea6 100644
--- a/cmd/gpu_plugin/README.md
+++ b/cmd/gpu_plugin/README.md
@@ -6,7 +6,10 @@ Table of Contents
 * [Modes and Configuration Options](#modes-and-configuration-options)
 * [Installation](#installation)
     * [Pre-built Images](#pre-built-images)
-    * [Fractional Resources](#fractional-resources)
+    * [Install to all nodes](#install-to-all-nodes)
+    * [Install to nodes with Intel GPUs with NFD](#install-to-nodes-with-intel-gpus-with-nfd)
+    * [Install to nodes with Intel GPUs with Fractional resources](#install-to-nodes-with-intel-gpus-with-fractional-resources)
+        * [Fractional resources details](#fractional-resources-details)
     * [Verify Plugin Registration](#verify-plugin-registration)
 * [Testing and Demos](#testing-and-demos)
 * [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)

From 9b3ee06cb19086436888b06f2d0792eb5c139aee Mon Sep 17 00:00:00 2001
From: Eero Tamminen
Date: Fri, 9 Sep 2022 17:10:07 +0300
Subject: [PATCH 2/2] Add GPU plugin README prerequisites section

Signed-off-by: Eero Tamminen
---
 cmd/gpu_plugin/README.md | 105 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 100 insertions(+), 5 deletions(-)

diff --git a/cmd/gpu_plugin/README.md b/cmd/gpu_plugin/README.md
index 9dbc42ea6..5ff09e224 100644
--- a/cmd/gpu_plugin/README.md
+++ b/cmd/gpu_plugin/README.md
@@ -5,6 +5,11 @@ Table of Contents
 * [Introduction](#introduction)
 * [Modes and Configuration Options](#modes-and-configuration-options)
 * [Installation](#installation)
+    * [Prerequisites](#prerequisites)
+        * [Drivers for discrete GPUs](#drivers-for-discrete-gpus)
+            * [Kernel driver](#kernel-driver)
+            * [User-space drivers](#user-space-drivers)
+        * [Drivers for older (integrated) GPUs](#drivers-for-older-integrated-gpus)
     * [Pre-built Images](#pre-built-images)
     * [Install to all nodes](#install-to-all-nodes)
     * [Install to nodes with Intel GPUs with NFD](#install-to-nodes-with-intel-gpus-with-nfd)
@@ -19,7 +24,8 @@ Table of Contents
 ## Introduction
 
 Intel GPU plugin facilitates Kubernetes workload offloading by providing access to
-discrete (including Intel® Data Center GPU Flex Series) and integrated Intel GPU device files.
+discrete (including Intel® Data Center GPU Flex Series) and integrated Intel GPU devices
+supported by the host kernel.
 
 Use cases include, but are not limited to:
 - Media transcode
@@ -50,6 +56,95 @@ The following sections detail how to obtain, build, deploy and test the GPU devi
 Examples are provided showing how to deploy the plugin either using a DaemonSet or
 by hand on a per-node basis.
 
+### Prerequisites
+
+Access to a GPU device requires firmware, kernel and user-space
+drivers that support it. The firmware and kernel driver must be on the
+host; user-space drivers go in the GPU workload containers.
+
+Intel GPU devices supported by the current kernel can be listed with:
+```
+$ grep i915 /sys/class/drm/card?/device/uevent
+/sys/class/drm/card0/device/uevent:DRIVER=i915
+/sys/class/drm/card1/device/uevent:DRIVER=i915
+```
+
+#### Drivers for discrete GPUs
+
+##### Kernel driver
+
+For now, the kernel needs to be built from sources. Later there will
+also be pre-built kernels and/or DKMS GPU module distro packages for
+the enterprise / long-term-support kernels.
+
+While the last 5.x upstream Linux kernel releases already had preliminary
+discrete Intel GPU support, kernel v6.x or newer should be used.
+
+In upstream kernels, discrete GPU support needs to be enabled with the
+`i915.force_probe=<PCI_ID>` kernel command line option until the relevant
+kernel driver features have been completed upstream:
+https://www.kernel.org/doc/html/latest/gpu/rfc/index.html
+
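+For example, on a GRUB-based distro the option could be enabled roughly
+as follows (illustrative only; `56c1` is a placeholder PCI ID, see below
+for how to list the actual IDs on your host):
+```
+# In /etc/default/grub, add the option to the kernel command line:
+#   GRUB_CMDLINE_LINUX="i915.force_probe=56c1"
+# then regenerate the GRUB config and reboot. (On distros without
+# update-grub, use e.g. grub2-mkconfig -o /boot/grub2/grub.cfg.)
+$ sudo update-grub && sudo reboot
+```
+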
+PCI IDs for the Intel GPUs on a given host can be listed with:
+```
+$ lspci | grep -e VGA -e Display | grep Intel
+88:00.0 Display controller: Intel Corporation Device 56c1 (rev 05)
+8d:00.0 Display controller: Intel Corporation Device 56c1 (rev 05)
+```
+
+(`lspci` lists GPUs with display support as "VGA compatible controller",
+and server GPUs without display support as "Display controller".)
+
+Mesa "Iris" 3D driver header provides a mapping between GPU PCI IDs and their Intel brand names:
+https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/include/pci_ids/iris_pci_ids.h
+
+If your kernel build does not find the correct firmware version for
+a given GPU from the host (see `dmesg | grep i915` output), the latest
+firmware versions are available upstream:
+https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
+
+##### User-space drivers
+
+Until new enough user-space drivers (supporting also discrete GPUs)
+are available directly from distribution package repositories, they
+can be installed to containers from Intel package repositories. See:
+https://dgpu-docs.intel.com/installation-guides/index.html
+
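+As an illustration, a container image could get the user-space drivers
+with something like the following (package names are from Ubuntu 22.04
+repositories and are enough only for GPUs the distro already supports;
+for discrete GPUs, substitute the Intel repository setup from the link
+above):
+```
+# OpenCL ICD loader + Intel compute runtime, VA-API media driver, clinfo
+$ apt-get update && apt-get install -y \
+      intel-opencl-icd intel-media-va-driver clinfo
+```
+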
+An example container is listed in [Testing and demos](#testing-and-demos).
+
+Validation status against the *upstream* kernel is listed in the user-space drivers' release notes:
+* Media driver: https://github.com/intel/media-driver/releases
+* Compute driver: https://github.com/intel/compute-runtime/releases
+
+#### Drivers for older (integrated) GPUs
+
+For older (integrated) GPUs, new enough firmware and kernel drivers
+are typically already included with the host OS, and new enough
+user-space drivers (for the GPU containers) are available in the host
+OS repositories.
+
 ### Pre-built Images
 
 [Pre-built images](https://hub.docker.com/r/intel/intel-gpu-plugin)
@@ -155,8 +250,8 @@ master
 ## Testing and Demos
 
 We can test the plugin is working by deploying an OpenCL image and running `clinfo`.
-The sample OpenCL image can be built using `make intel-opencl-icd` and must be made
-available in the cluster.
+The [intel-opencl-icd](../../demo/intel-opencl-icd/) sample OpenCL image, built
+using `make intel-opencl-icd` and available from DockerHub, is used for this.
 
 1. Create a job:
 
@@ -174,8 +269,8 @@ available in the cluster.
    ```
 
-   If the pod did not successfully launch, possibly because it could not obtain the gpu
-   resource, it will be stuck in the `Pending` status:
+   If the pod did not successfully launch, possibly because it could not obtain
+   the requested GPU resource, it will be stuck in the `Pending` status:
 
    ```bash
    $ kubectl get pods