Graphics processing units (GPUs) are often used for compute-intensive workloads such as graphics and visualization. AKS supports GPU-enabled node pools to run these compute-intensive workloads in Kubernetes.
- GPU-enabled node pool on AKS
To enable GPUs on your Kubeflow cluster, follow the instructions on how to use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS).
For an AKS cluster with an existing CPU node pool, you can add another node pool with GPUs:
```shell
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --node-vm-size Standard_NC6 \
    --name gpu \
    --node-count 3
```
- Create and add multiple node pools to AKS
- See all available GPU optimized VM sizes
Install NVIDIA drivers
Create a Kubernetes namespace and deploy a DaemonSet for the NVIDIA device plugin. The DaemonSet runs a pod on each node to provide the required drivers for the GPUs.
```shell
kubectl create namespace gpu-resources
kubectl apply -f nvidia-device-plugin-ds.yaml
```
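Once the device plugin is running, each GPU node advertises an `nvidia.com/gpu` entry in its allocatable resources, which you can inspect with `kubectl get nodes -o json`. As an illustrative sketch, the snippet below sums GPU counts from such output; the embedded JSON is a hypothetical sample, not real cluster output:

```python
import json

# Hypothetical excerpt of `kubectl get nodes -o json` output. A GPU node
# exposes "nvidia.com/gpu" in status.allocatable once the NVIDIA device
# plugin DaemonSet is running on it; CPU-only nodes do not.
sample_nodes_json = """
{
  "items": [
    {"metadata": {"name": "aks-gpu-0"},
     "status": {"allocatable": {"cpu": "6", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "aks-nodepool1-0"},
     "status": {"allocatable": {"cpu": "2"}}}
  ]
}
"""

def count_gpus(nodes_json: str) -> dict:
    """Return a mapping of node name -> allocatable GPU count."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]:
            int(node["status"]["allocatable"].get("nvidia.com/gpu", "0"))
        for node in nodes
    }

print(count_gpus(sample_nodes_json))
# {'aks-gpu-0': 1, 'aks-nodepool1-0': 0}
```

A node reporting `0` here either has no GPU or the device plugin pod is not running on it.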
Set a node selector constraint on the ContainerOp to update the component's pod specification so that its pod is scheduled only on nodes matching the given key-value label.
```python
import kfp.dsl as dsl

gpu_op = dsl.ContainerOp(name='gpu-op', ...)
gpu_op.add_node_selector_constraint('accelerator', 'nvidia')
```
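Under the hood, the constraint becomes a `nodeSelector` entry in the generated Kubernetes pod spec. A minimal sketch of that fragment (no kfp required; the `accelerator: nvidia` label key and value follow the example above):

```python
def node_selector_spec(label_name: str, value: str) -> dict:
    """Build the nodeSelector fragment of a Kubernetes pod spec,
    mirroring what add_node_selector_constraint produces."""
    return {"spec": {"nodeSelector": {label_name: value}}}

print(node_selector_spec("accelerator", "nvidia"))
# {'spec': {'nodeSelector': {'accelerator': 'nvidia'}}}
```

For the pod to schedule, the GPU nodes must actually carry a matching label (e.g. applied with `kubectl label node <node-name> accelerator=nvidia`).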
The GPU limit can be set with set_gpu_limit() on a ContainerOp.
```python
import kfp.dsl as dsl

gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
```
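set_gpu_limit() translates into an extended-resource limit on the container, using the `nvidia.com/gpu` resource name by default. A sketch of the resulting resources fragment (the helper name and default vendor are illustrative assumptions):

```python
def gpu_limit_spec(count: int, vendor: str = "nvidia") -> dict:
    """Container resources fragment with a GPU extended-resource limit,
    as produced by set_gpu_limit(count). The vendor default of "nvidia"
    yields the "nvidia.com/gpu" resource name."""
    return {"resources": {"limits": {f"{vendor}.com/gpu": str(count)}}}

print(gpu_limit_spec(2))
# {'resources': {'limits': {'nvidia.com/gpu': '2'}}}
```

Kubernetes schedules the pod only on a node with at least that many allocatable GPUs, so the limit should not exceed the GPUs per node in the pool (e.g. 1 for Standard_NC6).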
This repository includes a sample component (gpu-op) that runs a PyTorch container which checks and prints the GPU (CUDA) device specification.