
gpu-operator breaks when upgrading EKS to K8s v1.30 #1220

Open
runitmisra opened this issue Jan 22, 2025 · 3 comments
Comments

@runitmisra

We are running the GPU Operator on our EKS clusters and are working to upgrade them to v1.30 (and subsequently to v1.31). GPU nodes were working fine on Kubernetes v1.29.

We upgraded the cluster control plane to v1.30 and followed these steps to upgrade our GPU nodes:

  • Make sure the gpu-operator version is compatible with the Kubernetes version
  • Get a supported Ubuntu 22.04 EKS AMI from https://cloud-images.ubuntu.com/aws-eks/ as mentioned in this doc (a rough AWS CLI sketch follows this list)
  • Upgrade the GPU node group with the AMI obtained in the previous step. The AMI is Ubuntu 22.04 with kernel version 6.5 (we also tried an AMI with kernel version 6.8). Ideally, any Ubuntu 22.04 x86_64 AMI in the list should work fine with the GPU Operator.
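
For reference, here is one way to locate such an AMI with the AWS CLI (the Canonical owner ID and the name pattern are assumptions based on the AMI name we ended up with, listed below). Note that neither the image name nor the describe-images output tells you the kernel version:

$ aws ec2 describe-images \
    --owners 099720109477 \
    --region us-east-1 \
    --filters "Name=name,Values=ubuntu-eks/k8s_1.30/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
    --query 'sort_by(Images, &CreationDate)[-1].{Name: Name, ImageId: ImageId}'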

We started seeing GPU Operator pods going into Error and CrashLoopBackOff states. To reduce disruption to workloads, we created a new node pool on k8s v1.29 with the older config and kept one node on v1.30 for testing.

Basic details:

Kubernetes version: v1.30
GPU Operator version: v24.6.2
GPU Driver version: v535.183.01
Ubuntu AMI Name: ubuntu-eks/k8s_1.30/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240526
Kernel Version: 6.5.0-1020-aws

Here is the pod status on the v1.30 node:

$ kubectl get pods -n gpu-operator --field-selector spec.nodeName=ip-10-2-87-217.ec2.internal                                                
NAME                                               READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-r88m2                        0/2     Init:0/2           0             2m30s
gpu-operator-node-feature-discovery-worker-qh45g   1/1     Running            0             4h8m
nvidia-container-toolkit-daemonset-wnmsk           0/1     Init:0/1           0             2m30s
nvidia-dcgm-exporter-4q62c                         0/1     Init:0/1           0             2m30s
nvidia-device-plugin-daemonset-ngvbq               0/2     Init:0/2           0             2m30s
nvidia-driver-daemonset-n8m56                      0/1     CrashLoopBackOff   3 (36s ago)   2m33s
nvidia-operator-validator-mwp76                    0/1     Init:0/4           0             2m30s

These pods keep terminating, crashing and getting recreated over and over.

Here are some more logs and info that might help.
Describing the nvidia-operator-validator pod shows this error in the Events:

Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               107s               default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-fqt2z to ip-10-2-87-217.ec2.internal
  Warning  FailedCreatePodSandBox  1s (x9 over 107s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
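
As far as I understand, this error just means containerd has no runtime handler named "nvidia" yet; the container toolkit daemonset is what normally configures it, and it is stuck behind the crashing driver pod. A rough way to confirm on the node (the containerd config path is an assumption for this Ubuntu EKS AMI):

# Handler name requested by the operator's RuntimeClass
$ kubectl get runtimeclass nvidia -o jsonpath='{.handler}{"\n"}'

# Check whether containerd on the node actually has that handler configured
# (kubectl debug mounts the node's root filesystem at /host)
$ kubectl debug node/ip-10-2-87-217.ec2.internal -it --image=ubuntu:22.04 -- \
    grep -A3 'runtimes.nvidia' /host/etc/containerd/config.toml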

Logs of driver daemonset pod:

$ kubectl logs -n gpu-operator nvidia-driver-daemonset-bn5hn -f                                                                              
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.183.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.183.01 [progress dots truncated]

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.183.01 for Linux kernel version 6.5.0-1020-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

The pod terminates and crashes after this.
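
My assumption (not confirmed from the entrypoint script) is that "Could not resolve Linux kernel version" happens while the driver container tries to resolve the headers/modules packages for the running kernel from the Ubuntu archives. A rough way to check whether those packages are still resolvable from a plain ubuntu:22.04 environment on the node:

$ kubectl debug node/ip-10-2-87-217.ec2.internal -it --image=ubuntu:22.04 -- bash -c '
    apt-get update -qq &&
    apt-cache policy "linux-headers-$(uname -r)" "linux-modules-$(uname -r)" "linux-image-$(uname -r)"'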

I would appreciate any help in figuring out why this is happening and which AMI/kernel versions we can use to mitigate this.

  • Finding out the kernel version of any given AMI is difficult. If there is a strict kernel version requirement, the documentation should point to where the user can find a suitable AMI. (The only after-the-fact check we found is shown below this list.)
  • The documentation mentions kernel version requirements only once, listing 5.15 and 6.5. 5.15 no longer ships in AMIs for k8s v1.30, and we tested 6.5, which does not work. The Amazon EKS-specific docs make no mention of kernel version requirements and state that any Ubuntu 22.04 x86_64 image is fine.
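
For what it's worth, once a test node from a candidate AMI has joined the cluster, its kernel can at least be checked without SSH (the node name here is our v1.30 test node):

$ kubectl get node ip-10-2-87-217.ec2.internal -o jsonpath='{.status.nodeInfo.kernelVersion}{"\n"}'
6.5.0-1020-aws
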
@runitmisra (Author)

We also plan to use pre-compiled driver images going forward, but again, no kernel newer than 5.15 is supported. We are having trouble finding an AMI that is Ubuntu 22.04, has the 5.15 kernel, and is compatible with k8s v1.30 and v1.31.
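
If I understand the pre-compiled image tag scheme correctly, the kernel version is embedded in the tags of nvcr.io/nvidia/driver, so listing the tags should show whether a given kernel is covered. The tag format is my assumption, and skopeo is just one way to list tags:

$ skopeo list-tags docker://nvcr.io/nvidia/driver | grep -E '5\.15\.0|6\.5\.0-1020-aws'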

@mukulgit123

What I find a bit confusing from the docs: for the version you mentioned, the supported operating systems and Kubernetes platforms page here states, under the Cloud Service Providers tab, that EKS is supported only from v1.25 to v1.28. I doubt that is really the case, since it worked fine on v1.29 for you and only started failing with v1.30. But if the document is correct, I will have to think twice before upgrading to any version beyond 1.28.

[Screenshot of the supported platforms table from the GPU Operator docs, showing EKS support for v1.25–v1.28]

@runitmisra (Author)

In my opinion, the compatibility issue is the kernel version. NVIDIA does not provide driver support (either standard or pre-compiled) for any kernel version > 5.15, and Ubuntu does not provide an AMI that is both compatible with k8s v1.30+ AND has kernel v5.15!

So my question here is: is there really no way to run gpu-operator-managed clusters reliably on k8s v1.30 and above?
