We are running the GPU Operator on our EKS clusters and are working to upgrade them to v1.30 (and subsequently to v1.31). GPU nodes were working fine on Kubernetes v1.29.
We upgraded the cluster control plane to v1.30 and took the following steps to upgrade our GPU nodes:
1. Make sure the gpu-operator version is compatible with the Kubernetes version (a quick check is sketched right after this list).
2. Upgrade the GPU node group with the AMI obtained from the above step. The AMI is Ubuntu 22.04 with kernel version 6.5 (we also tried an AMI with kernel version 6.8). Ideally, any Ubuntu 22.04 x86_64 AMI in the list should work with the GPU Operator just fine.
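To be clear about what is installed before comparing against the support matrix, here is a minimal sketch (assuming the operator was deployed with the official Helm chart into the gpu-operator namespace):

$ helm list -n gpu-operator     # CHART / APP VERSION columns show the installed gpu-operator version
$ kubectl version               # client and control-plane versions to compare against the matrix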
We started seeing GPU Operator pods going into Error and CrashLoopBackOff states. To reduce disruption to workloads, we created a new node pool on k8s v1.29 with the older configs and kept one node on v1.30 for testing.
These pods keep terminating, crashing and getting recreated over and over.
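For completeness, this is roughly how the pod status for just the v1.30 node can be pulled (a sketch; the node name is the one that appears in the events further down):

$ kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName=ip-10-2-87-217.ec2.internal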
Here are some more logs and info that might help:
Describing the nvidia-operator-validator pod - the Events show this error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 107s default-scheduler Successfully assigned gpu-operator/nvidia-operator-validator-fqt2z to ip-10-2-87-217.ec2.internal
Warning FailedCreatePodSandBox 1s (x9 over 107s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
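In case it helps anyone hitting the same symptom: the 'no runtime for "nvidia" is configured' error suggests containerd on that node never got an nvidia runtime handler registered (normally the container-toolkit component sets this up once the driver is ready). A rough way to check, assuming the default containerd config path on the Ubuntu AMI:

$ kubectl get runtimeclass nvidia                                  # RuntimeClass created by the operator
$ # then on the node itself:
$ sudo grep -n -A3 'runtimes.nvidia' /etc/containerd/config.toml   # expect a [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] section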
Logs of driver daemonset pod:
$ kubectl logs -n gpu-operator nvidia-driver-daemonset-bn5hn -f
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.183.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.183.01........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 535.183.01 for Linux kernel version 6.5.0-1020-aws
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Pod terminates and crashes after this.
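If it helps narrow things down, the "Could not resolve Linux kernel version" message appears to come from the step where the driver container looks up kernel header packages for the running kernel in the Ubuntu repositories (it runs right after "Updating the package cache..."). A rough way to check whether that lookup can succeed at all for this kernel (a sketch, run on the node or in an Ubuntu 22.04 container):

$ uname -r
6.5.0-1020-aws
$ apt-get update -qq
$ apt-cache show linux-headers-$(uname -r) | head -n 5     # empty output means apt cannot find headers for this exact kernel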
I would appreciate any help in figuring out why this is happening and what AMI/kernel versions we can use to mitigate this.
Finding out the kernel version of any given AMI is difficult. If there is a strict kernel version requirement, the documentation should mention it and point users to where they can find a suitable AMI.
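For what it's worth, once a node built from a candidate AMI has joined the cluster, its kernel can at least be read back from the node object (a sketch):

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage

That only tells you after the fact, though; a documented list of supported kernels per release would still be much easier.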
We also plan to use pre-compiled driver images going forward, but again, no kernel version newer than 5.15 is supported. We are having trouble finding an AMI that is Ubuntu 22.04, ships the 5.15 kernel, and is compatible with k8s v1.30 and v1.31.
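For reference, this is roughly how we would expect to switch to pre-compiled images via the Helm chart (a sketch only, assuming the chart's driver.usePrecompiled flag; the release name, repo alias, and driver branch below are placeholders we have not validated):

$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set driver.usePrecompiled=true \
    --set driver.version=535    # placeholder branch; a matching pre-compiled image must exist for the node's exact kernel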
It's a bit confusing that the docs for the version you mentioned, on the supported operating systems and Kubernetes platforms page (Cloud Service Providers tab), say EKS is supported only for v1.25-v1.28. I doubt that is really the case, since it worked fine on v1.29 for you and only started failing with v1.30. But if the documentation is correct, I will have to think twice before upgrading to any version beyond 1.28.
In my opinion, the compatibility issue is the kernel version. NVIDIA does not provide driver support (either standard or pre-compiled) for any kernel version > 5.15, and Ubuntu does not provide an AMI that is both compatible with k8s v1.30+ AND has kernel 5.15.
So my question here is: is there NO WAY to run gpu-operator-managed clusters reliably on k8s v1.30 and above??
Regarding kernel versions 5.15 and 6.5: 5.15 no longer ships in AMIs for k8s v1.30, and we tested with 6.5 and it does not work. In the Amazon EKS-specific docs, there is no mention of kernel version requirements; they state that as long as you have an Ubuntu 22.04 x86_64 image, you're good.