
gpu-operator breaks when upgrading EKS to K8s v1.30 #1220

Open
runitmisra opened this issue Jan 22, 2025 · 3 comments
Comments

@runitmisra

We are running the GPU Operator on our EKS clusters and are working to upgrade them to v1.30 (and subsequently to v1.31). GPU nodes were working fine on Kubernetes v1.29.

We upgraded the cluster control plane to v1.30 and followed these steps to upgrade our GPU nodes:

  • Make sure the gpu-operator version is compatible with the Kubernetes version
  • Get a supported Ubuntu 22.04 EKS AMI from https://cloud-images.ubuntu.com/aws-eks/ as mentioned in this doc (a rough AWS CLI sketch follows this list)
  • Upgrade the GPU node group with the AMI obtained in the previous step. The AMI is Ubuntu 22.04 with kernel version 6.5 (we also tried an AMI with kernel version 6.8). Ideally, any Ubuntu 22.04 x86_64 AMI in the list should work fine with the GPU Operator.
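
For reference, here is one way to locate such an AMI with the AWS CLI (the Canonical owner ID and the name pattern are assumptions based on the AMI name we ended up with, listed below). Note that neither the image name nor the describe-images output tells you the kernel version:

$ aws ec2 describe-images \
    --owners 099720109477 \
    --region us-east-1 \
    --filters "Name=name,Values=ubuntu-eks/k8s_1.30/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
    --query 'sort_by(Images, &CreationDate)[-1].{Name: Name, ImageId: ImageId}'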

We started seeing GPU Operator pods going into Error and CrashLoopBackOff states. To reduce disruption to workloads, we created a new node pool on k8s v1.29 with the older config and kept one node on v1.30 for testing.

Basic details:

Kubernetes version: v1.30
GPU Operator version: v24.6.2
GPU Driver version: v535.183.01
Ubuntu AMI Name: ubuntu-eks/k8s_1.30/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240526
Kernel Version: 6.5.0-1020-aws

Here is the pod status on the v1.30 node:

$ kubectl get pods -n gpu-operator --field-selector spec.nodeName=ip-10-2-87-217.ec2.internal                                                
NAME                                               READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-r88m2                        0/2     Init:0/2           0             2m30s
gpu-operator-node-feature-discovery-worker-qh45g   1/1     Running            0             4h8m
nvidia-container-toolkit-daemonset-wnmsk           0/1     Init:0/1           0             2m30s
nvidia-dcgm-exporter-4q62c                         0/1     Init:0/1           0             2m30s
nvidia-device-plugin-daemonset-ngvbq               0/2     Init:0/2           0             2m30s
nvidia-driver-daemonset-n8m56                      0/1     CrashLoopBackOff   3 (36s ago)   2m33s
nvidia-operator-validator-mwp76                    0/1     Init:0/4           0             2m30s

These pods keep terminating, crashing and getting recreated over and over.

Here are some more logs and info that might help.
Describing the nvidia-operator-validator pod shows this error in the Events:

Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               107s               default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-fqt2z to ip-10-2-87-217.ec2.internal
  Warning  FailedCreatePodSandBox  1s (x9 over 107s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
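
As far as I understand, this error just means containerd has no runtime handler named "nvidia" yet; the container toolkit daemonset is what normally configures it, and it is stuck behind the crashing driver pod. A rough way to confirm on the node (the containerd config path is an assumption for this Ubuntu EKS AMI):

# Handler name requested by the operator's RuntimeClass
$ kubectl get runtimeclass nvidia -o jsonpath='{.handler}{"\n"}'

# Check whether containerd on the node actually has that handler configured
# (kubectl debug mounts the node's root filesystem at /host)
$ kubectl debug node/ip-10-2-87-217.ec2.internal -it --image=ubuntu:22.04 -- \
    grep -A3 'runtimes.nvidia' /host/etc/containerd/config.toml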

Logs of driver daemonset pod:

$ kubectl logs -n gpu-operator nvidia-driver-daemonset-bn5hn -f                                                                              
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.183.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.183.01 [progress dots truncated]

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.183.01 for Linux kernel version 6.5.0-1020-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

The pod terminates and crashes after this.
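
My assumption (not confirmed from the entrypoint script) is that "Could not resolve Linux kernel version" happens while the driver container tries to resolve the headers/modules packages for the running kernel from the Ubuntu archives. A rough way to check whether those packages are still resolvable from a plain ubuntu:22.04 environment on the node:

$ kubectl debug node/ip-10-2-87-217.ec2.internal -it --image=ubuntu:22.04 -- bash -c '
    apt-get update -qq &&
    apt-cache policy "linux-headers-$(uname -r)" "linux-modules-$(uname -r)" "linux-image-$(uname -r)"'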

I would appreciate any help in figuring out why this is happening and which AMI/kernel versions we can use to mitigate this.

  • Finding out the kernel version of any given AMI is difficult. If there is a strict kernel version requirement, the documentation should point to where the user can find a suitable AMI. (The only after-the-fact check we found is shown below this list.)
  • The documentation mentions kernel version requirements only once, listing 5.15 and 6.5. 5.15 no longer ships in AMIs for k8s v1.30, and we tested 6.5, which does not work. The Amazon EKS-specific docs make no mention of kernel version requirements and state that any Ubuntu 22.04 x86_64 image is fine.
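
For what it's worth, once a test node from a candidate AMI has joined the cluster, its kernel can at least be checked without SSH (the node name here is our v1.30 test node):

$ kubectl get node ip-10-2-87-217.ec2.internal -o jsonpath='{.status.nodeInfo.kernelVersion}{"\n"}'
6.5.0-1020-aws
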
@runitmisra (Author)

We also plan to use pre-compiled driver images going forward, but again, no kernel newer than 5.15 is supported. We are having trouble finding an AMI that is Ubuntu 22.04, has the 5.15 kernel, and is compatible with k8s v1.30 and v1.31.
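
If I understand the pre-compiled image tag scheme correctly, the kernel version is embedded in the tags of nvcr.io/nvidia/driver, so listing the tags should show whether a given kernel is covered. The tag format is my assumption, and skopeo is just one way to list tags:

$ skopeo list-tags docker://nvcr.io/nvidia/driver | grep -E '5\.15\.0|6\.5\.0-1020-aws'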

@mukulgit123

What I find a bit confusing from the docs: for the version you mentioned, the supported operating systems and Kubernetes platforms page here states, under the Cloud Service Providers tab, that EKS is supported only from v1.25 to v1.28. I doubt that is really the case, since it worked fine on v1.29 for you and only started failing with v1.30. But if the document is correct, I will have to think twice before upgrading to any version beyond 1.28.

[Screenshot of the supported platforms table from the GPU Operator docs, showing EKS support for v1.25–v1.28]

@runitmisra (Author)

In my opinion, the compatibility issue is the kernel version. NVIDIA does not provide driver support (either standard or pre-compiled) for any kernel version > 5.15, and Ubuntu does not provide an AMI that is both compatible with k8s v1.30+ AND has kernel v5.15!

So my question here is: is there really no way to run gpu-operator-managed clusters reliably on k8s v1.30 and above?
