
feat(civo-github): add gpu operator to allow use of GPU nodes #789

Open · wants to merge 3 commits into main from sje/civo-ai-github
Conversation

@mrsimonemms (Contributor) commented Aug 15, 2024

Description

This is a piece of work that requires a bit of thought. For Civo-GitHub, I've added the ability to use GPU nodes. This requires some changes to how things work which could cause problems.

Uncontroversial changes

  • Install the NVIDIA GPU operator. This is based upon the work done in https://github.com/civo-learn/civo-gpu-operator-tf, but tailored to our needs. I don't particularly like the is_gpu flag as it assumes that all GPU node sizes start with g4g. or an., which may not always be true - if anyone has a better idea how to achieve this, I'm all ears (I had hoped that data.civo_size would expose it, but it doesn't). A rough sketch of the approach follows.
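This sketch assumes a var.node_size input; the variable name, the prefix list, and the helm_release settings are illustrative assumptions rather than the PR's actual code (startswith needs Terraform >= 1.3).

variable "node_size" {
  type        = string
  description = "Civo node size, e.g. g4g.40.kube.small"
}

locals {
  # Hypothetical detection: treat sizes starting with g4g. or an. as GPU nodes,
  # since data.civo_size does not expose a GPU attribute.
  gpu_size_prefixes = ["g4g.", "an."]
  is_gpu = anytrue([
    for prefix in local.gpu_size_prefixes : startswith(var.node_size, prefix)
  ])
}

resource "helm_release" "gpu_operator" {
  # Only install the NVIDIA GPU operator when a GPU node size is detected.
  count      = local.is_gpu ? 1 : 0
  name       = "gpu-operator"
  namespace  = "gpu-operator"
  repository = "https://helm.ngc.nvidia.com/nvidia"
  chart      = "gpu-operator"
}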

Controversial changes

  • I've had to bump Crossplane to v1.16.0 (from v1.12.2). This is because I need to use the helm_release Terraform resource, which hits a problem where the Crossplane provider cannot download the charts (see Terraform helm provider cannot retrieve chart crossplane-contrib/provider-terraform#54). To fix this, I need an emptyDir volume mount on the Crossplane provider's pod, and the v1.12.2 version of the ControllerConfig.pkg.crossplane.io/v1alpha1 CRD doesn't support it (a sketch of the fix follows this list).
  • This means that civo-github has a different Crossplane version from all the other providers, which I feel should probably be consistent.
  • We should probably also switch away from ControllerConfig as it has been deprecated since v1.11 and will be removed at some future date. Again, this is a big change to make across all providers.
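A minimal sketch of the emptyDir fix, expressed here with Terraform's kubernetes_manifest for illustration (in the gitops repository this would more likely be a plain YAML manifest). The resource name, volume name, and mount path are assumptions; the volumes/volumeMounts fields are only present in the ControllerConfig CRD shipped with the newer Crossplane release, which is what forces the bump.

resource "kubernetes_manifest" "provider_terraform_controller_config" {
  manifest = {
    apiVersion = "pkg.crossplane.io/v1alpha1"
    kind       = "ControllerConfig"
    metadata = {
      name = "provider-terraform"
    }
    spec = {
      # Writable scratch space so the in-pod Terraform helm provider can
      # download charts (see crossplane-contrib/provider-terraform#54).
      volumes = [
        { name = "scratch", emptyDir = {} }
      ]
      volumeMounts = [
        { name = "scratch", mountPath = "/tmp" }
      ]
    }
  }
}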

Related Issue(s)

Fixes #

How to test

  • IMPORTANT: the cheapest GPU node costs $1,200 per month ($1+ an hour) - don't leave it running.
  1. Deploy Civo GitHub
/path/to/kubefirst civo create \
--alerts-email <EMAIL> \
--github-org <ORG> \
--cluster-name <NAME> \
--domain-name <DOMAIN> \
--gitops-template-branch sje/crossplane-version \
--cloud-region LON1
  2. In the console, deploy a single-node GPU cluster - suggest using a g4g.40.kube.small.

@mrsimonemms mrsimonemms force-pushed the sje/civo-ai-github branch 6 times, most recently from 935f87b to 39006e9 Compare August 20, 2024 14:47
@mrsimonemms mrsimonemms marked this pull request as ready for review August 20, 2024 15:13
metadata {
  name = "gpu-operator"
  labels = {
    "pod-security.kubernetes.io/enforce" = "privileged"
Member:

Any reason for this limitation?

Contributor Author:

It's required by the NVIDIA docs if you're using Pod Security Admission. As this is a PoC, I think it should be included to avoid any issues where it's not applied.

Member:

I don't think I read it as "it's required". The statement starts with "If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods", but I'm not sure that installing it makes this a requirement.

touch /run/nvidia/validations/toolkit-ready;
touch /run/nvidia/validations/.driver-ctr-ready;
touch /run/nvidia/validations/driver-ready
sleep infinity
Member:

I would recommend changing this (and maybe the container) to a kubernetes/pause container, which can correctly trap and respond to signals.

With sleep infinity inside a bash command, a trapped signal is only handled once the foreground sleep command finishes (which in this case it never does). That's why Kubernetes relies on kubernetes/pause rather than any sleep mechanism, as sleeps are a bit wasteful for this purpose.

More info: https://mywiki.wooledge.org/SignalTrap#When_is_the_signal_handled.3F
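A minimal sketch of this suggestion in Terraform, assuming the validation files only need to be created once per pod start: run the touch in an init container and let registry.k8s.io/pause keep the pod alive. The workload kind, names, images, and the hostPath mount are illustrative assumptions, not the Civo guide's exact manifest.

resource "kubernetes_deployment" "fake_gpu_operator" {
  metadata {
    name      = "fake-operator"
    namespace = "gpu-operator"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { app = "fake-operator" }
    }
    template {
      metadata {
        labels = { app = "fake-operator" }
      }
      spec {
        volume {
          name = "nvidia-validations"
          host_path {
            path = "/run/nvidia/validations"
            type = "DirectoryOrCreate"
          }
        }
        init_container {
          # Create the fake validation files once, then exit.
          name  = "fake-validations"
          image = "busybox:1.36"
          command = [
            "sh", "-c",
            "touch /run/nvidia/validations/toolkit-ready /run/nvidia/validations/.driver-ctr-ready /run/nvidia/validations/driver-ready",
          ]
          volume_mount {
            name       = "nvidia-validations"
            mount_path = "/run/nvidia/validations"
          }
        }
        container {
          # pause handles SIGTERM itself, so the pod terminates promptly
          # instead of waiting out a foreground `sleep infinity`.
          name  = "pause"
          image = "registry.k8s.io/pause:3.9"
        }
      }
    }
  }
}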

Contributor Author:

This is straight out of the Civo guide. I don't particularly like this, but the fake-operator pod is required to prompt the operator to apply the labels/annotations to the node.

I'll have a look at your suggestion, but this might need to stay as-is if I can't get it to work.

Member:

I see. I know it's required (several other requirements come from similar guides, like enabling higher file descriptor limits, so this pattern is quite common).

While that might be in their guide, I would want to avoid having to troubleshoot pods that struggle to terminate if we can leverage pause instead. Happy to help you make the conversion if need be!
