
feat(civo-github): add gpu operator to allow use of GPU nodes #789

Open · wants to merge 3 commits into main from sje/civo-ai-github
Conversation

@mrsimonemms (Contributor) commented Aug 15, 2024

Description

This is a piece of work that requires a bit of thought. For Civo-GitHub, I've added the ability to use GPU nodes. This requires some changes to how things work which could cause problems.

Uncontroversial changes

  • Install the NVIDIA GPU operator. This is based upon the work done in https://github.com/civo-learn/civo-gpu-operator-tf, but tailored to our needs. I don't particularly like the is_gpu flag as it assumes that all GPU node sizes start with g4g. or an., which may not always be true - if anyone has a better idea how to achieve this, I'm all ears (I had hoped that data.civo_size would expose it, but it doesn't). A rough sketch of the approach follows.
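This sketch assumes a var.node_size input; the variable name, the prefix list, and the helm_release settings are illustrative assumptions rather than the PR's actual code (startswith needs Terraform >= 1.3).

variable "node_size" {
  type        = string
  description = "Civo node size, e.g. g4g.40.kube.small"
}

locals {
  # Hypothetical detection: treat sizes starting with g4g. or an. as GPU nodes,
  # since data.civo_size does not expose a GPU attribute.
  gpu_size_prefixes = ["g4g.", "an."]
  is_gpu = anytrue([
    for prefix in local.gpu_size_prefixes : startswith(var.node_size, prefix)
  ])
}

resource "helm_release" "gpu_operator" {
  # Only install the NVIDIA GPU operator when a GPU node size is detected.
  count      = local.is_gpu ? 1 : 0
  name       = "gpu-operator"
  namespace  = "gpu-operator"
  repository = "https://helm.ngc.nvidia.com/nvidia"
  chart      = "gpu-operator"
}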

Controversial changes

  • I've had to bump Crossplane to v1.16.0 (from v1.12.2). This is because I need to use the helm_release Terraform resource, which hits a problem where the Crossplane provider cannot download the charts (see Terraform helm provider cannot retrieve chart crossplane-contrib/provider-terraform#54). To fix this, I need an emptyDir volume mount on the Crossplane provider's pod, and the v1.12.2 version of the ControllerConfig.pkg.crossplane.io/v1alpha1 CRD doesn't support it (a sketch of the fix follows this list).
  • This means that civo-github has a different Crossplane version from all the other providers, which I feel should probably be consistent.
  • We should probably also switch away from ControllerConfig as it has been deprecated since v1.11 and will be removed at some future date. Again, this is a big change to make across all providers.
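A minimal sketch of the emptyDir fix, expressed here with Terraform's kubernetes_manifest for illustration (in the gitops repository this would more likely be a plain YAML manifest). The resource name, volume name, and mount path are assumptions; the volumes/volumeMounts fields are only present in the ControllerConfig CRD shipped with the newer Crossplane release, which is what forces the bump.

resource "kubernetes_manifest" "provider_terraform_controller_config" {
  manifest = {
    apiVersion = "pkg.crossplane.io/v1alpha1"
    kind       = "ControllerConfig"
    metadata = {
      name = "provider-terraform"
    }
    spec = {
      # Writable scratch space so the in-pod Terraform helm provider can
      # download charts (see crossplane-contrib/provider-terraform#54).
      volumes = [
        { name = "scratch", emptyDir = {} }
      ]
      volumeMounts = [
        { name = "scratch", mountPath = "/tmp" }
      ]
    }
  }
}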

Related Issue(s)

Fixes #

How to test

  • IMPORTANT: the cheapest GPU node costs $1,200 per month ($1+ an hour) - don't leave it running.
  1. Deploy Civo GitHub
/path/to/kubefirst civo create \
--alerts-email <EMAIL> \
--github-org <ORG> \
--cluster-name <NAME> \
--domain-name <DOMAIN> \
--gitops-template-branch sje/crossplane-version \
--cloud-region LON1
  2. In the console, deploy a single-node GPU cluster - suggest using a g4g.40.kube.small.

@mrsimonemms mrsimonemms force-pushed the sje/civo-ai-github branch 6 times, most recently from 935f87b to 39006e9 Compare August 20, 2024 14:47
@mrsimonemms mrsimonemms marked this pull request as ready for review August 20, 2024 15:13
metadata {
  name = "gpu-operator"
  labels = {
    "pod-security.kubernetes.io/enforce" = "privileged"
Member:

Any reason for this limitation?

Contributor Author:

It's required by the NVIDIA docs if you're using Pod Security Admission. As this is a PoC, I think it should be included to avoid any issues where it's not applied.

Member:

I don't think I read it as "it's required". The statement starts with "If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods", but I'm not sure that installing it makes this a requirement.

touch /run/nvidia/validations/toolkit-ready;
touch /run/nvidia/validations/.driver-ctr-ready;
touch /run/nvidia/validations/driver-ready
sleep infinity
Member:

I would recommend changing this (and maybe the container) to a kubernetes/pause container, which can correctly trap and respond to signals.

With sleep infinity inside a bash command, a trapped signal is only handled once the foreground sleep command finishes (which in this case it never does). That's why Kubernetes relies on kubernetes/pause rather than any sleep mechanism, as sleeps are a bit wasteful for this purpose.

More info: https://mywiki.wooledge.org/SignalTrap#When_is_the_signal_handled.3F
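A minimal sketch of this suggestion in Terraform, assuming the validation files only need to be created once per pod start: run the touch in an init container and let registry.k8s.io/pause keep the pod alive. The workload kind, names, images, and the hostPath mount are illustrative assumptions, not the Civo guide's exact manifest.

resource "kubernetes_deployment" "fake_gpu_operator" {
  metadata {
    name      = "fake-operator"
    namespace = "gpu-operator"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { app = "fake-operator" }
    }
    template {
      metadata {
        labels = { app = "fake-operator" }
      }
      spec {
        volume {
          name = "nvidia-validations"
          host_path {
            path = "/run/nvidia/validations"
            type = "DirectoryOrCreate"
          }
        }
        init_container {
          # Create the fake validation files once, then exit.
          name  = "fake-validations"
          image = "busybox:1.36"
          command = [
            "sh", "-c",
            "touch /run/nvidia/validations/toolkit-ready /run/nvidia/validations/.driver-ctr-ready /run/nvidia/validations/driver-ready",
          ]
          volume_mount {
            name       = "nvidia-validations"
            mount_path = "/run/nvidia/validations"
          }
        }
        container {
          # pause handles SIGTERM itself, so the pod terminates promptly
          # instead of waiting out a foreground `sleep infinity`.
          name  = "pause"
          image = "registry.k8s.io/pause:3.9"
        }
      }
    }
  }
}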

Contributor Author:

This is straight out of the Civo guide. I don't particularly like this, but the fake-operator pod is required to prompt the operator to apply the labels/annotations to the node.

I'll have a look at your suggestion, but this might need to stay as-is if I can't get it to work.

Member:

I see. I know it's required (several other requirements come from similar guides, like enabling higher file descriptor limits, so this pattern is quite common).

While that might be in their guide, I would want to avoid having to troubleshoot pods that struggle to terminate if we can leverage pause instead. Happy to help you make the conversion if need be!
