Merge pull request cowboysysop#86 from sebastien-prudhomme/training-o…

…perator feat(training-operator): first version of the chart
sebastien-prudhomme · Oct 18, 2021 · b7e24e5 · b7e24e5
2 parents 5fa6508 + 1673a03
commit b7e24e5
Show file tree

Hide file tree

Showing 20 changed files with 28,449 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -15,6 +15,7 @@
 | [mongo-express](charts/mongo-express)                     | Web-based MongoDB admin interface, written with Node.js and express                                                          |
 | [mpi-operator](charts/mpi-operator)                       | Makes it easy to run allreduce-style distributed training on Kubernetes                                                      |
 | [quickchart](charts/quickchart)                           | Chart image and QR code web API                                                                                              |
+| [training-operator](charts/training-operator)             | Makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes                      |
 | [vertical-pod-autoscaler](charts/vertical-pod-autoscaler) | Set of components that automatically adjust the amount of CPU and memory requested by pods running in the Kubernetes Cluster |
 | [whoami](charts/whoami)                                   | Tiny Go webserver that prints os information and HTTP request to output                                                      |
 

diff --git a/charts/training-operator/.helmignore b/charts/training-operator/.helmignore
@@ -0,0 +1,22 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/
diff --git a/charts/training-operator/Chart.lock b/charts/training-operator/Chart.lock
@@ -0,0 +1,6 @@
+dependencies:
+- name: common
+  repository: https://charts.bitnami.com/bitnami/
+  version: 1.7.1
+digest: sha256:40f9bf131e797c2ef880e51b4d481bf7bd1f79980fd288d627ac5be8f0563877
+generated: "2021-10-18T08:44:50.060653779+02:00"
diff --git a/charts/training-operator/Chart.yaml b/charts/training-operator/Chart.yaml
@@ -0,0 +1,16 @@
+apiVersion: v2
+appVersion: 1.3.0
+description: Makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes
+home: https://github.com/kubeflow/training-operator
+maintainers:
+  - name: sebastien-prudhomme
+    email: [email protected]
+name: training-operator
+sources:
+  - https://github.com/kubeflow/training-operator
+  - https://github.com/cowboysysop/charts/tree/master/charts/training-operator
+version: 1.0.0
+dependencies:
+  - name: common
+    version: 1.7.1
+    repository: https://charts.bitnami.com/bitnami/
diff --git a/charts/training-operator/README.md b/charts/training-operator/README.md
@@ -0,0 +1,149 @@
+# Training Operator
+
+[Training Operator](https://github.com/kubeflow/training-operator) makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes.
+
+## TL;DR;
+
+```bash
+$ helm repo add cowboysysop https://cowboysysop.github.io/charts/
+$ helm install my-release cowboysysop/training-operator
+```
+
+## Introduction
+
+This chart bootstraps a Training Operator deployment on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.
+
+## Prerequisites
+
+- Kubernetes 1.16+
+- Helm 3.0+
+
+## Installing
+
+Install the chart using:
+
+```bash
+$ helm repo add cowboysysop https://cowboysysop.github.io/charts/
+$ helm install my-release cowboysysop/training-operator
+```
+
+These commands deploy Training Operator on the Kubernetes cluster in the default configuration and with the release name `my-release`. The deployment configuration can be customized by specifying the customization parameters with the `helm install` command using the `--values` or `--set` arguments. Find more information in the [configuration section](#configuration) of this document.
+
+## Upgrading
+
+Replace the custom resource definitions created by the chart using:
+
+```bash
+kubectl replace -f crds/mxjob.yaml
+kubectl replace -f crds/pytorchjob.yaml
+kubectl replace -f crds/tfjob.yaml
+kubectl replace -f crds/xgboostjob.yaml
+```
+
+Upgrade the chart deployment using:
+
+```bash
+$ helm upgrade my-release cowboysysop/training-operator
+```
+
+The command upgrades the existing `my-release` deployment with the most latest release of the chart.
+
+**TIP**: Use `helm repo update` to update information on available charts in the chart repositories.
+
+## Uninstalling
+
+Uninstall the `my-release` deployment using:
+
+```bash
+$ helm uninstall my-release
+```
+
+The command deletes the release named `my-release` and frees all the kubernetes resources associated with the release.
+
+**TIP**: Specify the `--purge` argument to the above command to remove the release from the store and make its name free for later use.
+
+Optionally, delete the custom resource definitions created by the chart using:
+
+```bash
+$ kubectl delete crd mxjobs.kubeflow.org
+$ kubectl delete crd pytorchjobs.kubeflow.org
+$ kubectl delete crd tfjobs.kubeflow.org
+$ kubectl delete crd xgboostjobs.kubeflow.org
+```
+
+## Configuration
+
+The following tables lists all the configurable parameters expose by the chart and their default values.
+
+### Common parameters
+
+| Name                | Description                                                                                            | Default |
+|---------------------|--------------------------------------------------------------------------------------------------------|---------|
+| `kubeVersion`       | Override Kubernetes version                                                                            | `""`    |
+| `imagePullSecrets`  | Docker registry secret names as an array                                                               | `[]`    |
+| `nameOverride`      | Partially override `training-operator.fullname` template with a string (will prepend the release name) | `nil`   |
+| `fullnameOverride`  | Fully override `training-operator.fullname` template with a string                                     | `nil`   |
+| `commonAnnotations` | Annotations to add to all deployed objects                                                             | `{}`    |
+| `commonLabels`      | Labels to add to all deployed objects                                                                  | `{}`    |
+
+### Parameters
+
+| Name                                 | Description                                                                                           | Default                                                   |
+|--------------------------------------|-------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
+| `replicaCount`                       | Number of replicas                                                                                    | `1`                                                       |
+| `image.repository`                   | Image name                                                                                            | `public.ecr.aws/j1r0q0g6/training/training-operator`      |
+| `image.tag`                          | Image tag                                                                                             | `760ac1171dd30039a7363ffa03c77454bd714da5`                |
+| `image.pullPolicy`                   | Image pull policy                                                                                     | `IfNotPresent`                                            |
+| `pdb.create`                         | Specifies whether a pod disruption budget should be created                                           | `false`                                                   |
+| `pdb.minAvailable`                   | Minimum number/percentage of pods that should remain scheduled                                        | `1`                                                       |
+| `pdb.maxUnavailable`                 | Maximum number/percentage of pods that may be made unavailable                                        | `nil`                                                     |
+| `serviceAccount.create`              | Specify whether to create a ServiceAccount                                                            | `true`                                                    |
+| `serviceAccount.annotations`         | ServiceAccount annotations                                                                            | `{}`                                                      |
+| `serviceAccount.name`                | The name of the ServiceAccount to create                                                              | Generated using the `training-operator.fullname` template |
+| `podAnnotations`                     | Additional pod annotations                                                                            | `{}`                                                      |
+| `podLabels`                          | Additional pod labels                                                                                 | `{}`                                                      |
+| `podSecurityContext`                 | Pod security context                                                                                  | `{}`                                                      |
+| `priorityClassName`                  | Priority class name                                                                                   | `nil`                                                     |
+| `securityContext`                    | Container security context                                                                            | `{}`                                                      |
+| `livenessProbe.enabled`              | Enable liveness probe                                                                                 | `true`                                                    |
+| `livenessProbe.initialDelaySeconds`  | Delay before the liveness probe is initiated                                                          | `0`                                                       |
+| `livenessProbe.periodSeconds`        | How often to perform the liveness probe                                                               | `10`                                                      |
+| `livenessProbe.timeoutSeconds`       | When the liveness probe times out                                                                     | `1`                                                       |
+| `livenessProbe.failureThreshold`     | Minimum consecutive failures for the liveness probe to be considered failed after having succeeded    | `3`                                                       |
+| `livenessProbe.successThreshold`     | Minimum consecutive successes for the liveness probe to be considered successful after having failed  | `1`                                                       |
+| `readinessProbe.enabled`             | Enable readiness probe                                                                                | `true`                                                    |
+| `readinessProbe.initialDelaySeconds` | Delay before the readiness probe is initiated                                                         | `0`                                                       |
+| `readinessProbe.periodSeconds`       | How often to perform the readiness probe                                                              | `10`                                                      |
+| `readinessProbe.timeoutSeconds`      | When the readiness probe times out                                                                    | `1`                                                       |
+| `readinessProbe.failureThreshold`    | Minimum consecutive failures for the readiness probe to be considered failed after having succeeded   | `3`                                                       |
+| `readinessProbe.successThreshold`    | Minimum consecutive successes for the readiness probe to be considered successful after having failed | `1`                                                       |
+| `resources`                          | CPU/Memory resource requests/limits                                                                   | `{}`                                                      |
+| `nodeSelector`                       | Node labels for pod assignment                                                                        | `{}`                                                      |
+| `tolerations`                        | Tolerations for pod assignment                                                                        | `[]`                                                      |
+| `affinity`                           | Map of node/pod affinities                                                                            | `{}`                                                      |
+| `extraArgs`                          | Additional container arguments                                                                        | `{}`                                                      |
+| `extraEnvVars`                       | Additional container environment variables                                                            | `[]`                                                      |
+| `extraEnvVarsCM`                     | Name of existing ConfigMap containing additional container environment variables                      | `nil`                                                     |
+| `extraEnvVarsSecret`                 | Name of existing Secret containing additional container environment variables                         | `nil`                                                     |
+| `metrics.service.annotations`        | Metrics service annotations                                                                           | {}                                                        |
+| `metrics.service.type`               | Metrics service type                                                                                  | `ClusterIP`                                               |
+| `metrics.service.clusterIP`          | Metrics static cluster IP address or None for headless service when service type is ClusterIP         | `nil`                                                     |
+| `metrics.service.port`               | Metrics service port                                                                                  | `8080`                                                    |
+
+Specify the parameters you which to customize using the `--set` argument to the `helm install` command. For instance,
+
+```bash
+$ helm install my-release \
+    --set nameOverride=my-name cowboysysop/training-operator
+```
+
+The above command sets the `nameOverride` to `my-name`.
+
+Alternatively, a YAML file that specifies the values for the above parameters can be provided while installing the chart. For example,
+
+```bash
+$ helm install my-release \
+    --values values.yaml cowboysysop/training-operator
+```
+
+**Tip**: You can use the default [values.yaml](values.yaml).
diff --git a/charts/training-operator/ci/default-values.yaml b/charts/training-operator/ci/default-values.yaml
@@ -0,0 +1 @@
+fullnameOverride: training-operator