feat(training-operator): first version of the chart
sebastien-prudhomme committed Oct 18, 2021
1 parent 5fa6508 commit 1673a03
Showing 20 changed files with 28,449 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -15,6 +15,7 @@
| [mongo-express](charts/mongo-express) | Web-based MongoDB admin interface, written with Node.js and express |
| [mpi-operator](charts/mpi-operator) | Makes it easy to run allreduce-style distributed training on Kubernetes |
| [quickchart](charts/quickchart) | Chart image and QR code web API |
| [training-operator](charts/training-operator) | Makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes |
| [vertical-pod-autoscaler](charts/vertical-pod-autoscaler) | Set of components that automatically adjust the amount of CPU and memory requested by pods running in the Kubernetes Cluster |
| [whoami](charts/whoami) | Tiny Go webserver that prints os information and HTTP request to output |

22 changes: 22 additions & 0 deletions charts/training-operator/.helmignore
@@ -0,0 +1,22 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
6 changes: 6 additions & 0 deletions charts/training-operator/Chart.lock
@@ -0,0 +1,6 @@
dependencies:
- name: common
repository: https://charts.bitnami.com/bitnami/
version: 1.7.1
digest: sha256:40f9bf131e797c2ef880e51b4d481bf7bd1f79980fd288d627ac5be8f0563877
generated: "2021-10-18T08:44:50.060653779+02:00"
16 changes: 16 additions & 0 deletions charts/training-operator/Chart.yaml
@@ -0,0 +1,16 @@
apiVersion: v2
appVersion: 1.3.0
description: Makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes
home: https://github.com/kubeflow/training-operator
maintainers:
- name: sebastien-prudhomme
email: [email protected]
name: training-operator
sources:
- https://github.com/kubeflow/training-operator
- https://github.com/cowboysysop/charts/tree/master/charts/training-operator
version: 1.0.0
dependencies:
- name: common
version: 1.7.1
repository: https://charts.bitnami.com/bitnami/
149 changes: 149 additions & 0 deletions charts/training-operator/README.md
@@ -0,0 +1,149 @@
# Training Operator

[Training Operator](https://github.com/kubeflow/training-operator) makes it easy to run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kubernetes.

## TL;DR;

```bash
$ helm repo add cowboysysop https://cowboysysop.github.io/charts/
$ helm install my-release cowboysysop/training-operator
```

## Introduction

This chart bootstraps a Training Operator deployment on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.

## Prerequisites

- Kubernetes 1.16+
- Helm 3.0+

## Installing

Install the chart using:

```bash
$ helm repo add cowboysysop https://cowboysysop.github.io/charts/
$ helm install my-release cowboysysop/training-operator
```

These commands deploy Training Operator on the Kubernetes cluster in the default configuration and with the release name `my-release`. The deployment configuration can be customized by specifying the customization parameters with the `helm install` command using the `--values` or `--set` arguments. Find more information in the [configuration section](#configuration) of this document.

## Upgrading

Replace the custom resource definitions created by the chart using:

```bash
$ kubectl replace -f crds/mxjob.yaml
$ kubectl replace -f crds/pytorchjob.yaml
$ kubectl replace -f crds/tfjob.yaml
$ kubectl replace -f crds/xgboostjob.yaml
```

Upgrade the chart deployment using:

```bash
$ helm upgrade my-release cowboysysop/training-operator
```

The command upgrades the existing `my-release` deployment to the latest release of the chart.

**TIP**: Use `helm repo update` to update information on available charts in the chart repositories.

## Uninstalling

Uninstall the `my-release` deployment using:

```bash
$ helm uninstall my-release
```

The command deletes the release named `my-release` and frees all the Kubernetes resources associated with the release.

**TIP**: With Helm 3, `helm uninstall` also removes the release history, which frees the release name for later use. Specify the `--keep-history` argument to retain the history instead.

Optionally, delete the custom resource definitions created by the chart using:

```bash
$ kubectl delete crd mxjobs.kubeflow.org
$ kubectl delete crd pytorchjobs.kubeflow.org
$ kubectl delete crd tfjobs.kubeflow.org
$ kubectl delete crd xgboostjobs.kubeflow.org
```

## Configuration

The following tables list all the configurable parameters exposed by the chart and their default values.

### Common parameters

| Name | Description | Default |
|---------------------|--------------------------------------------------------------------------------------------------------|---------|
| `kubeVersion` | Override Kubernetes version | `""` |
| `imagePullSecrets` | Docker registry secret names as an array | `[]` |
| `nameOverride` | Partially override `training-operator.fullname` template with a string (will prepend the release name) | `nil` |
| `fullnameOverride` | Fully override `training-operator.fullname` template with a string | `nil` |
| `commonAnnotations` | Annotations to add to all deployed objects | `{}` |
| `commonLabels` | Labels to add to all deployed objects | `{}` |

### Parameters

| Name | Description | Default |
|--------------------------------------|-------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
| `replicaCount` | Number of replicas | `1` |
| `image.repository` | Image name | `public.ecr.aws/j1r0q0g6/training/training-operator` |
| `image.tag` | Image tag | `760ac1171dd30039a7363ffa03c77454bd714da5` |
| `image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `pdb.create` | Specifies whether a pod disruption budget should be created | `false` |
| `pdb.minAvailable` | Minimum number/percentage of pods that should remain scheduled | `1` |
| `pdb.maxUnavailable` | Maximum number/percentage of pods that may be made unavailable | `nil` |
| `serviceAccount.create` | Specify whether to create a ServiceAccount | `true` |
| `serviceAccount.annotations` | ServiceAccount annotations | `{}` |
| `serviceAccount.name` | The name of the ServiceAccount to create | Generated using the `training-operator.fullname` template |
| `podAnnotations` | Additional pod annotations | `{}` |
| `podLabels` | Additional pod labels | `{}` |
| `podSecurityContext` | Pod security context | `{}` |
| `priorityClassName` | Priority class name | `nil` |
| `securityContext` | Container security context | `{}` |
| `livenessProbe.enabled` | Enable liveness probe | `true` |
| `livenessProbe.initialDelaySeconds` | Delay before the liveness probe is initiated | `0` |
| `livenessProbe.periodSeconds` | How often to perform the liveness probe | `10` |
| `livenessProbe.timeoutSeconds` | When the liveness probe times out | `1` |
| `livenessProbe.failureThreshold` | Minimum consecutive failures for the liveness probe to be considered failed after having succeeded | `3` |
| `livenessProbe.successThreshold` | Minimum consecutive successes for the liveness probe to be considered successful after having failed | `1` |
| `readinessProbe.enabled` | Enable readiness probe | `true` |
| `readinessProbe.initialDelaySeconds` | Delay before the readiness probe is initiated | `0` |
| `readinessProbe.periodSeconds` | How often to perform the readiness probe | `10` |
| `readinessProbe.timeoutSeconds` | When the readiness probe times out | `1` |
| `readinessProbe.failureThreshold` | Minimum consecutive failures for the readiness probe to be considered failed after having succeeded | `3` |
| `readinessProbe.successThreshold` | Minimum consecutive successes for the readiness probe to be considered successful after having failed | `1` |
| `resources` | CPU/Memory resource requests/limits | `{}` |
| `nodeSelector` | Node labels for pod assignment | `{}` |
| `tolerations` | Tolerations for pod assignment | `[]` |
| `affinity` | Map of node/pod affinities | `{}` |
| `extraArgs` | Additional container arguments | `{}` |
| `extraEnvVars` | Additional container environment variables | `[]` |
| `extraEnvVarsCM` | Name of existing ConfigMap containing additional container environment variables | `nil` |
| `extraEnvVarsSecret` | Name of existing Secret containing additional container environment variables | `nil` |
| `metrics.service.annotations` | Metrics service annotations | `{}` |
| `metrics.service.type` | Metrics service type | `ClusterIP` |
| `metrics.service.clusterIP` | Metrics static cluster IP address or None for headless service when service type is ClusterIP | `nil` |
| `metrics.service.port` | Metrics service port | `8080` |

Specify the parameters you wish to customize using the `--set` argument to the `helm install` command. For instance,

```bash
$ helm install my-release \
--set nameOverride=my-name cowboysysop/training-operator
```

The above command sets the `nameOverride` to `my-name`.

Alternatively, a YAML file that specifies the values for the above parameters can be provided while installing the chart. For example,

```bash
$ helm install my-release \
--values values.yaml cowboysysop/training-operator
```

**Tip**: You can use the default [values.yaml](values.yaml).
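
As a minimal sketch of such a values file, the snippet below writes a small override file that sets only a few of the parameters listed in the tables above (`podLabels`, `resources`, and `metrics.service.port`). The file name `custom-values.yaml`, the `team` label, and the resource figures are illustrative assumptions, not defaults or recommendations from the chart.

```bash
# Write an override file; any key left out keeps the chart default.
# The label and resource values below are hypothetical examples.
cat > custom-values.yaml <<'EOF'
podLabels:
  team: ml-platform
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 256Mi
metrics:
  service:
    port: 8080
EOF
```

The resulting file can then be passed to `helm install` with `--values custom-values.yaml`. Running `helm template` with the same `--values` argument is one way to inspect the rendered manifests before installing.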
1 change: 1 addition & 0 deletions charts/training-operator/ci/default-values.yaml
@@ -0,0 +1 @@
fullnameOverride: training-operator