fix: adding Platform Support and ML Framework Support sections in the README; fixing some typos in the README #20

Status: Closed. Wants to merge 2 commits.
18 changes: 15 additions & 3 deletions README.md
@@ -7,6 +7,9 @@ This documentation serves as a reference for the available HyperPod CLI commands

## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Platform Support](#platform-support)
- [ML Framework Support](#ml-framework-support)
- [Installation](#installation)
- [Usage](#usage)
- [Listing Clusters](#listing-clusters)
@@ -30,6 +33,15 @@ The SageMaker HyperPod CLI is a tool that helps submit training jobs to the Amaz
- Or you can follow the [Readme under helm_chart folder](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/readme.md) to install Kubeflow Training Operator.
- Configure [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) to point to the correct region where your HyperPod clusters are located.

## Platform Support

The SageMaker HyperPod CLI currently supports Linux and macOS. Windows support is a work in progress.

## ML Framework Support

The SageMaker HyperPod CLI currently supports starting training jobs with:
- PyTorch ML framework. Version requirements: PyTorch > 1.10
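Since the CLI gates on a minimum framework version, here is a minimal sketch of how such a floor could be checked against a version string. The helper below is illustrative and not part of the CLI; in practice the string would come from `torch.__version__`:

```python
def meets_min_version(version: str, minimum: tuple = (1, 10, 0)) -> bool:
    """Return True if a 'major.minor.patch' version string exceeds `minimum`.

    Illustrative helper, not part of the HyperPod CLI.
    """
    core = version.split("+")[0]  # drop local build tags such as '2.1.0+cu118'
    parts = []
    for piece in core.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    parts += [0] * (3 - len(parts))  # pad '2.1' to '2.1.0'
    return tuple(parts) > minimum

print(meets_min_version("2.1.0+cu118"))  # True: 2.1.0 is above 1.10
print(meets_min_version("1.9.1"))        # False: below the 1.10 floor
```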

## Installation

1. Make sure that your local python version is 3.8, 3.9, 3.10 or 3.11.
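That requirement can be verified up front; a small stand-alone sketch (the helper name is made up for illustration):

```python
import sys

# Interpreter versions the CLI documents as supported.
SUPPORTED_MINORS = {(3, 8), (3, 9), (3, 10), (3, 11)}

def python_is_supported(version_info=None) -> bool:
    """Check an interpreter version tuple against the documented matrix."""
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) in SUPPORTED_MINORS

print(python_is_supported((3, 9, 7)))  # True
print(python_is_supported((3, 7, 0)))  # False
```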
@@ -98,7 +110,7 @@ hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--or
* `region` (string) - Optional. The region that the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
* `clusters` (list[string]) - Optional. A list of SageMaker HyperPod cluster names that users want to check the capacity for. This is useful for users who know some of their most commonly used clusters and want to check the capacity status of the clusters in the AWS account.
* `orchestrator` (enum) - Optional. The orchestrator type for the cluster. Currently, `'eks'` is the only available option.
- * `output` (enum) - Optional. The output format. Available values are `TABLE` and `JSON`. The default value is `JSON`.
+ * `output` (enum) - Optional. The output format. Available values are `table` and `json`. The default value is `json`.
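With the default `json` output, the result can be consumed programmatically. A sketch with a made-up payload shape; the `Cluster` field name is hypothetical, so check the schema your CLI version actually emits before relying on it:

```python
import json

def cluster_names(raw: str) -> list:
    """Pull cluster names out of JSON emitted by `hyperpod get-clusters`.

    The 'Cluster' key is an illustrative, hypothetical field name.
    """
    return [entry.get("Cluster") for entry in json.loads(raw)]

sample = '[{"Cluster": "hp-cluster-1"}, {"Cluster": "hp-cluster-2"}]'
print(cluster_names(sample))  # ['hp-cluster-1', 'hp-cluster-2']
```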

### Connecting to a Cluster

@@ -121,13 +133,13 @@ hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <
```

* `job-name` (string) - Required. The name of the job.
- * `job-kind` (string) - Optional. The training job kind. The job types currently supported are `kubeflow` and `PyTorchJob`.
+ * `job-kind` (string) - Optional. The training job kind. The job type currently supported is `kubeflow/PyTorchJob`.
* `namespace` (string) - Optional. The namespace to use. If not specified, this command uses the [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
* `image` (string) - Required. The image used when creating the training job.
* `pull-policy` (enum) - Optional. The policy to pull the container image. Valid values are `Always`, `IfNotPresent`, and `Never`, as available from the PyTorchJob. The default is `Always`.
* `command` (string) - Optional. The command to run the entrypoint script. Currently, only `torchrun` is supported.
* `entry-script` (string) - Required. The path to the training script.
- * `script-args` (list[string]) - Optional. The list of arguments for entryscripts.
+ * `script-args` (list[string]) - Optional. The list of arguments for entry scripts.
* `environment` (dict[string, string]) - Optional. The environment variables (key-value pairs) to set in the containers.
* `node-count` (int) - Required. The number of nodes (instances) to launch the jobs on.
* `instance-type` (string) - Required. The instance type to launch the job on. Note that the instance types you can use are the available instances within your SageMaker quotas for instances prefixed with `ml`.
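Putting the required options together, here is a sketch of assembling the invocation as a shell-safe string. The values are illustrative, and only flags documented above are used:

```python
import shlex

def build_start_job_cmd(job_name: str, image: str, entry_script: str,
                        node_count: int, instance_type: str) -> str:
    """Compose a `hyperpod start-job` command from its required options."""
    args = [
        "hyperpod", "start-job",
        "--job-name", job_name,
        "--image", image,
        "--entry-script", entry_script,
        "--node-count", str(node_count),
        "--instance-type", instance_type,
    ]
    return shlex.join(args)  # quotes anything that needs quoting

cmd = build_start_job_cmd("demo-job", "pytorch/pytorch:latest",
                          "/opt/train/train.py", 2, "ml.p4d.24xlarge")
print(cmd)
```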
4 changes: 3 additions & 1 deletion src/hyperpod_cli/clients/kubernetes_client.py
@@ -46,7 +46,9 @@ def __new__(cls, is_get_capacity: bool = False) -> "KubernetesClient":
         if cls._instance is None:
             cls._instance = super(KubernetesClient, cls).__new__(cls)
             config.load_kube_config(
-                config_file=KUBE_CONFIG_PATH if not is_get_capacity else TEMP_KUBE_CONFIG_FILE
+                config_file=KUBE_CONFIG_PATH
+                if not is_get_capacity
+                else TEMP_KUBE_CONFIG_FILE
             )  # or config.load_incluster_config() for in-cluster config
             cls._instance._kube_client = client.ApiClient()
         return cls._instance
3 changes: 1 addition & 2 deletions src/hyperpod_cli/validators/job_validator.py
@@ -20,7 +20,7 @@
     RestartPolicy,
     KUEUE_QUEUE_NAME_LABEL_KEY,
     HYPERPOD_AUTO_RESUME_ANNOTATION_KEY,
-    HYPERPOD_MAX_RETRY_ANNOTATION_KEY
+    HYPERPOD_MAX_RETRY_ANNOTATION_KEY,
 )
 from hyperpod_cli.constants.hyperpod_instance_types import (
     HyperpodInstanceType,
@@ -275,4 +275,3 @@ def _validate_json_str(
         # Catch any other exceptions
         logger.error(f"An unexpected error occurred: {e}")
         return False
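The surrounding `_validate_json_str` follows a parse-and-catch pattern; here is a self-contained sketch of the same shape, simplified and without the real validator's additional checks:

```python
import json
import logging

logger = logging.getLogger(__name__)

def validate_json_str(json_str: str) -> bool:
    """Return True only if the input parses as JSON (simplified sketch)."""
    try:
        json.loads(json_str)
        return True
    except json.JSONDecodeError as e:
        logger.error(f"Invalid JSON: {e}")
        return False
    except Exception as e:
        # Catch any other exceptions, as the real validator does
        logger.error(f"An unexpected error occurred: {e}")
        return False

print(validate_json_str('{"node-count": 2}'))  # True
print(validate_json_str("not json"))           # False
```

Returning `False` from the broad `except` (rather than re-raising) keeps validation failures as a boolean result the CLI can report cleanly instead of a traceback.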
