The Amazon SageMaker HyperPod command-line interface (HyperPod CLI) is a tool that helps manage training jobs on the SageMaker HyperPod clusters orchestrated by Amazon EKS.
This documentation serves as a reference for the available HyperPod CLI commands. For a comprehensive user guide, see Orchestrating SageMaker HyperPod clusters with Amazon EKS in the Amazon SageMaker Developer Guide.
The SageMaker HyperPod CLI provides a set of commands for managing the full lifecycle of training jobs: submitting, describing, listing, and canceling jobs, as well as accessing logs and executing commands within a job's containers. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core job-management actions on SageMaker HyperPod clusters orchestrated by Amazon EKS.
- The HyperPod CLI currently supports starting only kubeflow/PyTorchJob jobs. Before you start a job, install the Kubeflow Training Operator in one of the following ways:
  - Follow the Kubeflow public documentation.
  - Follow the README under the helm_chart folder.
- Configure the AWS CLI to point to the Region where your HyperPod clusters are located.
The SageMaker HyperPod CLI currently supports Linux and macOS. Windows is not currently supported.
The SageMaker HyperPod CLI currently supports starting training jobs with:
- PyTorch ML Framework. Version requirements: PyTorch >= 1.10
- Make sure that your local Python version is 3.8, 3.9, 3.10, or 3.11.
- Install helm. The SageMaker HyperPod CLI uses Helm to start training jobs. See also the Helm installation guide.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
- Clone and install the sagemaker-hyperpod-cli package.
git clone git@github.com:aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install .
- Verify that the installation succeeded by running the following command.
hyperpod --help
- If you have a running HyperPod cluster, you can try running a training job using the sample configuration file provided at /examples/basic-job-example-config.yaml.
  - Get your HyperPod clusters to show their capacities.
hyperpod get-clusters
  - Connect to one HyperPod cluster and specify a namespace you have access to.
hyperpod connect-cluster --cluster-name <cluster-name> [--namespace <namespace>]
  - Start a job in your cluster. Change the instance_type in the YAML file to match the one in your HyperPod cluster. Also change the namespace to the one you want to submit the job to; the example uses the kubeflow namespace. PyTorch must be installed in your cluster.
hyperpod start-job --config-file ./examples/basic-job-example-config.yaml
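The prerequisites listed in the installation steps (Python 3.8 to 3.11, Linux or macOS, Helm available) can be checked before installing. The following is a minimal sketch, not part of the HyperPod CLI: the function name find_problems is hypothetical, and it assumes that having helm on PATH is a sufficient sign that Helm is installed.

```python
import shutil
import sys

# Values taken from the requirements stated in this document.
SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10), (3, 11)}
SUPPORTED_PLATFORMS = ("linux", "darwin")  # sys.platform prefixes for Linux and macOS


def find_problems(python_version, platform, helm_path):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    major_minor = tuple(python_version[:2])
    if major_minor not in SUPPORTED_PYTHON:
        problems.append("Python %d.%d is outside the supported 3.8-3.11 range" % major_minor)
    if not platform.startswith(SUPPORTED_PLATFORMS):
        problems.append("platform %r is not Linux or macOS" % platform)
    if helm_path is None:
        problems.append("helm was not found on PATH")
    return problems


if __name__ == "__main__":
    # shutil.which returns None when the executable cannot be found.
    for problem in find_problems(sys.version_info, sys.platform, shutil.which("helm")):
        print("WARNING:", problem)
```

Passing the version, platform, and Helm path in as arguments keeps the check easy to exercise with values other than those of the current machine.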
The HyperPod CLI provides the following commands:
- Getting Clusters
- Connecting to a Cluster
- Submitting a Job
- Getting Job Details
- Listing Jobs
- Canceling a Job
- Listing Pods
- Accessing Logs
- Executing Commands
This command lists the available SageMaker HyperPod clusters and their capacity information.
hyperpod get-clusters [--region <region>] [--clusters <cluster1,cluster2>] [--orchestrator <eks>] [--output <json|table>]
- region (string) - Optional. The region where the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
- clusters (list[string]) - Optional. A list of SageMaker HyperPod cluster names that users want to check the capacity for. This is useful for users who know some of their most commonly used clusters and want to check the capacity status of the clusters in the AWS account.
- orchestrator (enum) - Optional. The orchestrator type for the cluster. Currently, 'eks' is the only available option.
- output (enum) - Optional. The output format. Available values are table and json. The default value is json.
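If you script the CLI from Python, the options above map onto an argument list fairly directly. The sketch below is illustrative only: get_clusters_command is a hypothetical helper, and the region and cluster names are placeholders. It only assembles the command; it does not run it.

```python
import shlex


def get_clusters_command(region=None, clusters=None, output=None):
    """Build the argv list for `hyperpod get-clusters` from the documented options."""
    argv = ["hyperpod", "get-clusters"]
    if region:
        argv += ["--region", region]
    if clusters:
        # The usage line shows cluster names as a single comma-separated value.
        argv += ["--clusters", ",".join(clusters)]
    if output:
        if output not in ("json", "table"):
            raise ValueError("output must be 'json' or 'table'")
        argv += ["--output", output]
    return argv


# Placeholder values for illustration.
argv = get_clusters_command(region="us-west-2", clusters=["cluster-a", "cluster-b"], output="table")
print(shlex.join(argv))
```

To actually run the command, the list can be passed to subprocess.run(argv, check=True) on a machine where the CLI is installed.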
This command configures the local kubectl environment to interact with the specified SageMaker HyperPod cluster and namespace.
hyperpod connect-cluster --cluster-name <cluster-name> [--region <region>] [--namespace <namespace>]
- cluster-name (string) - Required. The name of the SageMaker HyperPod cluster to connect to.
- region (string) - Optional. The region where the SageMaker HyperPod and EKS clusters are located. If not specified, it will be set to the region from the current AWS account credentials.
- namespace (string) - Optional. The namespace that you want to connect to. If not specified, this command uses the Kubernetes namespace of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
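As a small illustration of the required and optional options above, the following sketch builds a connect-cluster invocation. The helper name connect_cluster_command and the cluster and namespace values are hypothetical.

```python
def connect_cluster_command(cluster_name, region=None, namespace=None):
    """Build the argv list for `hyperpod connect-cluster`."""
    if not cluster_name:
        # --cluster-name is the only required option.
        raise ValueError("cluster_name is required")
    argv = ["hyperpod", "connect-cluster", "--cluster-name", cluster_name]
    if region:
        argv += ["--region", region]
    if namespace:
        # When omitted, the CLI falls back to the namespace described above.
        argv += ["--namespace", namespace]
    return argv


# Placeholder cluster and namespace names.
print(" ".join(connect_cluster_command("my-cluster", namespace="kubeflow")))
```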
This command submits a new training job to the connected SageMaker HyperPod cluster.
hyperpod start-job --job-name <job-name> [--namespace <namespace>] [--job-kind <kubeflow/PyTorchJob>] [--image <image>] [--command <command>] [--entry-script <script>] [--script-args <arg1 arg2>] [--environment <key=value>] [--pull-policy <Always|IfNotPresent|Never>] [--instance-type <instance-type>] [--node-count <count>] [--tasks-per-node <count>] [--label-selector <key=value>] [--deep-health-check-passed-nodes-only] [--scheduler-type <Kueue>] [--queue-name <queue-name>] [--priority <priority>] [--auto-resume] [--max-retry <count>] [--restart-policy <Always|OnFailure|Never|ExitCode>] [--volumes <volume1,volume2>] [--persistent-volume-claims <claim1:/mount/path,claim2:/mount/path>] [--results-dir <dir>] [--service-account-name <account>]
- job-name (string) - Required. The name of the job.
- job-kind (string) - Optional. The training job kind. The job type currently supported is kubeflow/PyTorchJob.
- namespace (string) - Optional. The namespace to use. If not specified, this command uses the Kubernetes namespace of the Amazon EKS cluster associated with the SageMaker HyperPod cluster in your AWS account.
- image (string) - Required. The image used when creating the training job.
- pull-policy (enum) - Optional. The policy to pull the container image. Valid values are Always, IfNotPresent, and Never, as available from the PyTorchJob. The default is Always.
- command (string) - Optional. The command to run the entrypoint script. Currently, only torchrun is supported.
- entry-script (string) - Required. The path to the training script.
- script-args (list[string]) - Optional. The list of arguments for entry scripts.
- environment (dict[string, string]) - Optional. The environment variables (key-value pairs) to set in the containers.
- node-count (int) - Required. The number of nodes (instances) to launch the jobs on.
- instance-type (string) - Required. The instance type to launch the job on. Note that the instance types you can use are the available instances within your SageMaker quotas for instances prefixed with ml.
- tasks-per-node (int) - Optional. The number of devices to use per instance.
- label-selector (dict[string, list[string]]) - Optional. A dictionary of labels and their values that will override the predefined node selection rules based on the SageMaker HyperPod node-health-status label and values. If users provide this field, the CLI will launch the job with this customized label selection.
- deep-health-check-passed-nodes-only (bool) - Optional. If set to true, the job will be launched only on nodes that have the deep-health-check-status label with the value passed.
- scheduler-type (enum) - Optional. The scheduler type to use. Currently, only Kueue is supported.
- queue-name (string) - Optional. The name of the queue to submit the job to, which is created by the cluster admin users in your AWS account.
- priority (string) - Optional. The priority for the job, which needs to be created by the cluster admin users and match the name in the cluster.
- auto-resume (bool) - Optional. The flag to enable HyperPod resilience job auto resume. If set to true, the job will automatically resume after pod or node failure. To enable auto-resume, you also should set restart-policy to OnFailure.
- max-retry (int) - Optional. The maximum number of retries for HyperPod resilience job auto resume. If auto-resume is set to true and max-retry is not specified, the default value is 1.
- restart-policy (enum) - Optional. The PyTorchJob restart policy, which can be Always, OnFailure, Never, or ExitCode. The default is OnFailure. To enable auto-resume, restart-policy should be set to OnFailure.
- volumes (list[string]) - Optional. Add a temp directory for containers to store data in the hosts.
- persistent-volume-claims (list[string]) - Optional. The pre-created persistent volume claims (PVCs) that the data scientist can choose to mount to the containers. The cluster admin users should create PVCs and provide them to the data scientist users.
- results-dir (string) - Optional. The location to store the results, checkpoints, and logs. The cluster admin users should set this up and provide it to the data scientist users. The default value is ./results.
- service-account-name - Optional. The Kubernetes service account that allows Pods to access resources based on the permissions granted to that service account. The cluster admin users should create the Kubernetes service account.
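Several of the options above interact: auto-resume expects restart-policy to be OnFailure, instance types carry the ml prefix, and persistent volume claims are written as claim:/mount/path pairs. The following sketch checks those constraints before assembling a command. It is a hedged illustration rather than the CLI's own validation: start_job_command is a hypothetical helper, the job, image, script, and instance names are placeholders, and passing --auto-resume as a bare flag is an assumption based on the usage line.

```python
RESTART_POLICIES = ("Always", "OnFailure", "Never", "ExitCode")


def start_job_command(job_name, image, entry_script, node_count, instance_type,
                      auto_resume=False, max_retry=None, restart_policy=None,
                      persistent_volume_claims=None):
    """Build the argv list for `hyperpod start-job`, checking documented constraints."""
    if not instance_type.startswith("ml."):
        raise ValueError("instance_type should be an ml-prefixed instance type")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if restart_policy is not None and restart_policy not in RESTART_POLICIES:
        raise ValueError("unknown restart policy: %s" % restart_policy)
    if auto_resume:
        # The option list says auto-resume should be combined with OnFailure.
        if restart_policy not in (None, "OnFailure"):
            raise ValueError("auto-resume requires restart-policy OnFailure")
        restart_policy = "OnFailure"

    argv = ["hyperpod", "start-job",
            "--job-name", job_name,
            "--image", image,
            "--entry-script", entry_script,
            "--node-count", str(node_count),
            "--instance-type", instance_type]
    if auto_resume:
        argv.append("--auto-resume")  # assumption: bare flag, as in the usage line
    if max_retry is not None:
        argv += ["--max-retry", str(max_retry)]
    if restart_policy:
        argv += ["--restart-policy", restart_policy]
    if persistent_volume_claims:
        # Mapping of claim name to mount path, joined as claim:/mount/path pairs.
        pairs = ["%s:%s" % (claim, path) for claim, path in persistent_volume_claims.items()]
        argv += ["--persistent-volume-claims", ",".join(pairs)]
    return argv


# Placeholder values for illustration.
print(" ".join(start_job_command(
    "demo-job", "my-registry/pytorch-train:latest", "/opt/train/train.py",
    node_count=2, instance_type="ml.g5.8xlarge",
    auto_resume=True, max_retry=3,
    persistent_volume_claims={"claim1": "/data"})))
```

Failing early in the script avoids submitting a job whose auto-resume setting would conflict with its restart policy.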
This command displays detailed information about a specific training job.
hyperpod get-job --job-name <job-name> [--namespace <namespace>] [--verbose]
- job-name (string) - Required. The name of the job.
- namespace (string) - Optional. The namespace to describe the job in. If not provided, the CLI will try to describe the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will describe the job from the specified namespace.
- verbose (flag) - Optional. If set, the command enables verbose mode and prints out more detailed output with additional fields.
This command lists all the training jobs in the connected SageMaker HyperPod cluster or namespace.
hyperpod list-jobs [--namespace <namespace>] [--all-namespaces] [--selector <key=value>]
- namespace (string) - Optional. The namespace to list the jobs in. If not provided, this command lists the jobs in the namespace specified during connecting to the cluster. If the namespace is provided and if the user has access to the namespace, this command lists the jobs from the specified namespace.
- all-namespaces (flag) - Optional. If set, this command lists jobs from all namespaces the data scientist users have access to. The namespace in the current AWS account credentials will be ignored, even if specified with the --namespace option.
- selector (string) - Optional. A label selector to filter the listed jobs. The selector supports the '=', '==', and '!=' operators (e.g., -l key1=value1,key2=value2).
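A selector string in the key1=value1,key2=value2 form shown above can be assembled from dictionaries. This is a minimal sketch: build_selector and list_jobs_command are hypothetical helpers, the label keys and values are placeholders, and combining '=' and '!=' terms with commas assumes the selector follows standard Kubernetes label selector syntax.

```python
def build_selector(equals=None, not_equals=None):
    """Join label requirements into a comma-separated selector string."""
    parts = ["%s=%s" % (key, value) for key, value in (equals or {}).items()]
    parts += ["%s!=%s" % (key, value) for key, value in (not_equals or {}).items()]
    return ",".join(parts)


def list_jobs_command(namespace=None, all_namespaces=False, selector=None):
    """Build the argv list for `hyperpod list-jobs`."""
    argv = ["hyperpod", "list-jobs"]
    if all_namespaces:
        # A namespace is ignored when all namespaces are requested, so it is not passed.
        argv.append("--all-namespaces")
    elif namespace:
        argv += ["--namespace", namespace]
    if selector:
        argv += ["--selector", selector]
    return argv


# Placeholder labels for illustration.
selector = build_selector({"team": "nlp", "stage": "dev"}, {"owner": "bot"})
print(" ".join(list_jobs_command(namespace="kubeflow", selector=selector)))
```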
This command cancels and deletes a running training job.
hyperpod cancel-job --job-name <job-name> [--namespace <namespace>]
- job-name (string) - Required. The name of the job to cancel.
- namespace (string) - Optional. The namespace to cancel the job in. If not provided, the CLI will try to cancel the job in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will cancel the job from the specified namespace.
This command lists all the pods associated with a specific training job.
hyperpod list-pods --job-name <job-name> [--namespace <namespace>]
- job-name (string) - Required. The name of the job to list pods for.
- namespace (string) - Optional. The namespace to list the pods in. If not provided, the CLI will list the pods in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will list the pods from the specified namespace.
This command retrieves the logs for a specific pod within a training job.
hyperpod get-log --job-name <job-name> --pod <pod-name> [--namespace <namespace>]
- job-name (string) - Required. The name of the job to get the log for.
- pod (string) - Required. The name of the pod to get the log from.
- namespace (string) - Optional. The namespace to get the log from. If not provided, the CLI will get the log from the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will get the log from the pod in the specified namespace.
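Because get-log takes a single pod name, collecting logs for a multi-node job means one invocation per pod. The sketch below generates those invocations. It is illustrative only: get_log_commands is a hypothetical helper, and the pod names are placeholders that you would replace with names reported by hyperpod list-pods.

```python
def get_log_commands(job_name, pod_names, namespace=None):
    """Return one `hyperpod get-log` argv list per pod of the job."""
    commands = []
    for pod in pod_names:
        argv = ["hyperpod", "get-log", "--job-name", job_name, "--pod", pod]
        if namespace:
            argv += ["--namespace", namespace]
        commands.append(argv)
    return commands


# Placeholder job and pod names for illustration.
for argv in get_log_commands("demo-job", ["demo-job-worker-0", "demo-job-worker-1"]):
    print(" ".join(argv))
```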
This command executes a specified command within the container of a pod associated with a training job.
hyperpod exec --job-name <job-name> [--namespace <namespace>] [-p <pod-name>] [--all-pods] -- <command>
- job-name (string) - Required. The name of the job whose pod containers the command runs in.
- bash-command (string) - Required. The bash command(s) to run.
- namespace (string) - Optional. The namespace to execute the command in. If not provided, the CLI will try to execute the command in the pod in the namespace set by the user while connecting to the cluster. If provided, and the user has access to the namespace, the CLI will execute the command in the pod from the specified namespace.
- pod (string) - Optional. The name of the pod to execute the command in. You must provide either --pod or --all-pods.
- all-pods (flag) - Optional. If set, the command will be executed in all pods associated with the job.
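The requirement to provide either --pod or --all-pods can be enforced before the command is built. The following is a sketch under stated assumptions: exec_command is a hypothetical helper, it treats the two options as mutually exclusive, it uses the long --pod form named in the option list, and the job, pod, and command values are placeholders.

```python
def exec_command(job_name, command, pod=None, all_pods=False, namespace=None):
    """Build the argv list for `hyperpod exec`, requiring exactly one pod target."""
    if bool(pod) == bool(all_pods):
        # Either both were given or neither was; this sketch rejects both cases.
        raise ValueError("provide exactly one of pod or all_pods")
    if not command:
        raise ValueError("command must not be empty")
    argv = ["hyperpod", "exec", "--job-name", job_name]
    if namespace:
        argv += ["--namespace", namespace]
    if pod:
        argv += ["--pod", pod]
    else:
        argv.append("--all-pods")
    # Everything after "--" is the command to run inside the container.
    return argv + ["--"] + list(command)


# Placeholder values for illustration.
print(" ".join(exec_command("demo-job", ["ls", "-l", "/workspace"], all_pods=True)))
```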