

# Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-run-jobs"></a>

The following topics provide procedures and examples for accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters orchestrated with Amazon EKS. There are many ways to run ML workloads on a HyperPod cluster, depending on how you have set up its environment.

**Note**  
When running jobs through the SageMaker HyperPod CLI or `kubectl`, HyperPod can track compute utilization (GPU/CPU hours) across namespaces (teams). These metrics power usage reports, which provide:  
+ Visibility into allocated versus borrowed resource consumption
+ Team resource utilization for auditing (up to 180 days)
+ Cost attribution aligned with Task Governance policies
To use usage reports, you must install the usage report infrastructure. We strongly recommend configuring [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to enforce compute quotas and enable granular cost attribution.  
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

**Tip**  
For a hands-on experience and guidance on how to set up and use a SageMaker HyperPod cluster orchestrated with Amazon EKS, we recommend taking this [Amazon EKS Support in SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e) workshop.

Data scientists can train foundation models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. They use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) and native `kubectl` commands to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI enables job submission through a training job schema file, and provides capabilities for listing, describing, canceling, and executing jobs. Scientists can use the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/overview/) within the compute quotas managed by HyperPod, and [SageMaker AI-managed MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) to manage ML experiments and training runs. 

**Topics**
+ [Installing the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-access-nodes.md)
+ [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md)
+ [Running jobs using the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.md)
+ [Running jobs using `kubectl`](sagemaker-hyperpod-eks-run-jobs-kubectl.md)

# Installing the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-access-nodes"></a>

SageMaker HyperPod provides the [SageMaker HyperPod command line interface](https://github.com/aws/sagemaker-hyperpod-cli) (CLI) package. 

1. Verify that the version of Python on your local machine is between 3.8 and 3.11.

1. Check the prerequisites in the `README` markdown file in the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) package.

1. Clone the SageMaker HyperPod CLI package from GitHub.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   ```

1. Install the SageMaker HyperPod CLI.

   ```
   cd sagemaker-hyperpod-cli && pip install .
   ```

1. Verify that the SageMaker HyperPod CLI is successfully installed by running the following command. 

   ```
   hyperpod --help
   ```
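If you want to check the Python requirement from step 1 programmatically, the following is a minimal sketch. The `is_supported_python` helper is hypothetical, not part of the CLI; it only encodes the 3.8–3.11 range stated above.

```python
import sys

# The SageMaker HyperPod CLI requires Python 3.8 through 3.11 (inclusive).
def is_supported_python(major, minor):
    """Return True if (major, minor) falls in the supported range."""
    return (3, 8) <= (major, minor) <= (3, 11)

if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "supported" if is_supported_python(major, minor) else "not supported"
    print(f"Python {major}.{minor} is {status} by the HyperPod CLI")
```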

**Note**  
If you are a data scientist and want to use the SageMaker HyperPod CLI, make sure that your IAM role is set up properly by your cluster admins following the instructions at [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

# SageMaker HyperPod CLI commands
<a name="sagemaker-hyperpod-eks-hyperpod-cli-reference"></a>

The following table summarizes the SageMaker HyperPod CLI commands.

**Note**  
For a complete CLI reference, see [README](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#sagemaker-hyperpod-command-line-interface) in the [SageMaker HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).


| SageMaker HyperPod CLI command | Entity  | Description | 
| --- | --- | --- | 
| hyperpod get-clusters | cluster/access | Lists all clusters for which the user has IAM permissions to submit training workloads. Also provides a current snapshot of available instances that are not running any workloads or jobs, along with the maximum capacity, grouped by health check status (for example, BurnInPassed) | 
| hyperpod connect-cluster | cluster/access | Configures kubectl to operate on the specified HyperPod cluster and namespace | 
| hyperpod start-job  | job | Submits a job to the targeted cluster. The job name must be unique at the namespace level. Users can override the YAML spec by passing values as CLI arguments | 
| hyperpod get-job | job | Displays metadata of the submitted job | 
| hyperpod list-jobs | job | Lists all jobs in the connected cluster/namespace for which the user has IAM permissions to submit training workloads | 
| hyperpod cancel-job | job | Stops and deletes the job and releases the underlying compute resources. A canceled job cannot be resumed; start a new job if needed | 
| hyperpod list-pods | pod | Lists all the pods in the given job in a namespace | 
| hyperpod get-log | pod | Retrieves the logs of a particular pod in a specified job | 
| hyperpod exec | pod | Runs a bash command in the shell of the specified pod(s) and publishes the output | 
| hyperpod --help | utility | Lists all supported commands | 

# Running jobs using the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli"></a>

To run jobs, make sure that you installed Kubeflow Training Operator in the EKS clusters. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

Run the `hyperpod get-clusters` command to get the list of available HyperPod clusters.

```
hyperpod get-clusters
```

Run the `hyperpod connect-cluster` command to configure the SageMaker HyperPod CLI with the EKS cluster orchestrating the HyperPod cluster.

```
hyperpod connect-cluster --cluster-name <hyperpod-cluster-name>
```

Use the `hyperpod start-job` command to run a job. The following shows the command with its required options. 

```
hyperpod start-job \
    --job-name <job-name> \
    --image <docker-image-uri> \
    --entry-script <entrypoint-script> \
    --instance-type <ml.instance.type> \
    --node-count <integer>
```

The `hyperpod start-job` command also comes with various options such as job auto-resume and job scheduling.

## Enabling job auto-resume
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-enable-auto-resume"></a>

The `hyperpod start-job` command also has the following options to specify job auto-resume. To enable job auto-resume to work with the SageMaker HyperPod node resiliency features, you must set the `restart-policy` option to `OnFailure`. The job must be running under the `kubeflow` namespace or a namespace prefixed with `hyperpod`.
+ `[--auto-resume <bool>]` - Optional. Enables job auto-resume after failure. The default is `false`.
+ `[--max-retry <int>]` - Optional. If `auto-resume` is `true`, the default value of `max-retry` is 1 if not specified.
+ `[--restart-policy <enum>]` - Optional. The PyTorchJob restart policy. Available values are `Always`, `OnFailure`, `Never`, or `ExitCode`. The default value is `OnFailure`. 

```
hyperpod start-job \
    ... // required options \
    --auto-resume true \
    --max-retry 3 \
    --restart-policy OnFailure
```

## Running jobs with scheduling options
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-scheduling"></a>

The `hyperpod start-job` command has the following options to set up the job with queuing mechanisms. 

**Note**  
You need [Kueue](https://kueue.sigs.k8s.io/docs/overview/) installed in the EKS cluster. If you haven't installed it, follow the instructions in [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ `[--scheduler-type <enum>]` - Optional. Specifies the scheduler type. The default is `Kueue`.
+ `[--queue-name <string>]` - Optional. Specifies the name of the [Local Queue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) or [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) you want to submit with the job. The queue should be created by cluster admins using `CreateComputeQuota`.
+ `[--priority <string>]` - Optional. Specifies the name of the [Workload Priority Class](https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/), which should be created by cluster admins.

```
hyperpod start-job \
    ... // required options
    --scheduler-type Kueue \
    --queue-name high-priority-queue \
    --priority high
```

## Running jobs from a configuration file
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-from-config"></a>

Alternatively, you can create a job configuration file containing all the parameters required by the job, and then pass this configuration file to the `hyperpod start-job` command using the `--config-file` option. To do so:

1. Create your job configuration file with the required parameters. For a starting point, see the [baseline configuration file](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.html#sagemaker-hyperpod-eks-hyperpod-cli-from-config).

1. Start the job using the configuration file as follows.

   ```
   hyperpod start-job --config-file /path/to/test_job.yaml
   ```
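For illustration only, a configuration file might look like the following sketch. The field names here are assumptions that mirror the CLI options shown earlier (`--job-name`, `--image`, and so on), not the authoritative schema; refer to the baseline configuration file for the exact format.

```
# Hypothetical job configuration sketch; keys mirror the CLI options
# and are illustrative only -- consult the baseline configuration file.
job-name: test-job
image: <docker-image-uri>
entry-script: train.py
instance-type: <ml.instance.type>
node-count: 2
auto-resume: true
max-retry: 3
restart-policy: OnFailure
```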

**Tip**  
For a complete list of parameters of the `hyperpod start-job` command, see the [Submitting a Job](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#submitting-a-job) section in the `README.md` of the SageMaker HyperPod CLI GitHub repository.

# Running jobs using `kubectl`
<a name="sagemaker-hyperpod-eks-run-jobs-kubectl"></a>

**Note**  
Training job auto resume requires Kubeflow Training Operator release version `1.7.0`, `1.8.0`, or `1.8.1`.

You must install the Kubeflow Training Operator in the clusters using a Helm chart. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). Verify that the Kubeflow Training Operator control plane is properly set up by running the following command.

```
kubectl get pods -n kubeflow
```

This should return an output similar to the following.

```
NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
```

**To submit a training job**

To run a training job, prepare the job configuration file and run the [kubectl apply](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply) command as follows.

```
kubectl apply -f /path/to/training_job.yaml
```
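For reference, the following is a minimal sketch of what a `training_job.yaml` for a Kubeflow PyTorchJob might contain. The image, command, and replica values are placeholders, not tested values.

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: training-job
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <docker-image-uri>   # placeholder
              command: ["python", "/workspace/train.py"]   # placeholder
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <docker-image-uri>   # placeholder
              command: ["python", "/workspace/train.py"]   # placeholder
```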

**To describe a training job**

To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.

```
kubectl get pytorchjob training-job -n kubeflow -o yaml
```

**To stop a training job and delete EKS resources**

To stop a training job, use `kubectl delete`. The following is an example of stopping the training job created from the configuration file `training_job.yaml`.

```
kubectl delete -f /path/to/training_job.yaml 
```

This should return the following output.

```
pytorchjob.kubeflow.org "training-job" deleted
```

**To enable job auto-resume**

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

Ensure that there are sufficient nodes in the cluster that have passed the SageMaker HyperPod health check. Such nodes have the label `sagemaker.amazonaws.com/node-health-status` set to `Schedulable`. We recommend that you include a node selector in the job YAML file to select nodes with the appropriate configuration as follows.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```
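As a rough illustration of what this node selector does, the following Python sketch mimics label-based filtering. The node data and helper function are hypothetical; the actual selection is performed by the Kubernetes scheduler.

```python
# Sketch of nodeSelector-style filtering over node labels (illustrative
# only; this is not the Kubernetes API, just the selection logic).
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"

def schedulable_nodes(nodes):
    """Return the names of nodes whose health label is 'Schedulable'."""
    return [node["name"] for node in nodes
            if node.get("labels", {}).get(HEALTH_LABEL) == "Schedulable"]

nodes = [
    {"name": "hyperpod-node-1", "labels": {HEALTH_LABEL: "Schedulable"}},
    {"name": "hyperpod-node-2", "labels": {HEALTH_LABEL: "Unschedulable"}},
    {"name": "hyperpod-node-3", "labels": {}},
]
print(schedulable_nodes(nodes))  # ['hyperpod-node-1']
```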

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set `restartPolicy` to `OnFailure` as follows.

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob 
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations:  # config for job auto-resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
```

**To check the job auto-resume status**

Run the following command to check the status of job auto-resume.

```
kubectl describe pytorchjob -n kubeflow <job-name>
```

Depending on the failure pattern, you might see one of the following two Kubeflow training job restart patterns.

**Pattern 1**:

```
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
```

**Pattern 2**: 

```
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
```