

# Tutorials
<a name="sagemaker-hyperpod-recipes-tutorials"></a>

The following quick-start tutorials help you get started with the training recipes:
+ SageMaker HyperPod with Slurm Orchestration
  + Pre-training
    + [HyperPod Slurm cluster pre-training tutorial (GPU)](hyperpod-gpu-slurm-pretrain-tutorial.md)
    + [Trainium Slurm cluster pre-training tutorial](hyperpod-trainium-slurm-cluster-pretrain-tutorial.md)
  + Fine-tuning
    + [HyperPod Slurm cluster PEFT-Lora tutorial (GPU)](hyperpod-gpu-slurm-peft-lora-tutorial.md)
    + [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md)
+ SageMaker HyperPod with K8s Orchestration
  + Pre-training
    + [Kubernetes cluster pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial.md)
    + [Trainium Kubernetes cluster pre-training tutorial](sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial.md)
+ SageMaker training jobs
  + Pre-training
    + [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)

# HyperPod Slurm cluster pre-training tutorial (GPU)
<a name="hyperpod-gpu-slurm-pretrain-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment to run the recipe, make sure you have:  
+ Set up a HyperPod GPU Slurm cluster.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## HyperPod GPU Slurm environment setup
<a name="hyperpod-gpu-slurm-environment-setup"></a>

To initiate a training job on a HyperPod GPU Slurm cluster, do the following:

1. SSH into the head node of your Slurm cluster.

1. After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   #set up a virtual environment
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
   git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
   cd sagemaker-hyperpod-recipes
   pip3 install -r requirements.txt
   ```

1. Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). To gain a deeper understanding of how to use the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

   ```
   REGION="<region>"
   IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
   aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
   enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
   mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
   ```

1. To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

   ```
   container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
   ```
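
Before moving on to the launch step, it can help to confirm the two things this setup depends on: a Python interpreter at version 3.9 or greater and a non-empty squash file on the shared file system. The following is a minimal sketch, not part of the recipes repository. The function names `python_version_ok` and `squash_file_ok` are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of sagemaker-hyperpod-recipes.

# Return 0 when a "major.minor[.patch]" version string is 3.9 or greater.
# Comparing the parts as integers avoids treating 3.10 as older than 3.9.
python_version_ok() {
  pv_major="${1%%.*}"
  pv_rest="${1#*.}"
  pv_minor="${pv_rest%%.*}"
  if [ "$pv_major" -gt 3 ]; then
    return 0
  fi
  [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 9 ]
}

# Return 0 when the given squash file exists and is not empty.
squash_file_ok() {
  [ -s "$1" ]
}
```

For example, `python_version_ok "$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"` checks the interpreter in the active virtual environment, and `squash_file_ok /fsx/path/to/your/smdistributed-modelparallel.sqsh` checks the path that you set in `recipes_collection/config.yaml`.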

## Launch the training job
<a name="hyperpod-gpu-slurm-launch-training-job"></a>

After you install the dependencies, start a training job using a script from the `sagemaker-hyperpod-recipes/launcher_scripts` directory. You get the dependencies by cloning the [SageMaker HyperPod recipes repository](https://github.com/aws/sagemaker-hyperpod-recipes).

First, pick your training recipe from GitHub. The model name is specified as part of the recipe. The following example uses the `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` script to launch the `llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain` recipe, which pre-trains a Llama 8b model with a sequence length of 16384. Set the following in the launcher script:
+ `IMAGE`: The container from the environment setup section.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf_llama3_8b" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  container="${IMAGE}" \
  +cluster.container_mounts.0="/fsx:/fsx"
```

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).
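
A launch can fail late if a dataset path in the launcher script is empty or mistyped. The following sketch checks the input directories before the `python3` command runs. The helper name `check_dirs` is hypothetical, and the sketch assumes that the dataset directories are visible from the node where you run the launcher script (for example, on the shared `/fsx` file system). The experiment output directory is left out on purpose because it might not exist before the first run.

```shell
#!/bin/sh
# Hypothetical helper; not part of sagemaker-hyperpod-recipes.

# Report every argument that is not an existing directory.
# Returns 0 when all directories exist, 1 otherwise.
check_dirs() {
  cd_missing=0
  for cd_dir in "$@"; do
    if [ ! -d "$cd_dir" ]; then
      echo "Missing directory: ${cd_dir:-<empty>}" >&2
      cd_missing=1
    fi
  done
  return "$cd_missing"
}
```

In the launcher script, a line such as `check_dirs "$TRAIN_DIR" "$VAL_DIR" || exit 1` placed before the `python3` command stops the script before a job is submitted with a bad path.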

# Trainium Slurm cluster pre-training tutorial
<a name="hyperpod-trainium-slurm-cluster-pretrain-tutorial"></a>

The following tutorial sets up a Trainium environment on a Slurm cluster and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ Set up a SageMaker HyperPod Trainium Slurm cluster.
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the Trainium environment on the Slurm Cluster
<a name="hyperpod-trainium-slurm-cluster-pretrain-setup-trainium-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the Neuron environment. For information about setting up Neuron, see [Neuron setup steps](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html#setting-up-the-environment). We recommend relying on the Deep Learning AMIs that come with the Neuron drivers pre-installed, such as [Ubuntu 20 with DLAMI Pytorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html#setup-torch-neuronx-ubuntu20-dlami-pytorch).
+ Clone the SageMaker HyperPod recipes repository to a shared storage location in the cluster. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Go through the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#).
+ Prepare a model configuration. The model configurations are available in the Neuron repo. For the model configuration used in this tutorial, see [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).

## Launch the training job in Trainium
<a name="hyperpod-trainium-slurm-cluster-pretrain-launch-training-job-trainium"></a>

To launch a training job in Trainium, specify a cluster configuration and a Neuron recipe. For example, to launch a llama3 8b pre-training job in Trainium, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`, to the following:
+ `MODEL_CONFIG`: The model config from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash

#Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

COMPILE=0
TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
MODEL_CONFIG="${MODEL_CONFIG}" # Location of config.json for the model

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    instance_type="trn1.32xlarge" \
    recipes.run.compile="$COMPILE" \
    recipes.run.name="hf-llama3-8b" \
    recipes.trainer.num_nodes=4 \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.model.model_config="$MODEL_CONFIG"
```

To launch the training job, run the following command:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).
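
Because the HuggingFace token is optional, one way to handle it in a launch script is to add the `recipes.model.hf_access_token` key-value pair only when a token is available. The following is a minimal sketch under stated assumptions: the helper name `hf_token_override` and the `HF_ACCESS_TOKEN` environment variable are chosen for illustration and aren't defined by this tutorial's script.

```shell
#!/bin/sh
# Hypothetical helper; not part of sagemaker-hyperpod-recipes.

# Print the HuggingFace token override only when HF_ACCESS_TOKEN is set
# and non-empty, so the launcher never receives an empty token argument.
hf_token_override() {
  if [ -n "${HF_ACCESS_TOKEN:-}" ]; then
    printf '%s\n' "recipes.model.hf_access_token=${HF_ACCESS_TOKEN}"
  fi
}
```

In the launch script, you could append `$(hf_token_override)` as the last argument of the `python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py"` command. The expansion is left unquoted so that an empty result adds no argument, which assumes that the token contains no whitespace. Keep in mind that values passed on a command line can be visible in process listings on the node.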

# HyperPod Slurm cluster DPO tutorial (GPU)
<a name="hyperpod-gpu-slurm-dpo-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a direct preference optimization (DPO) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ Set up a HyperPod GPU Slurm cluster.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ A tokenized binary preference dataset in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-dpo-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-dpo-launch-training-job"></a>

To launch a DPO job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

**Note**  
The reference model used for DPO in this setup is automatically derived from the base model being trained (no separate reference model is explicitly defined). DPO-specific hyperparameters are preconfigured with the following default values, which you can override with the `recipes.dpo.beta` and `recipes.dpo.label_smoothing` key-value pairs:  
+ `beta`: 0.1 (controls the strength of KL divergence regularization)
+ `label_smoothing`: 0.0 (no smoothing applied to preference labels)

```
recipes.dpo.beta=${BETA}
recipes.dpo.label_smoothing=${LABEL_SMOOTHING}
```
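
If `BETA` or `LABEL_SMOOTHING` is left unset in the launch script, the two overrides above expand to empty values. One way to avoid that is shell parameter expansion with a fallback to the documented defaults. The following is a minimal sketch; the function name `dpo_overrides` is a hypothetical helper, not part of the recipes repository.

```shell
#!/bin/sh
# Hypothetical helper; not part of sagemaker-hyperpod-recipes.

# Print the two DPO overrides, using the documented defaults
# (beta 0.1, label_smoothing 0.0) for any variable that is unset or empty.
dpo_overrides() {
  printf '%s %s\n' \
    "recipes.dpo.beta=${BETA:-0.1}" \
    "recipes.dpo.label_smoothing=${LABEL_SMOOTHING:-0.0}"
}
```

The same `${VAR:-default}` form can be used directly in the script, for example `BETA="${BETA:-0.1}"`, so that an exported value takes precedence and the default applies otherwise.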

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}"
BETA="${BETA}"
LABEL_SMOOTHING="${LABEL_SMOOTHING}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_dpo \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf_llama3_dpo" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}" \
    recipes.dpo.enabled=true \
    recipes.dpo.beta="${BETA}" \
    recipes.dpo.label_smoothing="${LABEL_SMOOTHING}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster PEFT-Lora tutorial (GPU)
<a name="hyperpod-gpu-slurm-peft-lora-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a parameter-efficient fine-tuning (PEFT) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ Set up a HyperPod GPU Slurm cluster.
  + Your HyperPod Slurm cluster must have Nvidia Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-peft-lora-setup-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-peft-lora-launch-training-job"></a>

To launch a PEFT job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${YOUR_HF_MODEL_NAME_OR_PATH}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job in a GPU Kubernetes cluster:
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ Set up a HyperPod GPU Kubernetes cluster.
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU Kubernetes environment setup
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install dependencies using one of the following methods:
  + (Recommended): [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```
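
The setup steps above depend on several command-line tools being installed. The following sketch lists any that are not on your `PATH`, which can save a failed connection attempt later. The function name `missing_tools` is a hypothetical helper, not part of the HyperPod CLI or the recipes repository.

```shell
#!/bin/sh
# Hypothetical helper; not part of the HyperPod CLI or recipes repository.

# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" > /dev/null 2>&1; then
      echo "$mt_tool"
    fi
  done
}
```

For example, `missing_tools aws kubectl eksctl helm hyperpod` prints nothing when every tool is installed. After you connect to the cluster, `kubectl config current-context` and `kubectl get nodes` are two standard ways to confirm that `kubectl` points at the expected cluster.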

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` recipe.
+ `your_training_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
```
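
The `--override-parameters` value is a JSON object passed as one single-quoted string, so a stray trailing comma or a missing quote is easy to introduce when you edit it. The following sketch checks a string with the `json.tool` module from the Python standard library before you submit the job. The function name `json_ok` is a hypothetical helper, and the sketch assumes that `python3` is on your `PATH`.

```shell
#!/bin/sh
# Hypothetical helper; not part of the HyperPod CLI.

# Return 0 when the argument parses as JSON, non-zero otherwise.
json_ok() {
  printf '%s' "$1" | python3 -m json.tool > /dev/null 2>&1
}
```

One way to use it is to store the overrides in a shell variable, run `json_ok "$OVERRIDES" || echo "invalid JSON"`, and then pass the same variable to `--override-parameters`. The variable name `OVERRIDES` is only an example.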

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

After the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.
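
The status checks above can be scripted by reading the pod phase instead of the `STATUS` column. Kubernetes reports the phase as `Pending`, `Running`, `Succeeded`, `Failed`, or `Unknown`. A pod shown as `ContainerCreating` is still in the `Pending` phase, and a pod shown as `Completed` is in the `Succeeded` phase. The following is a minimal sketch; the function names `pod_phase` and `next_step_for_phase` are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the HyperPod CLI.

# Print the phase of the named pod (Pending, Running, Succeeded, Failed, Unknown).
pod_phase() {
  kubectl get pod "$1" -o jsonpath='{.status.phase}'
}

# Map a pod phase to the follow-up action described in this tutorial.
next_step_for_phase() {
  case "$1" in
    Pending)   echo "describe" ;;  # kubectl describe pod <name-of-pod>
    Running)   echo "logs" ;;      # kubectl logs <name-of-pod>
    Succeeded) echo "done" ;;      # shown as Completed by kubectl get pods
    Failed)    echo "logs" ;;      # inspect the logs for the error
    *)         echo "unknown" ;;
  esac
}
```

For example, `next_step_for_phase "$(pod_phase <name-of-pod>)"` prints `describe` while the pod is still being scheduled or its container is being created.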

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. To use the recipes, update `k8s.yaml` and `config.yaml`, and then run the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. It mounts the Amazon FSx claim to the `/data` directory of each compute pod.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`:
  + `your_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  # Users should set up their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc.
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset

  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir="$TRAIN_DIR" \
      recipes.model.data.val_dir="$VAL_DIR"
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

After the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

# Trainium Kubernetes cluster pre-training tutorial
<a name="sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial"></a>

You can use one of the following methods to start a training job in a Trainium Kubernetes cluster.
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ Set up a HyperPod Trainium Kubernetes cluster.
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your Trainium Kubernetes environment
<a name="sagemaker-hyperpod-trainium-setup-trainium-kubernetes-environment"></a>

To set up the Trainium Kubernetes environment, do the following:

1. Complete the steps in the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#download-the-dataset) starting from **Download the dataset**. 

1. Prepare a model configuration. The model configurations are available in the Neuron repo. For this tutorial, you can use the llama3 8b model config.

1. Set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Install the dependencies using one of the following methods:
   + (Recommended) Install the HyperPod command-line tool:

     ```
     # install HyperPod command line tools
     git clone https://github.com/aws/sagemaker-hyperpod-cli
     cd sagemaker-hyperpod-cli
     pip3 install .
     ```
   + If you're using SageMaker HyperPod recipes, install them instead:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   hyperpod connect-cluster --cluster-name "${CLUSTER_NAME}" [--region "${CLUSTER_REGION}"] [--namespace <namespace>]
   ```

1. Choose the container to use for training: the [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-trainium-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq8k_trn1x4_pretrain` Trainium recipe.
+ `your_neuron_container`: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).
+ `your_model_config`: The model configuration from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
 "cluster": "k8s",
 "cluster_type": "k8s",
 "container": "<your_neuron_container>",
 "recipes.run.name": "hf-llama3",
 "recipes.run.compile": 0,
 "recipes.model.model_config": "<your_model_config>",
 "instance_type": "trn1.32xlarge",
 "recipes.data.train_dir": "<your_train_data_dir>"
}'
```
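If you generate the command from a script, building the `--override-parameters` value with a JSON serializer avoids hand-quoting mistakes. The following is a minimal sketch under that assumption. The helper `build_start_job_command` is hypothetical, and it only assembles the same arguments shown in the preceding command without running them.

```python
import json
import shlex


def build_start_job_command(recipe, overrides, volume_claim="fsx-claim:data"):
    """Return the argument list for the `hyperpod start-job` call shown above."""
    return [
        "hyperpod", "start-job",
        "--recipe", recipe,
        "--persistent-volume-claims", volume_claim,
        "--override-parameters", json.dumps(overrides),
    ]


overrides = {
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "<your_neuron_container>",
    "recipes.run.name": "hf-llama3",
    "recipes.run.compile": 0,
    "recipes.model.model_config": "<your_model_config>",
    "instance_type": "trn1.32xlarge",
    "recipes.data.train_dir": "<your_train_data_dir>",
}

command = build_start_job_command(
    "training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain", overrides
)
# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(command))
```

You can also pass `command` to `subprocess.run` as a list, which skips shell quoting entirely.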

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.
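To wait for the job without rerunning the command by hand, you can poll the pod phase. The following is a minimal sketch, assuming `kubectl` is configured for your cluster. The helper names, the polling interval, and the check limit are hypothetical. Note that Kubernetes reports the pod phase as `Succeeded` when `kubectl get pods` displays `Completed`.

```python
import subprocess
import time


def kubectl_pod_phase(pod_name):
    """Read the pod phase with kubectl (requires kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_pod(pod_name, get_phase=kubectl_pod_phase, interval=30.0,
                 max_checks=120, sleep=time.sleep):
    """Poll until the pod reaches a terminal phase or max_checks is exhausted."""
    phase = "Unknown"
    for attempt in range(max_checks):
        phase = get_phase(pod_name)
        if phase in ("Succeeded", "Failed"):
            break
        if attempt < max_checks - 1:
            sleep(interval)
    return phase
```

For example, `wait_for_pod("name_of_pod")` returns the last observed phase, so a return value other than `Succeeded` or `Failed` means the check limit was reached first.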

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-launch-training-job-recipes"></a>

Alternatively, use SageMaker HyperPod recipes to submit your training job. To submit the training job using a recipe, update `k8s.yaml` and `config.yaml`. Run the bash script for the model to launch it.
+ In `k8s.yaml`, update `persistent_volume_claims` to mount the Amazon FSx claim to the `/data` directory in the compute nodes.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`
  + `your_neuron_container`: The container from the environment setup section
  + `your_model_config`: The model config from the environment setup section

  (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

  ```
  #!/bin/bash
  # Users should set up their cluster type in /recipes_collection/config.yaml
  IMAGE="<your_neuron_container>"
  MODEL_CONFIG="<your_model_config>"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3-8b" \
    instance_type=trn1.32xlarge \
    recipes.model.model_config="$MODEL_CONFIG" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.data.val_dir="$VAL_DIR"
  ```
+ Launch the job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
  ```
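The launcher script passes its settings as dotted `key=value` arguments. If you keep those settings in a nested structure, a sketch like the following can flatten them into that form. The helper `to_hydra_overrides` is hypothetical and isn't part of the recipes repository. It doesn't quote or escape values, so it's only suitable for simple strings and numbers.

```python
def to_hydra_overrides(config, prefix=""):
    """Flatten a nested dict into dotted `key=value` command-line overrides."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_hydra_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


settings = {
    "recipes": {
        "run": {"name": "hf-llama3-8b"},
        "data": {"train_dir": "<your_training_data_dir>"},
    },
    "cluster": "k8s",
    "cluster_type": "k8s",
}
args = to_hydra_overrides(settings)
print(args)
```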

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

When the job finishes, `kubectl get pods` shows a `STATUS` of `Completed`.

For more information about the k8s cluster configuration, see [Trainium Kubernetes cluster pre-training tutorial](#sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial).

# SageMaker training jobs pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with GPU instances.
+ Set up your environment
+ Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A service quota for one ml.p4d.24xlarge instance and one ml.p5.48xlarge instance on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to AWS services.
  1. Choose **Amazon SageMaker AI**.
  1. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies to give SageMaker AI permissions to run the examples:
  + AmazonSageMakerFullAccess
  + AmazonEC2FullAccess
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU SageMaker training jobs environment setup
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, configure your AWS credentials and preferred Region by running the `aws configure` command. As an alternative, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).
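If you rely on environment variables, a quick check like the following can show which of them are set without printing their values. This is a minimal sketch with hypothetical helper names. It treats the access key ID and secret access key as the required pair, because the session token applies only to temporary credentials.

```python
import os

CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def credential_env_status(environ=os.environ):
    """Map each credential variable to True when it is set and non-empty."""
    return {name: bool(environ.get(name)) for name in CREDENTIAL_VARS}


def has_static_credentials(environ=os.environ):
    """Return True when both the access key ID and secret access key are set."""
    status = credential_env_status(environ)
    return status["AWS_ACCESS_KEY_ID"] and status["AWS_SECRET_ACCESS_KEY"]


if __name__ == "__main__":
    # Only the presence of each variable is reported, never its value.
    for name, is_set in credential_env_status().items():
        print(f"{name}: {'set' if is_set else 'not set'}")
```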

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # clone the recipes repository and install its dependencies
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt

  # set the AWS Region
  aws configure set region <your_region>
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ `Container`: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK. For example, append it if you're using it in a notebook in SageMaker AI JupyterLab.

  If you are using HyperPod recipes to launch using cluster type `sm_jobs`, this will be done automatically.
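When you do need to add the pin yourself, a small helper can append it only if `requirements.txt` doesn't already list `transformers`. The following is a minimal sketch. The helper `ensure_requirement` is hypothetical, and it leaves an existing `transformers` entry untouched rather than checking whether that entry satisfies the minimum version.

```python
import re
from pathlib import Path


def ensure_requirement(requirements_path, requirement="transformers==4.45.2"):
    """Append the pin unless the package is already listed. Return True if added."""
    path = Path(requirements_path)
    lines = path.read_text().splitlines() if path.exists() else []
    package = requirement.split("==")[0]
    # Match the package name followed by an extra, a version specifier, or end of line.
    pattern = re.compile(rf"^\s*{re.escape(package)}\s*(\[|[=<>!~]|$)")
    if any(pattern.match(line) for line in lines):
        return False  # leave an existing entry untouched
    lines.append(requirement)
    path.write_text("\n".join(lines) + "\n")
    return True
```

For example, `ensure_requirement("<source_dir>/requirements.txt")` adds the line on the first call and does nothing on later calls.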

## Launch the training job using a Jupyter Notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.

```
import os

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 locations for TensorBoard logs and training artifacts
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

# Recipe settings to override; the paths are locations inside the training container
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

# Each channel name maps to a directory under /opt/ml/input/data in the container
estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.
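SageMaker training jobs make each input channel available in the container at `/opt/ml/input/data/<channel_name>`, which is why the data directories in the overrides point there. The following is a minimal sketch that checks whether the override paths line up with the channel names you pass to `fit()`. The helper names and the placeholder S3 paths are hypothetical.

```python
def channel_path(channel_name):
    """Container directory where SageMaker makes an input channel available."""
    return f"/opt/ml/input/data/{channel_name}"


def unmatched_data_dirs(data_overrides, inputs):
    """Return override keys whose path is not backed by a channel in `inputs`."""
    available = {channel_path(name) for name in inputs}
    return [key for key, path in data_overrides.items() if path not in available]


data_overrides = {
    "train_dir": "/opt/ml/input/data/train",
    "val_dir": "/opt/ml/input/data/val",
}
inputs = {"train": "<s3_train_data_path>", "val": "<s3_val_data_path>"}

# An empty list means every data directory has a matching channel.
print(unmatched_data_dirs(data_overrides, inputs))
```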

**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy the endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following is an example of how an inference image can be used for deployment:

```
from sagemaker import image_uris
container=image_uris.retrieve(framework='pytorch',region='us-west-2',version='2.0',py_version='py310',image_scope='inference', instance_type='ml.p4d.24xlarge')
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.p4d.24xlarge',image_uri=container)
```

**Note**  
Running the preceding code on a SageMaker notebook instance might need more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run out of space, create a new notebook instance with increased storage.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file to look like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```
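The comments in the file state two constraints: set either `s3` or `file_system` under `inputs`, and keep `additional_estimator_kwargs` values to integers, floats, or strings. The following is a minimal sketch that checks both. It assumes the YAML file has already been parsed into a dictionary (for example, with a YAML library), and the helper `check_sm_jobs_config` is hypothetical.

```python
def check_sm_jobs_config(sm_jobs_config):
    """Return a list of problems found in a parsed sm_jobs_config mapping."""
    problems = []
    inputs = sm_jobs_config.get("inputs") or {}
    if inputs.get("s3") and inputs.get("file_system"):
        problems.append("inputs: set either s3 or file_system, not both")
    kwargs = sm_jobs_config.get("additional_estimator_kwargs") or {}
    for key, value in kwargs.items():
        # bool is a subclass of int in Python, so True and False pass this check.
        if not isinstance(value, (int, float, str)):
            problems.append(f"additional_estimator_kwargs.{key}: unsupported type")
    return problems


example = {
    "inputs": {"s3": {"train": "<s3_train_data_path>", "val": None}},
    "additional_estimator_kwargs": {"max_run": 180000, "enable_remote_debug": True},
}
print(check_sm_jobs_config(example))
```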

Update `./recipes_collection/config.yaml` to specify `sm_jobs` in the `cluster` and `cluster_type`.

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```
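Because `cluster_type` must match the `cluster` entry for `k8s` and `sm_jobs`, a mismatch between the two is an easy mistake to make. The following is a minimal sketch that checks the pair. It assumes the file has already been parsed into a dictionary, the helper names are hypothetical, and other cluster types aren't checked.

```python
def selected_default(config, group):
    """Return the value chosen for `group` in the defaults list, or None."""
    for entry in config.get("defaults", []):
        if isinstance(entry, dict) and group in entry:
            return entry[group]
    return None


def cluster_settings_match(config):
    """Check that cluster_type equals the cluster default for k8s and sm_jobs."""
    cluster_type = config.get("cluster_type")
    if cluster_type not in ("k8s", "sm_jobs"):
        return True  # other cluster types are not checked in this sketch
    return selected_default(config, "cluster") == cluster_type


config = {
    "defaults": ["_self_", {"cluster": "sm_jobs"}, {"recipes": "<your_recipe>"}],
    "cluster_type": "sm_jobs",
}
print(cluster_settings_match(config))
```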

Launch the job with the following command:

```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.

# Trainium SageMaker training jobs pre-training tutorial
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
+ Set up your environment
+ Launch a training job

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A service quota for the `ml.trn1.32xlarge` instance on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. Navigate to the AWS Service Quotas console.
  1. Choose AWS services.
  1. Select JupyterLab.
  1. Specify one instance for `ml.trn1.32xlarge`.
+ An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess` and `AmazonEC2FullAccess` managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your environment for Trainium SageMaker training jobs
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, use the `aws configure` command to configure your AWS credentials and preferred Region. As an alternative, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you're using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # clone the recipes repository and install its dependencies
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.
  + Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK.
  + If you're using HyperPod recipes to launch using `sm_jobs` as the cluster type, you don't have to specify the `transformers` version.
+ `Container`: The Neuron container is set automatically by the SageMaker AI Python SDK.

## Launch the training job with a Jupyter Notebook
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job using your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe as a SageMaker AI Training Job.
+ `compiler_cache_url`: Cache to be used to save the compiled artifacts, such as an Amazon S3 artifact.

```
import os

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 locations for TensorBoard logs and training artifacts
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

# Recipe settings to override; the paths are locations inside the training container
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>"
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-recipes"></a>
+ Update `./recipes_collection/cluster/sm_jobs.yaml`
  + `compiler_cache_url`: The URL used to save the artifacts. It can be an Amazon S3 URL.

  ```
  sm_jobs_config:
    output_path: <s3_output_path>
    tensorboard_config:
      output_path: <s3_output_path>
      container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
    wait: True  # Whether to wait for training job to finish
    inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
      s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
        train: <s3_train_data_path>
        val: null
    additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
      max_run: 180000
      image_uri: <your_image_uri>
      enable_remote_debug: True
      py_version: py39
    recipe_overrides:
      model:
        exp_manager:
          exp_dir: <exp_dir>
        data:
          train_dir: /opt/ml/input/data/train
          val_dir: /opt/ml/input/data/val
  ```
+ Update `./recipes_collection/config.yaml`

  ```
  defaults:
    - _self_
    - cluster: sm_jobs
    - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
  cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
  
  instance_type: ml.trn1.32xlarge
  base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
  ```
+ Launch the job with `main.py`

  ```
  python3 main.py --config-path recipes_collection --config-name config
  ```

For more information about configuring SageMaker training jobs, see [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md).