

# SageMaker HyperPod recipes
<a name="sagemaker-hyperpod-recipes"></a>

Amazon SageMaker HyperPod recipes are pre-configured training stacks provided by AWS to help you quickly start training and fine-tuning publicly available foundation models (FMs) from various model families such as Llama, Mistral, Mixtral, or DeepSeek. Recipes automate the end-to-end training loop, including loading datasets, applying distributed training techniques, and managing checkpoints for faster recovery from faults. 

SageMaker HyperPod recipes are particularly beneficial for users who may not have deep machine learning expertise, as they abstract away much of the complexity involved in training large models.

You can run recipes within SageMaker HyperPod or as SageMaker training jobs.

The following tables are maintained in the SageMaker HyperPod GitHub repository and provide the most up-to-date information on the models supported for pre-training and fine-tuning, their respective recipes and launch scripts, supported instance types, and more.
+ For the most current list of supported models, recipes, and launch scripts for pre-training, see the [pre-training table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#pre-training).
+ For the most current list of supported models, recipes, and launch scripts for fine-tuning, see the [fine-tuning table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#fine-tuning).

For SageMaker HyperPod users, the automation of end-to-end training workflows comes from the integration of the training adapter with SageMaker HyperPod recipes. The training adapter is built on the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) and the [Neuronx Distributed Training package](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). If you're familiar with using NeMo, the process of using the training adapter is the same. The training adapter runs the recipe on your cluster.

![\[Diagram showing SageMaker HyperPod recipe workflow. A "Recipe" icon at the top feeds into a "HyperPod recipe launcher" box. This box connects to a larger section labeled "Cluster: Slurm, K8s, ..." containing three GPU icons with associated recipe files. The bottom of the cluster section is labeled "Train with HyperPod Training Adapter".\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipes-overview.png)


You can also train your own model by defining your own custom recipe.

To get started with a tutorial, see [Tutorials](sagemaker-hyperpod-recipes-tutorials.md).

**Topics**
+ [Tutorials](sagemaker-hyperpod-recipes-tutorials.md)
+ [Default configurations](default-configurations.md)
+ [Cluster-specific configurations](cluster-specific-configurations.md)
+ [Considerations](cluster-specific-configurations-special-considerations.md)
+ [Advanced settings](cluster-specific-configurations-advanced-settings.md)
+ [Appendix](appendix.md)

# Tutorials
<a name="sagemaker-hyperpod-recipes-tutorials"></a>

The following quick-start tutorials help you get started with using the recipes for training:
+ SageMaker HyperPod with Slurm Orchestration
  + Pre-training
    + [HyperPod Slurm cluster pre-training tutorial (GPU)](hyperpod-gpu-slurm-pretrain-tutorial.md)
    + [Trainium Slurm cluster pre-training tutorial](hyperpod-trainium-slurm-cluster-pretrain-tutorial.md)
  + Fine-tuning
    + [HyperPod Slurm cluster PEFT-Lora tutorial (GPU)](hyperpod-gpu-slurm-peft-lora-tutorial.md)
    + [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md)
+ SageMaker HyperPod with K8s Orchestration
  + Pre-training
    + [Kubernetes cluster pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial.md)
    + [Trainium Kubernetes cluster pre-training tutorial](sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial.md)
+ SageMaker training jobs
  + Pre-training
    + [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)

# HyperPod Slurm cluster pre-training tutorial (GPU)
<a name="hyperpod-gpu-slurm-pretrain-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment to run the recipe, make sure you have:
+ A HyperPod GPU Slurm cluster.
+ NVIDIA Enroot and Pyxis enabled on your HyperPod Slurm cluster (they are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).
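As a quick sanity check before you configure the recipe, you can test a dataset file name against the supported formats. The helper below is a hypothetical sketch (not part of the recipes repository), and the extension mapping, including `.json.gz` for JSONGZ, is an assumption:

```shell
# Hypothetical helper: map a dataset file name to one of the supported
# formats (JSON, JSONGZ, ARROW) based on its extension.
dataset_format() {
  case "$1" in
    *.json.gz) echo "jsongz" ;;      # compressed JSON (assumed extension)
    *.json)    echo "json" ;;
    *.arrow)   echo "arrow" ;;
    *)         echo "unsupported" ;;
  esac
}
```

For example, `dataset_format train.json.gz` prints `jsongz`; note that the `*.json.gz` pattern must be matched before `*.json`.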

## HyperPod GPU Slurm environment setup
<a name="hyperpod-gpu-slurm-environment-setup"></a>

To initiate a training job on a HyperPod GPU Slurm cluster, do the following:

1. SSH into the head node of your Slurm cluster.

1. After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   #set up a virtual environment
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
   git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
   cd sagemaker-hyperpod-recipes
   pip3 install -r requirements.txt
   ```

1. Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). To gain a deeper understanding of how to use the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

   ```
   REGION="<region>"
   IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
   aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
   enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
   mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
   ```

1. To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

   ```
   container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
   ```
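Before moving on, it can help to confirm that the `container:` entry in `recipes_collection/config.yaml` points at a squash file that actually exists on the shared filesystem. The following is a minimal sketch; it assumes the single-line `container:` entry shown above:

```shell
# Sketch: read the container: entry from a config file and verify that the
# squash file it names exists.
container_path() {
  sed -n 's/^container:[[:space:]]*//p' "$1"
}

check_container() {
  path="$(container_path "$1")"
  if [ -f "$path" ]; then
    echo "ok: $path"
  else
    echo "missing: $path"
  fi
}
```

Running `check_container recipes_collection/config.yaml` prints `ok: …` or `missing: …` for the configured path.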

## Launch the training job
<a name="hyperpod-gpu-slurm-launch-training-job"></a>

After you install the dependencies, start a training job from the `sagemaker-hyperpod-recipes/launcher_scripts` directory. You get the dependencies by cloning the [SageMaker HyperPod recipes repository](https://github.com/aws/sagemaker-hyperpod-recipes).

First, pick your training recipe from GitHub; the model name is specified as part of the recipe. In the following example, we use the `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` script to launch the Llama 8b pre-training recipe with a sequence length of 16,384, `llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain`.
+ `IMAGE`: The container from the environment setup section.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf_llama3_8b" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  container="${IMAGE}" \
  +cluster.container_mounts.0="/fsx:/fsx"
```

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```
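If a variable such as `TRAIN_DIR` or `EXP_DIR` is left unset, the launcher receives empty paths and the job fails later rather than at submission. A small guard at the top of the launch script can catch this early; this is an illustrative sketch, with variable names taken from the example above:

```shell
# Sketch: report any required variables that are unset or empty, and
# return nonzero if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example guard before invoking main.py:
# require_vars IMAGE TRAIN_DIR VAL_DIR EXP_DIR || exit 1
```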

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Trainium Slurm cluster pre-training tutorial
<a name="hyperpod-trainium-slurm-cluster-pretrain-tutorial"></a>

The following tutorial sets up a Trainium environment on a Slurm cluster and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A SageMaker HyperPod Trainium Slurm cluster.
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the Trainium environment on the Slurm Cluster
<a name="hyperpod-trainium-slurm-cluster-pretrain-setup-trainium-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the Neuron environment. For information about setting up Neuron, see [Neuron setup steps](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html#setting-up-the-environment). We recommend using the Deep Learning AMIs that come with Neuron's drivers pre-installed, such as [Ubuntu 20 with DLAMI Pytorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html#setup-torch-neuronx-ubuntu20-dlami-pytorch).
+ Clone the SageMaker HyperPod recipes repository to a shared storage location in the cluster. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Go through the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#)
+ Prepare a model configuration. The model configurations are available in the Neuron repository. For the model configuration used in this tutorial, see [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).
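One way to place the model config on shared storage is to download it from the linked repository. GitHub serves the file itself from its raw-content host, and the small helper below converts a `blob` page URL to that form; the helper and the target path are illustrative, and the download line assumes network access from the head node:

```shell
# Illustrative helper: convert a GitHub "blob" page URL into the
# corresponding raw-content URL.
to_raw_url() {
  printf '%s\n' "$1" | sed -e 's#//github.com#//raw.githubusercontent.com#' -e 's#/blob/#/#'
}

# Example (run on the head node; target path on the shared filesystem):
# curl -L -o /fsx/llama3_8b_config.json \
#   "$(to_raw_url "https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json")"
```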

## Launch the training job in Trainium
<a name="hyperpod-trainium-slurm-cluster-pretrain-launch-training-job-trainium"></a>

To launch a training job in Trainium, specify a cluster configuration and a Neuron recipe. For example, to launch a llama3 8b pre-training job in Trainium, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`, to the following:
+ `MODEL_CONFIG`: The model config from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash

#Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

COMPILE=0
TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
MODEL_CONFIG="${MODEL_CONFIG}" # Location of config.json for the model

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    instance_type="trn1.32xlarge" \
    recipes.run.compile="$COMPILE" \
    recipes.run.name="hf-llama3-8b" \
    recipes.trainer.num_nodes=4 \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.model.model_config="$MODEL_CONFIG"
```

To launch the training job, run the following command:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster DPO tutorial (GPU)
<a name="hyperpod-gpu-slurm-dpo-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a direct preference optimization (DPO) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod GPU Slurm cluster.
+ NVIDIA Enroot and Pyxis enabled on your HyperPod Slurm cluster (they are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ A tokenized binary preference dataset in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you need the pre-trained weights from HuggingFace or you're training a Llama 3.2 model. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-dpo-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-dpo-launch-training-job"></a>

To launch a DPO job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

**Note**  
The reference model used for DPO in this setup is automatically derived from the base model being trained (no separate reference model is explicitly defined). DPO-specific hyperparameters are preconfigured with the following default values:
+ `beta`: 0.1 (controls the strength of KL divergence regularization)
+ `label_smoothing`: 0.0 (no smoothing applied to preference labels)

You can override these defaults by setting the following key-value pairs in the launch script:

```
recipes.dpo.beta=${BETA}
recipes.dpo.label_smoothing=${LABEL_SMOOTHING}
```
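For reference, `beta` and `label_smoothing` correspond to β and ε in the label-smoothed (conservative) form of the DPO objective. The formulation below is the standard one from the DPO literature, not something defined by the recipe itself:

```
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      (1-\varepsilon)\,\log \sigma\!\big(\beta\, h_\theta\big)
      + \varepsilon\,\log \sigma\!\big(-\beta\, h_\theta\big)
    \right],
\quad
h_\theta
  = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```

With ε = 0 this reduces to the standard DPO loss, and larger β keeps the trained policy closer to the reference model (here, the base model being trained).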

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}"
BETA="${BETA}"
LABEL_SMOOTHING="${LABEL_SMOOTHING}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_dpo \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_dpo" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}" \
    recipes.dpo.enabled=true \
    recipes.dpo.beta="${BETA}" \
    recipes.dpo.label_smoothing="${LABEL_SMOOTHING}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster PEFT-Lora tutorial (GPU)
<a name="hyperpod-gpu-slurm-peft-lora-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a parameter-efficient fine-tuning (PEFT) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod GPU Slurm cluster.
+ NVIDIA Enroot and Pyxis enabled on your HyperPod Slurm cluster (they are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you need the pre-trained weights from HuggingFace or you're training a Llama 3.2 model. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-peft-lora-setup-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-peft-lora-launch-training-job"></a>

To launch a PEFT job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```
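`HF_MODEL_NAME_OR_PATH` accepts either a HuggingFace model ID or a filesystem path to already-downloaded weights. If you want to make that distinction explicit in your own wrapper scripts, a hypothetical helper (not part of the recipes repository) might look like:

```shell
# Hypothetical helper: classify hf_model_name_or_path as a local directory
# of weights or a HuggingFace Hub model ID.
classify_model_ref() {
  if [ -d "$1" ]; then
    echo "local-path"
  else
    echo "hub-id"
  fi
}
```

For example, a directory such as `/fsx/models/llama3-8b` is classified as `local-path`, while an ID like `meta-llama/Meta-Llama-3-8B` falls through to `hub-id`.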

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${YOUR_HF_MODEL_NAME_OR_PATH}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job in a GPU Kubernetes cluster:
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A properly set up HyperPod GPU Kubernetes cluster.
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU Kubernetes environment setup
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install dependencies using one of the following methods:
  + (Recommended): [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job using the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` recipe.
+ `your_training_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
```
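Because `--override-parameters` takes one JSON string, a stray quote or trailing comma surfaces only at submission time. One way to catch that earlier, sketched below with an illustrative file name and a reduced set of keys, is to keep the overrides in a file and validate the JSON before passing it to the CLI:

```shell
# Keep override parameters in a file and confirm the JSON parses before
# submitting the job.
cat > overrides.json <<'EOF'
{
  "recipes.run.name": "hf-llama3-8b",
  "cluster": "k8s",
  "cluster_type": "k8s"
}
EOF

python3 -m json.tool overrides.json > /dev/null && echo "overrides.json is valid JSON"

# Then submit with:
# hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
#   --override-parameters "$(cat overrides.json)"
```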

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.
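If you're tracking several pods, you can pull just the `STATUS` column for a named pod out of the `kubectl get pods` table. The `awk` helper below is an illustrative sketch that parses the table text on stdin, so it also works on saved output:

```shell
# Sketch: print the STATUS column for the named pod, reading
# `kubectl get pods` output on stdin.
pod_status() {
  awk -v pod="$1" '$1 == pod { print $3 }'
}

# Example: kubectl get pods | pod_status hf-llama3-<your-alias>-worker-0
```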

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. The following example mounts the Amazon FSx claim to the `/data` directory of each computing pod.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`
  + `your_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  #Users should set up their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc.
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir=$TRAIN_DIR \
      recipes.model.data.val_dir=$VAL_DIR
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

# Trainium Kubernetes cluster pre-training tutorial
<a name="sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial"></a>

You can use one of the following methods to start a training job in a Trainium Kubernetes cluster.
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod Trainium Kubernetes cluster.
+ A shared storage location, such as an Amazon FSx file system or an NFS system, that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).
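
The training data can be supplied as JSON, JSONGZ, or ARROW. As a minimal sketch (the file name and records below are examples, not part of any recipe), you can convert a plain JSON-lines dataset to the compressed JSONGZ format with Python's standard library:

```
import gzip
import json

# Example records in the JSON-lines style commonly used for training data.
records = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "SageMaker HyperPod recipes automate the training loop."},
]

# Write the records as JSONGZ (gzip-compressed JSON lines).
with gzip.open("train.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the compressed file back to verify the round trip.
with gzip.open("train.jsonl.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```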

## Set up your Trainium Kubernetes environment
<a name="sagemaker-hyperpod-trainium-setup-trainium-kubernetes-environment"></a>

To set up the Trainium Kubernetes environment, do the following:

1. Complete the steps in the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#download-the-dataset) starting from **Download the dataset**. 

1. Prepare a model configuration. Model configurations are available in the Neuron repository. For this tutorial, you can use the Llama 3 8B model configuration.

1. Set up a virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Install the dependencies
   + (Recommended) Use the following HyperPod command-line tool

     ```
     # install HyperPod command line tools
     git clone https://github.com/aws/sagemaker-hyperpod-cli
     cd sagemaker-hyperpod-cli
     pip3 install .
     ```
   + If you're using SageMaker HyperPod recipes, specify the following

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   hyperpod connect-cluster --cluster-name "${CLUSTER_NAME}" [--region "${CLUSTER_REGION}"] [--namespace <namespace>]
   ```

1. Container: Use the [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx) as your training container.

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-trainium-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job using the `hf_llama3_8b_seq8k_trn1x4_pretrain` Trainium recipe.
+ `your_neuron_container`: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).
+ `your_model_config`: The model configuration from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
 "cluster": "k8s",
 "cluster_type": "k8s",
 "container": "<your_neuron_container>",
 "recipes.run.name": "hf-llama3",
 "recipes.run.compile": 0,
 "recipes.model.model_config": "<your_model_config>",
 "instance_type": "trn1.32xlarge",
 "recipes.data.train_dir": "<your_train_data_dir>"
}'
```

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the training job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-launch-training-job-recipes"></a>

Alternatively, use SageMaker HyperPod recipes to submit your training job. To submit the training job using a recipe, update `k8s.yaml` and `config.yaml`. Run the bash script for the model to launch it.
+ In `k8s.yaml`, update `persistent_volume_claims` to mount the Amazon FSx claim to the `/data` directory in the compute nodes

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`
  + `your_neuron_container`: The container from the environment setup section
  + `your_model_config`: The model configuration from the environment setup section

  (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

  ```
  #!/bin/bash
  # Users should set up their cluster type in /recipes_collection/config.yaml
  IMAGE="<your_neuron_container>"
  MODEL_CONFIG="<your_model_config>"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3-8b" \
    instance_type=trn1.32xlarge \
    recipes.model.model_config="$MODEL_CONFIG" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.data.train_dir=$TRAIN_DIR \
    recipes.data.val_dir=$VAL_DIR
  ```
+ Launch the job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
  ```

After you've submitted a training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs <name-of-pod>
```

When the training job finishes, the `STATUS` shown by `kubectl get pods` changes to `Completed`.

For more information about the k8s cluster configuration, see [Trainium Kubernetes cluster pre-training tutorial](#sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial).

# SageMaker training jobs pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with GPU instances.
+ Set up your environment
+ Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ Requested a Service Quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to **AWS services**.
  1. Choose **Amazon SageMaker AI**.
  1. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies to give SageMaker AI permissions to run the examples:
  + `AmazonSageMakerFullAccess`
  + `AmazonEC2FullAccess`
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU SageMaker training jobs environment setup
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the `aws configure` command. Alternatively, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).
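
For example, the environment-variable alternative looks like the following sketch (the values shown are placeholders, not real credentials):

```
# Alternative to `aws configure`: supply credentials through environment variables.
# Replace the placeholder values with your own credentials.
export AWS_ACCESS_KEY_ID="<your_access_key_id>"
export AWS_SECRET_ACCESS_KEY="<your_secret_access_key>"
export AWS_SESSION_TOKEN="<your_session_token>"   # only needed for temporary credentials
export AWS_DEFAULT_REGION="us-west-2"             # example region
```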

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  # Set the aws region.
  
  aws configure set region <your_region>
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ `Container`: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK. For example, append it if you're using it in a notebook in SageMaker AI JupyterLab.

  If you're using HyperPod recipes to launch with cluster type `sm_jobs`, this is done automatically.
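
  For example, one way to add the pin (assuming a `requirements.txt` in your `source_dir`; the file path here is illustrative):

  ```
  # Append the pinned transformers version to the requirements file in source_dir.
  echo "transformers==4.45.2" >> requirements.txt

  # Confirm that the pin is present.
  grep "transformers==4.45.2" requirements.txt
  ```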

## Launch the training job using a Jupyter Notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.

```
import os
import sagemaker,boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket() 
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name=f"llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.

**Note**  
If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or greater.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy the endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following example shows how to use an inference image for deployment:

```
from sagemaker import image_uris
container = image_uris.retrieve(
    framework='pytorch',
    region='us-west-2',
    version='2.0',
    py_version='py310',
    image_scope='inference',
    instance_type='ml.p4d.24xlarge',
)
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.p4d.24xlarge',
    image_uri=container,
)
```

**Note**  
Running the preceding code on a SageMaker notebook instance might require more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run into storage issues, create a new notebook instance with increased storage.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file to look like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```

Update `./recipes_collection/config.yaml` to specify `sm_jobs` in the `cluster` and `cluster_type`.

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```

Launch the job with the following command

```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.

# Trainium SageMaker training jobs pre-training tutorial
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
+ Set up your environment
+ Launch a training job

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ Requested a Service Quota for the `ml.trn1.32xlarge` instance on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. Navigate to the AWS Service Quotas console.
  1. Choose **AWS services**.
  1. Select JupyterLab.
  1. Specify one instance for `ml.trn1.32xlarge`.
+ An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess` and `AmazonEC2FullAccess` managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your environment for Trainium SageMaker training jobs
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, use the `aws configure` command to configure your AWS credentials and preferred region. Alternatively, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you're using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.
  + Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK directly.
  + If you're using HyperPod recipes to launch with `sm_jobs` as the cluster type, you don't have to specify the transformers version.
+ `Container`: The Neuron container is set automatically by the SageMaker AI Python SDK.

## Launch the training job with a Jupyter Notebook
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job using your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe as a SageMaker AI Training Job.
+ `compiler_cache_url`: The cache location used to save the compiled artifacts, such as an Amazon S3 URL.

```
import os
import sagemaker,boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>"
} 

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name=f"llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-recipes"></a>
+ Update `./recipes_collection/cluster/sm_jobs.yaml`
  + `compiler_cache_url`: The URL used to save the artifacts. It can be an Amazon S3 URL.

  ```
  sm_jobs_config:
    output_path: <s3_output_path>
    tensorboard_config:
      output_path: <s3_output_path>
      container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
    wait: True  # Whether to wait for training job to finish
    inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
      s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
        train: <s3_train_data_path>
        val: null
    additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
      max_run: 180000
      image_uri: <your_image_uri>
      enable_remote_debug: True
      py_version: py39
    recipe_overrides:
      exp_manager:
        exp_dir: <exp_dir>
      data:
        train_dir: /opt/ml/input/data/train
        val_dir: /opt/ml/input/data/val
  ```
+ Update `./recipes_collection/config.yaml`

  ```
  defaults:
    - _self_
    - cluster: sm_jobs
    - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
  cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
  
  instance_type: ml.trn1.32xlarge
  base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
  ```
+ Launch the job with `main.py`

  ```
  python3 main.py --config-path recipes_collection --config-name config
  ```

For more information about configuring SageMaker training jobs, see [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md).

# Default configurations
<a name="default-configurations"></a>

This section outlines the essential components and settings required to initiate and customize your Large Language Model (LLM) training processes using SageMaker HyperPod. This section covers the key repositories, configuration files, and recipe structures that form the foundation of your training jobs. Understanding these default configurations is crucial for effectively setting up and managing your LLM training workflows, whether you're using pre-defined recipes or customizing them to suit your specific needs.

**Topics**
+ [GitHub repositories](github-repositories.md)
+ [General configuration](sagemaker-hyperpod-recipes-general-configuration.md)

# GitHub repositories
<a name="github-repositories"></a>

To launch a training job, you utilize files from two distinct GitHub repositories:
+ [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes)
+ [SageMaker HyperPod training adapter for NeMo](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo)

These repositories contain essential components for initiating, managing, and customizing Large Language Model (LLM) training processes. You use the scripts from the repositories to set up and run the training jobs for your LLMs.

## HyperPod recipe repository
<a name="sagemaker-hyperpod-recipe-repository"></a>

Use the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes) repository to get a recipe.

1. `main.py`: This file serves as the primary entry point for initiating the process of submitting a training job to either a cluster or a SageMaker training job.

1. `launcher_scripts`: This directory contains a collection of commonly used scripts designed to facilitate the training process for various Large Language Models (LLMs).

1. `recipes_collection`: This folder houses a compilation of pre-defined LLM recipes provided by the developers. Users can leverage these recipes in conjunction with their custom data to train LLM models tailored to their specific requirements.

You use the SageMaker HyperPod recipes to launch training or fine-tuning jobs. Regardless of the cluster you're using, the process of submitting the job is the same. For example, you can use the same script to submit a job to a Slurm or Kubernetes cluster. The launcher dispatches a training job based on three configuration files:

1. General Configuration (`config.yaml`): Includes common settings such as the default parameters or environment variables used in the training job.

1. Cluster Configuration (cluster): For training jobs using clusters only. If you're submitting a training job to a Kubernetes cluster, you might need to specify information such as volume, label, or restart policy. For Slurm clusters, you might need to specify the Slurm job name. All the parameters are related to the specific cluster that you're using.

1. Recipe (recipes): Recipes contain the settings for your training job, such as the model types, sharding degree, or dataset paths. For example, you can specify Llama as your training model and train it using model or data parallelism techniques like Fully Sharded Distributed Parallel (FSDP) across eight machines. You can also specify different checkpoint frequencies or paths for your training job.

After you've specified a recipe, you run the launcher script to start an end-to-end training job on a cluster based on the configurations through the `main.py` entry point. For each recipe that you use, there are accompanying shell scripts located in the `launcher_scripts` folder. These examples guide you through submitting and initiating training jobs. The following figure illustrates how a SageMaker HyperPod recipe launcher submits a training job to a cluster based on the preceding configurations. Currently, the SageMaker HyperPod recipe launcher is built on top of the NVIDIA NeMo Framework Launcher. For more information, see [NeMo Launcher Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).

![\[Diagram illustrating the HyperPod recipe launcher workflow. On the left, inside a dashed box, are three file icons labeled "Recipe", "config.yaml", and "slurm.yaml or k8s.yaml or sm_job.yaml (Cluster config)". An arrow points from this box to a central box labeled "HyperPod recipe Launcher". From this central box, another arrow points right to "Training Job", with "main.py" written above the arrow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipe-launcher.png)
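
Conceptually, the launcher combines the three configuration layers into one job configuration, with cluster and recipe values layered over the general defaults. The following is a simplified Python sketch of that merge using plain dictionaries — an illustration only, not the actual Hydra-based implementation the launcher uses:

```
def merge(base, override):
    """Recursively merge `override` into `base`, returning a new dict."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Layer 1: general configuration (config.yaml).
general = {"env_vars": {"NCCL_DEBUG": "WARN"}, "instance_type": "p5.48xlarge"}

# Layer 2: cluster configuration (for example, k8s.yaml).
cluster = {"cluster_type": "k8s"}

# Layer 3: recipe settings (model, data paths, and so on).
recipe = {"run": {"name": "llama-8b"}}

# Later layers win; nested dictionaries are merged rather than replaced.
job_config = merge(merge(general, cluster), recipe)
```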


## HyperPod recipe adapter repository
<a name="hyperpod-recipe-adapter"></a>

The SageMaker HyperPod training adapter is a training framework. You can use it to manage the entire lifecycle of your training jobs. Use the adapter to distribute the pre-training or fine-tuning of your models across multiple machines. The adapter uses different parallelism techniques to distribute the training. It also handles the implementation and management of saving the checkpoints. For more details, see [Advanced settings](cluster-specific-configurations-advanced-settings.md).

Use the [SageMaker HyperPod recipe adapter repository](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo) to use the recipe adapter.

1. `src`: This directory contains the implementation of Large-scale Language Model (LLM) training, encompassing various features such as model parallelism, mixed-precision training, and checkpointing management.

1. `examples`: This folder provides a collection of examples demonstrating how to create an entry point for training an LLM model, serving as a practical guide for users.

# General configuration
<a name="sagemaker-hyperpod-recipes-general-configuration"></a>

The `config.yaml` file specifies the training recipe and the cluster. It also includes runtime configurations such as environment variables for the training job.

```
defaults:
  - _self_
  - cluster: slurm 
  - recipes: training/llama/hf_llama3_8b_seq8192_gpu
instance_type: p5.48xlarge
git:
  repo_url_or_path: null
  branch: null
  commit: null
  entry_script: null
  token: null
env_vars:
  NCCL_DEBUG: WARN
```

You can modify the following parameters in `config.yaml`:

1. `defaults`: Specify your default settings, such as the default cluster or default recipes.

1. `instance_type`: Modify the Amazon EC2 instance type to match the instance type that you're using.

1. `git`: Specify the location of the SageMaker HyperPod recipe adapter repository for the training job.

1. `env_vars`: You can specify the environment variables to be passed into your runtime training job. For example, you can adjust the logging level of NCCL by specifying the `NCCL_DEBUG` environment variable.

The recipe is the core configuration that defines your training job architecture. This file includes many important pieces of information for your training job, such as the following:
+ Whether to use model parallelism
+ The source of your datasets
+ Mixed precision training
+ Checkpointing-related configurations

You can use the recipes as-is. You can also use the following information to modify them.

## run
<a name="run"></a>

The following is the basic run information for running your training job.

```
run:
  name: llama-8b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  model_type: hf
```

1. `name`: Specify the name for your training job in the configuration file.

1. `results_dir`: You can specify the directory where the results of your training job are stored.

1. `time_limit`: You can set a maximum training time for your training job to prevent it from occupying hardware resources for too long.

1. `model_type`: You can specify the type of model you are using. For example, you can specify `hf` if your model is from HuggingFace.

## exp_manager
<a name="exp-manager"></a>

The `exp_manager` section configures the experiment. With `exp_manager`, you can specify fields such as the output directory or checkpoint settings. The following is an example of how you can configure `exp_manager`.

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

1. `exp_dir`: The experiment directory includes the standard output and standard error files for your training job. By default, it uses your current directory.

1. `name`: The experiment name used to identify your experiment under the `exp_dir`.

1. `create_tensorboard_logger`: Specify `True` or `False` to enable or disable the TensorBoard logger.

## Checkpointing
<a name="checkpointing"></a>

SageMaker HyperPod supports three types of checkpointing:
+ Auto checkpointing
+ Manual checkpointing
+ Full checkpointing

### Auto checkpointing
<a name="auto-checkpointing"></a>

If you're saving or loading checkpoints that are automatically managed by the SageMaker HyperPod recipe adapter, you can enable `auto_checkpoint`. To enable `auto_checkpoint`, set `enabled` to `True`. You can use auto checkpointing for both training and fine-tuning, and with both shared file systems and Amazon S3.

```
exp_manager:
  checkpoint_dir: ${recipes.exp_manager.exp_dir}/checkpoints/
  auto_checkpoint:
    enabled: True
```

Auto checkpointing saves the `local_state_dict` asynchronously with an automatically computed optimal saving interval.

**Note**  
Under this checkpointing mode, auto-saved checkpoints don't support re-sharding between training runs. To resume from the latest auto-saved checkpoint, you must preserve the same shard degrees. You don't need to specify extra information to auto resume.
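The constraint above can be sketched as a simple pre-resume check. This is an illustrative model only, not the adapter's API; the field names are assumptions for the sketch:

```python
# Minimal sketch: an auto-saved checkpoint can only be resumed with the
# same shard degrees it was saved with. Field names are illustrative.
def check_resume_shard_degrees(saved: dict, current: dict) -> None:
    for key in ("tensor_model_parallel_degree", "context_parallel_degree"):
        if saved.get(key) != current.get(key):
            raise ValueError(
                f"auto checkpoint cannot be re-sharded: {key} was "
                f"{saved.get(key)}, now {current.get(key)}"
            )

# A run resumed with identical degrees passes the check.
check_resume_shard_degrees(
    {"tensor_model_parallel_degree": 4, "context_parallel_degree": 2},
    {"tensor_model_parallel_degree": 4, "context_parallel_degree": 2},
)
```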

### Manual checkpointing
<a name="manual-checkpointing"></a>

You can modify `checkpoint_callback_params` to asynchronously save an intermediate checkpoint in `sharded_state_dict` format. For example, you can specify the following configuration to enable sharded checkpointing every 10 steps and keep the latest 3 checkpoints.

Sharded checkpointing allows you to change the shard degrees between training runs and load the checkpoint by setting `resume_from_checkpoint`.

**Note**  
For PEFT fine-tuning, sharded checkpointing doesn't support Amazon S3.
Auto and manual checkpointing are mutually exclusive.
Only changes to the FSDP shard degrees and replication degrees are allowed.

```
exp_manager:
  checkpoint_callback_params:
    # Set save_top_k = 0 to disable sharded checkpointing
    save_top_k: 3
    every_n_train_steps: 10
    monitor: "step"
    mode: "max"
    save_last: False
  resume_from_checkpoint: ${recipes.exp_manager.exp_dir}/checkpoints/
```
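The retention behavior of this configuration can be sketched as follows: with `monitor: "step"` and `mode: "max"`, keeping the top-k checkpoints by step amounts to keeping the k most recent ones. This is an illustrative model of the behavior, not the adapter's implementation:

```python
def checkpoints_to_keep(total_steps, every_n_train_steps, save_top_k):
    """Sketch of checkpoint retention, assuming monitor="step" and
    mode="max" (top-k by step number == the k most recent checkpoints)."""
    if save_top_k == 0:  # save_top_k = 0 disables sharded checkpointing
        return []
    saved = [s for s in range(1, total_steps + 1) if s % every_n_train_steps == 0]
    return saved[-save_top_k:]

# With the example config (every 10 steps, keep 3), a 60-step run
# retains the checkpoints from steps 40, 50, and 60.
print(checkpoints_to_keep(60, 10, 3))  # [40, 50, 60]
```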

To learn more about checkpointing, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

### Full checkpointing
<a name="full-checkpointing"></a>

The exported `full_state_dict` checkpoint can be used for inference or fine-tuning. You can load a full checkpoint through `hf_model_name_or_path`. Under this mode, only the model weights are saved.

To export the `full_state_dict` model, set the following parameters.

**Note**  
Currently, full checkpointing isn't supported for Amazon S3 checkpointing. You can't set the S3 path for `exp_manager.checkpoint_dir` if you're enabling full checkpointing. However, you can set `exp_manager.export_full_model.final_export_dir` to a specific directory on your local filesystem while setting `exp_manager.checkpoint_dir` to an Amazon S3 path.

```
exp_manager:
  export_full_model:
    # Set every_n_train_steps = 0 to disable full checkpointing
    every_n_train_steps: 0
    save_last: True
    final_export_dir : null
```

## model
<a name="model"></a>

Define various aspects of your model architecture and training process. This includes settings for model parallelism, precision, and data handling. Below are the key components you can configure within the model section:

### model parallelism
<a name="model-parallelism"></a>

After you've specified the recipe, you define the model that you're training, including its model parallelism. For example, you can set `tensor_model_parallel_degree`. You can also enable features like training with FP8 precision. For example, you can train a model with tensor parallelism and context parallelism:

```
model:
  model_type: llama_v3
  # Base configs
  train_batch_size: 4
  val_batch_size: 1
  seed: 12345
  grad_clip: 1.0

  # Model parallelism
  tensor_model_parallel_degree: 4
  expert_model_parallel_degree: 1
  context_parallel_degree: 2
```
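The parallel degrees above partition the available accelerators. As a rough sketch (assuming an expert-parallel degree of 1, as in the example, and that the degrees divide the world size evenly), the remaining data-parallel degree can be derived like this:

```python
def data_parallel_degree(world_size, tensor_parallel, context_parallel):
    """Rough sketch: GPUs not consumed by tensor and context parallelism
    are left for data parallelism. Assumes an even division."""
    model_parallel = tensor_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by the model-parallel degree")
    return world_size // model_parallel

# With the example config (tensor=4, context=2) on 64 GPUs,
# 8 data-parallel replicas remain.
print(data_parallel_degree(64, 4, 2))  # 8
```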

To gain a better understanding of different types of model parallelism techniques, you can refer to the following approaches:

1. [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md)

1. [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md)

1. [Context parallelism](model-parallel-core-features-v2-context-parallelism.md)

1. [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md)

### FP8
<a name="fp8"></a>

To enable FP8 (8-bit floating-point precision), you can specify the FP8-related configuration in the following example:

```
model:
  # FP8 config
  fp8: True
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
```

It's important to note that the FP8 data format is currently supported only on the P5 instance type. If you are using an older instance type, such as P4, disable the FP8 feature for your model training process. For more information about FP8, see [Mixed precision training](model-parallel-core-features-v2-mixed-precision.md).
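The `fp8_amax_history_len` and `fp8_amax_compute_algo` settings control delayed scaling: a history of recent absolute-maximum (amax) values is reduced (here with `max`) and used to derive a scaling factor so tensors fit FP8's narrow range. The following is a simplified sketch of that idea, using the E4M3 maximum of 448.0; the actual implementation (for example, in NVIDIA Transformer Engine) is more involved:

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def fp8_scale(amax_history, compute_algo="max"):
    """Simplified delayed-scaling sketch: reduce the amax history and
    derive the factor that maps the observed range onto FP8's range."""
    amax = max(amax_history) if compute_algo == "max" else amax_history[-1]
    return FP8_E4M3_MAX / amax

# If recent steps observed absolute maxima of 1.0, 2.0, and 4.0,
# tensors are scaled by 448 / 4 = 112 before the FP8 cast.
print(fp8_scale([1.0, 2.0, 4.0]))  # 112.0
```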

### data
<a name="data"></a>

You can specify your custom datasets for your training job by adding the data paths under data. The data module in our system supports the following data formats:

1. JSON

1. JSONGZ (Compressed JSON)

1. ARROW

However, you are responsible for preparing your own pre-tokenized dataset. If you're an advanced user with specific requirements, there is also an option to implement and integrate a customized data module. For more information on HuggingFace datasets, see [Datasets](https://huggingface.co/docs/datasets/v3.1.0/en/index).

```
model:
  data:
    train_dir: /path/to/your/train/data
    val_dir: /path/to/your/val/data
    dataset_type: hf
    use_synthetic_data: False
```
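As a sketch of what a pre-tokenized dataset in the JSON format might look like, the snippet below writes one JSON record per line under a stand-in for `train_dir`. The `input_ids` field name and the token values are assumptions for illustration; check the recipe's data module for the exact schema it expects:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical pre-tokenized samples: each record holds token IDs
# produced by your tokenizer ahead of time.
samples = [
    {"input_ids": [101, 2054, 2003, 102]},
    {"input_ids": [101, 7592, 2088, 102]},
]

train_dir = Path(tempfile.mkdtemp())  # stand-in for model.data.train_dir
path = train_dir / "part-0.json"
with open(path, "w") as f:
    for record in samples:
        f.write(json.dumps(record) + "\n")  # one JSON record per line

# Reading the file back yields the same records.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded == samples)  # True
```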

You can specify how you're training the model. By default, the recipe uses pre-training instead of fine-tuning. The following example configures the recipe to run a fine-tuning job with LoRA (Low-Rank Adaptation).

```
model:
  # Fine tuning config
  do_finetune: True
  # The path to resume from, needs to be HF compatible
  hf_model_name_or_path: null
  hf_access_token: null
  # PEFT config
  peft:
    peft_type: lora
    rank: 32
    alpha: 16
    dropout: 0.1
```
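To see why the LoRA settings above keep fine-tuning cheap: instead of updating a full `d_out x d_in` weight, LoRA trains two low-rank factors of shapes `d_out x rank` and `rank x d_in`, with the update scaled by `alpha / rank`. A quick back-of-the-envelope calculation, using an illustrative 4096x4096 projection (the dimension is an assumption, not part of the recipe):

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in

# For a hypothetical 4096x4096 projection with rank 32:
full = 4096 * 4096                             # 16,777,216 frozen weights
lora = lora_trainable_params(4096, 4096, 32)   # 262,144 trainable weights
print(lora, round(lora / full * 100, 2))  # 262144 1.56
```

With these numbers, the trainable factors are under 2% of the frozen weight's size, which is why PEFT checkpoints stay small.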

For information about the recipes, see [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

# Cluster-specific configurations
<a name="cluster-specific-configurations"></a>

SageMaker HyperPod offers flexibility in running training jobs across different cluster environments. Each environment has its own configuration requirements and setup process. This section outlines the steps and configurations needed for running training jobs in SageMaker HyperPod Slurm, SageMaker HyperPod k8s, and SageMaker training jobs. Understanding these configurations is crucial for effectively leveraging the power of distributed training in your chosen environment.

You can use a recipe in the following cluster environments:
+ SageMaker HyperPod Slurm Orchestration
+ SageMaker HyperPod Amazon Elastic Kubernetes Service Orchestration
+ SageMaker training jobs

To launch a training job in a cluster, set and install the corresponding cluster configuration and environment.

**Topics**
+ [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md)
+ [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md)
+ [Running a SageMaker training job](cluster-specific-configurations-run-sagemaker-training-job.md)

# Running a training job on HyperPod Slurm
<a name="cluster-specific-configurations-run-training-job-hyperpod-slurm"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU/Trainium Slurm cluster. Before you submit the training job, update the cluster configuration. Use one of the following methods to update the cluster configuration:
+ Modify `slurm.yaml`
+ Override it through the command line

After you've updated the cluster configuration, install the environment.

## Configure the cluster
<a name="cluster-specific-configurations-configure-cluster-slurm-yaml"></a>

To submit a training job to a Slurm cluster, specify the Slurm-specific configuration. Modify `slurm.yaml` to configure the Slurm cluster. The following is an example of a Slurm cluster configuration. You can modify this file for your own training needs:

```
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False 
stderr_to_stdout: True
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia" 
  post_launch_commands: 
container_mounts: 
  - "/fsx:/fsx"
```

1. `job_name_prefix`: Specify a job name prefix to easily identify your submissions to the Slurm cluster.

1. `slurm_create_submission_file_only`: Set this configuration to `True` for a dry run to help you debug.

1. `stderr_to_stdout`: Specify whether you're redirecting your standard error (stderr) to standard output (stdout).

1. `srun_args`: Customize additional srun configurations, such as excluding specific compute nodes. For more information, see the srun documentation.

1. `slurm_docker_cfg`: The SageMaker HyperPod recipe launcher launches a Docker container to run your training job. You can specify additional Docker arguments within this parameter.

1. `container_mounts`: Specify the volumes you're mounting into the container for the recipe launcher, for your training jobs to access the files in those volumes.

# Running a training job on HyperPod k8s
<a name="cluster-specific-configurations-run-training-job-hyperpod-k8s"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU/Trainium Kubernetes cluster. Before you submit the training job, do one of the following:
+ Modify the `k8s.yaml` cluster configuration file
+ Override the cluster configuration through the command line

After you've done either of the preceding steps, install the corresponding environment.

## Configure the cluster using `k8s.yaml`
<a name="cluster-specific-configurations-configure-cluster-k8s-yaml"></a>

To submit a training job to a Kubernetes cluster, you specify Kubernetes-specific configurations. The configurations include the cluster namespace or the location of the persistent volume.

```
pullPolicy: Always
restartPolicy: Never
namespace: default
persistent_volume_claims:
  - null
```

1. `pullPolicy`: You can specify the pull policy when you submit a training job. If you specify "Always," the Kubernetes cluster always pulls your image from the repository. For more information, see [Image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy).

1. `restartPolicy`: Specify whether to restart your training job if it fails.

1. `namespace`: You can specify the Kubernetes namespace where you're submitting the training job.

1. `persistent_volume_claims`: You can specify a shared volume for your training job for all training processes to access the files in the volume.

# Running a SageMaker training job
<a name="cluster-specific-configurations-run-sagemaker-training-job"></a>

SageMaker HyperPod Recipes supports submitting a SageMaker training job. Before you submit the training job, you must update the cluster configuration, `sm_job.yaml`, and install the corresponding environment.

## Use your recipe as a SageMaker training job
<a name="cluster-specific-configurations-cluster-config-sm-job-yaml"></a>

You can use your recipe as a SageMaker training job if you aren't hosting a cluster. You must modify the SageMaker training job configuration file, `sm_job.yaml`, to run your recipe.

```
sm_jobs_config:
  output_path: null 
  tensorboard_config:
    output_path: null 
    container_logs_path: null
  wait: True 
  inputs: 
    s3: 
      train: null
      val: null
    file_system:  
      directory_path: null
  additional_estimator_kwargs: 
    max_run: 1800
```

1. `output_path`: You can specify an Amazon S3 URL where your model is saved.

1. `tensorboard_config`: You can specify TensorBoard-related configurations, such as the output path or the TensorBoard logs path.

1. `wait`: You can specify whether to wait for the job to complete when you submit your training job.

1. `inputs`: You can specify the paths for your training and validation data. The data source can be from a shared filesystem such as Amazon FSx or an Amazon S3 URL.

1. `additional_estimator_kwargs`: Additional estimator arguments for submitting a training job to the SageMaker training job platform. For more information, see [Algorithm Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html).

# Considerations
<a name="cluster-specific-configurations-special-considerations"></a>

When you're using Amazon SageMaker HyperPod recipes, there are some factors that can impact the model training process.
+ The `transformers` version must be `4.45.2` or greater for Llama 3.2. If you're using a Slurm or Kubernetes workflow, the version is automatically updated.
+ Mixtral doesn't support 8-bit floating-point precision (FP8).
+ Amazon EC2 P4 instances don't support FP8.

# Advanced settings
<a name="cluster-specific-configurations-advanced-settings"></a>

The SageMaker HyperPod recipe adapter is built on top of the NVIDIA NeMo and PyTorch Lightning frameworks. If you've already used these frameworks, integrating your custom models or features into the SageMaker HyperPod recipe adapter is a similar process. In addition to modifying the recipe adapter, you can change your own pre-training or fine-tuning script. For guidance on writing your custom training script, see [examples](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/tree/main/examples).

## Use the SageMaker HyperPod adapter to create your own model
<a name="cluster-specific-configurations-use-hyperpod-adapter-create-model"></a>

Within the recipe adapter, you can customize the following files in the following locations:

1. `collections/data`: Contains a module responsible for loading datasets. Currently, it only supports datasets from HuggingFace. If you have more advanced requirements, the code structure allows you to add custom data modules within the same folder.

1. `collections/model`: Includes the definitions of various language models. Currently, it supports common large language models like Llama, Mixtral, and Mistral. You have the flexibility to introduce your own model definitions within this folder.

1. `collections/parts`: This folder contains strategies for training models in a distributed manner. One example is the Fully Sharded Data Parallel (FSDP) strategy, which allows for sharding a large language model across multiple accelerators. Additionally, the strategies support various forms of model parallelism. You also have the option to introduce your own customized training strategies for model training.

1. `utils`: Contains various utilities that facilitate the management of a training job. It serves as a repository for your own tools, which you can use for tasks such as troubleshooting or benchmarking. You can also add your own PyTorch Lightning callbacks within this folder to seamlessly integrate specific functionalities or operations into the training lifecycle.

1. `conf`: Contains the configuration schema definitions used for validating specific parameters in a training job. If you introduce new parameters or configurations, you can add a customized schema to this folder to define the validation rules, such as data types, ranges, or any other parameter constraints.
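As an example of the kind of PyTorch Lightning callback you might add under `utils`, the sketch below shows a minimal step-timing callback. It mirrors Lightning's hook names but is written as plain Python (no `lightning` import) so the structure stands on its own; a real callback would subclass `lightning.pytorch.callbacks.Callback`:

```python
import time

class StepTimerCallback:
    """Illustrative callback: records how long each training batch takes.
    The hook names mirror PyTorch Lightning's Callback interface."""

    def __init__(self):
        self.durations = []
        self._start = None

    def on_train_batch_start(self, trainer=None, pl_module=None, batch=None, batch_idx=0):
        self._start = time.perf_counter()

    def on_train_batch_end(self, trainer=None, pl_module=None, outputs=None, batch=None, batch_idx=0):
        self.durations.append(time.perf_counter() - self._start)

# Simulated use outside a trainer: one start/end pair records one duration.
cb = StepTimerCallback()
cb.on_train_batch_start()
cb.on_train_batch_end()
print(len(cb.durations))  # 1
```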

# Appendix
<a name="appendix"></a>

Use the following information to monitor and analyze training results.

## Monitor training results
<a name="monitor-training-results"></a>

Monitoring and analyzing training results is essential for developers to assess convergence and troubleshoot issues. SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. To address the challenges of profiling large distributed training jobs, the recipes also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code execution. For more information about VizTracer, see [VizTracer](https://viztracer.readthedocs.io/en/latest/installation.html).

The following sections guide you through the process of implementing these features in your SageMaker HyperPod recipes.

### TensorBoard
<a name="tensorboard"></a>

TensorBoard is a powerful tool for visualizing and analyzing the training process. To enable TensorBoard, modify your recipe by setting the following parameter:

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

After you enable the TensorBoard logger, the training logs are generated and stored within the experiment directory, which is defined in `exp_manager.exp_dir`. To access and analyze these logs locally, use the following procedure:

**To access and analyze logs**

1. Download the Tensorboard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch TensorBoard with the following command.

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit `http://localhost:<port>`.

You can now see the status and visualizations of your training jobs within the TensorBoard interface. This helps you monitor and analyze the training process and gain insights into the behavior and performance of your models. For more information about monitoring and analyzing training with TensorBoard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/index.html).

### VizTracer
<a name="viztracer"></a>

To enable VizTracer, modify your recipe by setting the `model.viztracer.enabled` parameter to `true`. For example, you can update your Llama recipe to enable VizTracer by adding the following configuration:

```
model:
  viztracer:
    enabled: true
```

After the training has completed, your VizTracer profile is in the experiment folder at `exp_dir/result.json`. To analyze your profile, download it and open it using the vizviewer tool:

```
vizviewer --port <port> result.json
```

This command launches vizviewer on the port that you specify. To view your VizTracer profile, open `http://localhost:<port>` in your browser. After you open the profile, you can begin analyzing the training. For more information about using VizTracer, see the VizTracer documentation.

## SageMaker JumpStart versus SageMaker HyperPod
<a name="sagemaker-jumpstart-vs-hyperpod"></a>

While SageMaker JumpStart provides fine-tuning capabilities, the SageMaker HyperPod recipes provide the following:
+ Additional fine-grained control over the training loop
+ Recipe customization for your own models and data
+ Support for model parallelism

Use the SageMaker HyperPod recipes when you need access to the model's hyperparameters, multi-node training, and customization options for the training loop.

For more information about fine-tuning your models in SageMaker JumpStart, see [Fine-tune publicly available foundation models with the `JumpStartEstimator` class](jumpstart-foundation-models-use-python-sdk-estimator-class.md).