

# Amazon SageMaker HyperPod checkpointless training tutorials
<a name="sagemaker-eks-checkpointless-recipes"></a>

[HyperPod checkpointless training recipes](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) are predefined job configurations with checkpointless training features enabled. These recipes make it easier to get started with checkpointless training on HyperPod.

**Topics**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md)

# Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-finetune"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-finetune-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment as described in the following section.
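
Before launching, it can help to fail fast on dataset paths that aren't in one of the accepted formats. The helper below is purely illustrative (it is not part of any AWS tooling) and simply maps a file name to JSON, JSONGZ, or ARROW:

```python
# Hypothetical helper: classify a dataset file by the formats the recipes accept.
import pathlib

def dataset_format(path: str) -> str:
    name = pathlib.Path(path).name.lower()
    if name.endswith((".json.gz", ".jsonl.gz")):
        return "JSONGZ"  # compressed JSON
    if name.endswith((".json", ".jsonl")):
        return "JSON"
    if name.endswith(".arrow"):
        return "ARROW"
    raise ValueError(f"unsupported dataset format: {path}")

print(dataset_format("train.jsonl.gz"))  # JSONGZ
```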

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-finetune-recipes-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is at least 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with a predefined job YAML:

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

## Launch training jobs with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-finetune-launcher"></a>

You can use the Amazon SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh`

   Set the `CONTAINER` variable to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
gpt-oss-120b-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

## Launch the training job with kubectl and a predefined YAML
<a name="sagemaker-eks-checkpointless-recipes-finetune-kubectl"></a>

Another option is to launch the training job through kubectl with a predefined job YAML.

1. Update `examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:
   + `image`: A Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
gpt-oss-120b-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-peft"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment as described in the following section.

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is at least 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:
   + SageMaker HyperPod recipes:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```
   + kubectl with a predefined job YAML:

     ```
     # install SageMaker HyperPod checkpointless training.
     git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
     cd sagemaker-hyperpod-checkpointless-training
     ```

You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

## Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh`

   Set the `CONTAINER` variable to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
gpt-oss-120b-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

## Launch the training job with kubectl and a predefined YAML
<a name="sagemaker-eks-checkpointless-recipes-peft-kubectl"></a>

Another option is to launch the training job through kubectl with a predefined job YAML.

1. Update `examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml`:
   + `image`: A Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html#sagemaker-eks-checkpointless-recipes-peft-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                                    READY   STATUS    RESTARTS   AGE
gpt-120b-lora-checkpointless-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment as described in the following section.

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is at least 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with a predefined job YAML:

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh`

   Set the `CONTAINER` variable to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=training/llama/checkpointless_llama3_70b_pretrain \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.data.global_batch_size=16 \
       recipes.data.micro_batch_size=4 \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.
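
The recipe overrides above set `recipes.data.global_batch_size=16` and `recipes.data.micro_batch_size=4`. In Megatron-style data-parallel training these are typically related by global batch = micro batch × data-parallel size × gradient-accumulation steps. The sketch below is illustrative only (the data-parallel sizes shown are hypothetical, not taken from the recipe):

```python
# Illustrative relationship between the recipe's batch-size overrides.
def grad_accumulation_steps(global_bs: int, micro_bs: int, dp_size: int) -> int:
    # global batch must be an exact multiple of micro batch * data-parallel size
    assert global_bs % (micro_bs * dp_size) == 0, "global batch must divide evenly"
    return global_bs // (micro_bs * dp_size)

# With global_batch_size=16 and micro_batch_size=4 from the recipe:
print(grad_accumulation_steps(16, 4, 4))  # 1: no gradient accumulation across 4 replicas
print(grad_accumulation_steps(16, 4, 2))  # 2: two micro-batches per step across 2 replicas
```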

## Method 2: Launch the training job with kubectl and a predefined YAML
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubectl"></a>

Another option is to launch the training job through kubectl with a predefined job YAML.

1. Update the `examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `pretrain_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-pretrain-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-peft-llama"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment as described in the following section.

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is at least 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with a predefined job YAML:

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh`

   Set the `CONTAINER` variable to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/llama/checkpointless_llama3_70b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

## Method 2: Launch the training job with kubectl and a predefined YAML
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubectl"></a>

Another option is to launch the training job through kubectl with a predefined job YAML.

1. Update the `examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning Container image. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-70b-lora-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` changes to `Completed` when the job finishes; you can confirm this by running `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models
<a name="sagemaker-eks-checkpointless-recipes-custom"></a>

The following sequence of steps is required to run checkpointless training with your custom model on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-custom-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location, such as an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment as described in the following section.

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-custom-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is at least 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies

   ```
   # install SageMaker HyperPod checkpointless training.
   git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
   cd sagemaker-hyperpod-checkpointless-training
   ```

## Checkpointless training modification instructions
<a name="sagemaker-eks-checkpointless-recipes-custom-modification-instructions"></a>

To incrementally adopt checkpointless training for custom models, follow the integration guide (this section uses Llama 3 70B pretraining as an example), which covers:
+ Fast communicator creation
+ Memory-mapped dataloader (MMAP)
+ In-process and checkpointless recovery

### Component 1: Fast communicator creation
<a name="sagemaker-eks-checkpointless-recipes-custom-component1"></a>

This component optimizes the time needed to establish connections between the workers. No code changes are required; you only need to set environment variables:

```
  # Enable rootless features
  export HPCT_USE_ROOTLESS=1
  sysctl -w net.ipv4.ip_local_port_range="20000 65535"

  hyperpodrun --nproc_per_node=8 \
              ...
              --inprocess-restart \
              ...
```

The full change can be found in the [Llama 3 70B pretrain launch job config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml).

### Component 2: Memory-mapped dataloader (MMAP)
<a name="sagemaker-eks-checkpointless-recipes-custom-component2"></a>

MMAP caches store pre-fetched data samples and enable training to start immediately, without waiting for data preprocessing. Adopting it requires minimal code changes: wrap your existing dataloader.

```
data_module = MMAPDataModule(
  data_module=base_data_module,
  mmap_config=CacheResumeMMAPConfig(cache_dir=…)
)
```
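
To see why a memory-mapped cache allows an immediate start, consider the toy sketch below. It is standard-library only and unrelated to the actual HyperPod implementation: fixed-width records are written once, and any sample can then be read by offset without re-reading or re-preprocessing the whole file:

```python
# Toy memory-mapped sample cache: sample i lives at offset i * RECORD_SIZE,
# so a restart can read it directly from the OS page cache.
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 8  # one int64 "token" per sample, for illustration

def write_cache(path, samples):
    # one-time preprocessing step: serialize samples as fixed-width records
    with open(path, "wb") as f:
        for s in samples:
            f.write(struct.pack("<q", s))

def read_sample(path, i):
    # random access without loading the whole file into memory
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (value,) = struct.unpack_from("<q", m, i * RECORD_SIZE)
        return value

cache = os.path.join(tempfile.mkdtemp(), "samples.bin")
write_cache(cache, [10, 20, 30])
print(read_sample(cache, 2))  # 30
```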

### Components 3 and 4: In-process and checkpointless recovery
<a name="sagemaker-eks-checkpointless-recipes-custom-components3-4"></a>

This enables failure recovery without restarting training processes or loading from checkpoints. Additional code changes are needed: update the strategy and training config, and wrap your existing main function.

```
@HPWrapper(
  health_check=CudaHealthCheck(),
  hp_api_factory=HPAgentK8sAPIFactory(),
  abort_timeout=60.0,
...)
def run_main(
  cfg,
  caller: Optional[HPCallWrapper] = None):
...


CheckpointlessMegatronStrategy(
  **self.cfg.strategy,
  ddp=self.ddp,
)
```

The full change can be found in the [Llama 3 70B pretrain entry script](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/llama3_70b_pretrain_checkpointless.py), and the corresponding training config change can be found in the [Llama 3 70B training config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/config/llama3_70b_peft_checkpointless.yaml).
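
The recovery flow can be pictured with a minimal, purely illustrative sketch (none of these names come from the HyperPod APIs): a transient fault aborts the current step, but because training state stays in memory, the loop re-enters and resumes from the last completed step instead of restarting the process or reloading a checkpoint. The injected fault is loosely analogous to the `test_fault_config` override used in the launch scripts:

```python
# Illustrative in-process recovery loop, not the HyperPod implementation.
class TransientFault(RuntimeError):
    pass

def train(num_steps, fault_at=None):
    state = {"step": 0, "loss_sum": 0.0}  # stand-in for in-memory model state
    faults_left = 1 if fault_at is not None else 0
    while state["step"] < num_steps:
        try:
            if faults_left and state["step"] == fault_at:
                faults_left -= 1
                raise TransientFault("injected fault")  # simulate a worker fault
            state["loss_sum"] += 1.0  # stand-in for one optimizer step
            state["step"] += 1
        except TransientFault:
            # in-process recovery: re-enter the loop with in-memory state intact
            continue
    return state

print(train(5, fault_at=3))  # {'step': 5, 'loss_sum': 5.0}
```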

### Launch training
<a name="sagemaker-eks-checkpointless-recipes-custom-launch"></a>

You can now launch the checkpointless training using kubectl.

```
kubectl apply -f your_job_config.yaml
```