AWS Batch support for SageMaker AI training jobs

An AWS Batch job queue stores and prioritizes submitted jobs before they run on compute resources. You can submit SageMaker AI training jobs to a job queue in order to take advantage of the serverless job scheduling and prioritization tools provided by AWS Batch.

How it works

The following steps describe how to use an AWS Batch job queue with SageMaker AI training jobs. For more detailed tutorials and example notebooks, see the Get started section.

  • Set up AWS Batch and any necessary permissions. For more information, see Setting up AWS Batch in the AWS Batch User Guide.

  • Create the required AWS Batch resources, a service environment and a SageMaker AI training job queue that is connected to it, in the console or using the AWS CLI.

  • Configure the request details for your SageMaker AI training job, such as the training container image. To submit a training job to an AWS Batch queue, you can use the AWS CLI, the AWS SDK for Python (Boto3), or the SageMaker AI Python SDK.

  • Submit your training jobs to the job queue. You can use the following options to submit jobs:

    • Use the AWS Batch SubmitServiceJob API.

    • Use the aws_batch module from the SageMaker AI Python SDK. After creating a TrainingQueue object and a model training object (such as an Estimator or ModelTrainer), you can submit training jobs to the TrainingQueue using the queue.submit() method.

  • After submitting jobs, view your job queue and job status with the AWS Batch console, the AWS Batch DescribeServiceJob API, or the SageMaker AI DescribeTrainingJob API.
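The submit and monitor steps above can be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch, not a definitive implementation. It assumes that the service request payload mirrors a SageMaker AI CreateTrainingJob request and that SUCCEEDED and FAILED are the terminal job states. The helper names (build_submit_request, submit_and_wait), the queue name, and every value shown in the usage comments are hypothetical placeholders.

```python
import json
import time

# Assumption: these are the terminal states reported by DescribeServiceJob.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def build_submit_request(job_name, job_queue, training_job_request):
    """Build keyword arguments for the AWS Batch SubmitServiceJob API."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "serviceJobType": "SAGEMAKER_TRAINING",
        # The training job request is passed to AWS Batch as a JSON string.
        "serviceRequestPayload": json.dumps(training_job_request),
    }


def submit_and_wait(batch_client, request, poll_seconds=30, max_polls=120):
    """Submit a service job, then poll DescribeServiceJob until it finishes.

    Returns (job_id, last_status). The last status may be non-terminal if
    max_polls is reached first.
    """
    job_id = batch_client.submit_service_job(**request)["jobId"]
    status = None
    for _ in range(max_polls):
        status = batch_client.describe_service_job(jobId=job_id)["status"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return job_id, status


# Usage (hypothetical values; requires boto3 and AWS credentials):
#   import boto3
#   training_request = {
#       "TrainingJobName": "my-training-job",
#       "AlgorithmSpecification": {
#           "TrainingImage": "<your training container image URI>",
#           "TrainingInputMode": "File",
#       },
#       "RoleArn": "<your SageMaker AI execution role ARN>",
#       "OutputDataConfig": {"S3OutputPath": "<your S3 output path>"},
#       "ResourceConfig": {
#           "InstanceType": "ml.m5.xlarge",
#           "InstanceCount": 1,
#           "VolumeSizeInGB": 30,
#       },
#       "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
#   }
#   request = build_submit_request("my-job", "my-training-queue", training_request)
#   job_id, status = submit_and_wait(boto3.client("batch"), request)
```

Passing the client in as an argument keeps the request-building logic free of network calls, so you can exercise it with a stub before you run it against a real job queue.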

Cost and availability

For detailed pricing information about training jobs, see Amazon SageMaker AI pricing. With AWS Batch, you pay only for the AWS resources that you use, such as Amazon EC2 instances. For more information, see AWS Batch pricing.

You can use AWS Batch for SageMaker AI training jobs in any AWS Region where training jobs are available. For more information, see Amazon SageMaker AI endpoints and quotas.

To ensure that you have the required capacity when you need it, you can use SageMaker AI Flexible Training Plans (FTP). These plans allow you to reserve capacity for your training jobs. When combined with the queuing capabilities of AWS Batch, they help you maximize utilization for the duration of your plan. For more information, see Reserve training plans for your training jobs or HyperPod clusters.

Get started

For a tutorial on how to set up an AWS Batch job queue and submit SageMaker AI training jobs, see Getting started with AWS Batch on SageMaker AI in the AWS Batch User Guide.
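As a rough illustration of what that setup involves, the following sketch builds Boto3 requests for a service environment and a job queue attached to it. It is a sketch under stated assumptions, not the tutorial's own code. It assumes that the CreateServiceEnvironment and CreateJobQueue APIs accept the SAGEMAKER_TRAINING type and that capacity is limited in NUM_INSTANCES units. The helper names and all resource names are hypothetical.

```python
def service_environment_request(name, max_instances):
    """Build keyword arguments for the AWS Batch CreateServiceEnvironment API."""
    return {
        "serviceEnvironmentName": name,
        "serviceEnvironmentType": "SAGEMAKER_TRAINING",
        # Assumption: capacity is capped by the number of training instances.
        "capacityLimits": [
            {"maxCapacity": max_instances, "capacityUnit": "NUM_INSTANCES"}
        ],
    }


def job_queue_request(name, service_environment_arn, priority=1):
    """Build keyword arguments for the AWS Batch CreateJobQueue API."""
    return {
        "jobQueueName": name,
        "jobQueueType": "SAGEMAKER_TRAINING",
        "state": "ENABLED",
        "priority": priority,
        "serviceEnvironmentOrder": [
            {"order": 1, "serviceEnvironment": service_environment_arn}
        ],
    }


def create_training_queue(batch_client, env_name, queue_name, max_instances):
    """Create a service environment, then a job queue that points at it.

    A real setup may need to wait for the service environment to become
    valid before creating the queue; that wait is omitted in this sketch.
    """
    env = batch_client.create_service_environment(
        **service_environment_request(env_name, max_instances)
    )
    queue = batch_client.create_job_queue(
        **job_queue_request(queue_name, env["serviceEnvironmentArn"])
    )
    return queue["jobQueueArn"]


# Usage (hypothetical names; requires boto3 and AWS credentials):
#   import boto3
#   arn = create_training_queue(
#       boto3.client("batch"), "my-service-env", "my-training-queue", 4
#   )
```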

For Jupyter notebooks that show how to use the aws_batch module in the SageMaker AI Python SDK, see the AWS Batch for SageMaker AI Training jobs notebook examples in the amazon-sagemaker-examples GitHub repository.