AWS Batch support for SageMaker AI training jobs
An AWS Batch job queue stores and prioritizes submitted jobs before they run on compute resources. You can submit SageMaker AI training jobs to a job queue to take advantage of the serverless job scheduling and prioritization tools provided by AWS Batch.
How it works
The following steps describe how to use an AWS Batch job queue with SageMaker AI training jobs. For more detailed tutorials and example notebooks, see the Get started section.
- Set up AWS Batch and any necessary permissions. For more information, see Setting up AWS Batch in the AWS Batch User Guide.
- Create the following AWS Batch resources in the console or using the AWS CLI:
  - Service environment – Contains configuration parameters for integrating with SageMaker AI.
  - SageMaker AI training job queue – Integrates with SageMaker AI to submit training jobs.
- Configure the request for your SageMaker AI training job, including details such as your training container image. To submit a training job to an AWS Batch queue, you can use the AWS CLI, the AWS SDK for Python (Boto3), or the SageMaker AI Python SDK.
- Submit your training jobs to the job queue. You can use the following options to submit jobs:
  - Use the AWS Batch SubmitServiceJob API.
  - Use the aws_batch module from the SageMaker AI Python SDK. After creating a TrainingQueue object and a model training object (such as an Estimator or ModelTrainer), you can submit training jobs to the TrainingQueue using the queue.submit() method.
- After submitting jobs, view your job queue and job status with the AWS Batch console, the AWS Batch DescribeServiceJob API, or the SageMaker AI DescribeTrainingJob API.
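The configure, submit, and monitor steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. It assumes that the Boto3 Batch client exposes SubmitServiceJob and DescribeServiceJob as `submit_service_job` and `describe_service_job`, that the request takes `jobName`, `jobQueue`, `serviceJobType`, and `serviceRequestPayload`, and that the payload is a JSON string shaped like a SageMaker AI CreateTrainingJob request. The helper names, the default instance settings, and the queue name, image URI, role ARN, and S3 path in the usage comment are all hypothetical placeholders. Check the AWS Batch API reference before relying on any of these details.

```python
import json
from typing import Any


def build_training_payload(
    job_name: str,
    image_uri: str,
    role_arn: str,
    output_s3_uri: str,
    instance_type: str = "ml.m5.xlarge",  # hypothetical default
    instance_count: int = 1,
    max_runtime_seconds: int = 3600,
) -> str:
    """Return a JSON string shaped like a CreateTrainingJob request.

    Assumption: AWS Batch accepts this JSON string as the service
    request payload for a SageMaker AI training job.
    """
    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }
    return json.dumps(request)


def submit_training_job(
    batch_client: Any, queue_name: str, job_name: str, payload: str
) -> str:
    """Submit the payload to a job queue and return the AWS Batch job ID.

    Assumption: batch_client behaves like boto3.client("batch") and the
    response contains a "jobId" key.
    """
    response = batch_client.submit_service_job(
        jobName=job_name,
        jobQueue=queue_name,
        serviceJobType="SAGEMAKER_TRAINING",
        serviceRequestPayload=payload,
    )
    return response["jobId"]


def get_job_status(batch_client: Any, job_id: str) -> str:
    """Return the status string reported by DescribeServiceJob.

    Assumption: the response contains a top-level "status" key.
    """
    return batch_client.describe_service_job(jobId=job_id)["status"]


# Example usage (requires boto3 and AWS credentials; every value below
# is a placeholder):
#
#   import boto3
#   batch = boto3.client("batch")
#   payload = build_training_payload(
#       job_name="my-training-job",
#       image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest",
#       role_arn="arn:aws:iam::<account>:role/<execution-role>",
#       output_s3_uri="s3://<bucket>/output/",
#   )
#   job_id = submit_training_job(batch, "my-training-queue", "my-training-job", payload)
#   print(get_job_status(batch, job_id))
```

Passing the client in as an argument, rather than creating it inside each function, keeps the helpers easy to exercise with a stub client before you run them against a real queue.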
Cost and availability
For detailed pricing information about training jobs, see Amazon SageMaker AI pricing.
You can use AWS Batch for SageMaker AI training jobs in any AWS Region where training jobs are available. For more information, see Amazon SageMaker AI endpoints and quotas.
To ensure you have the required capacity when you need it, you can use SageMaker AI Flexible Training Plans (FTP). These plans allow you to reserve capacity for your training jobs. When combined with AWS Batch's queuing capabilities, you can maximize utilization during your plan's duration. For more information, see Reserve training plans for your training jobs or HyperPod clusters.
Get started
For a tutorial on how to set up an AWS Batch job queue and submit SageMaker AI training jobs, see Getting started with AWS Batch on SageMaker AI in the AWS Batch User Guide.
For Jupyter notebooks that show how to use the aws_batch module in the SageMaker AI Python SDK, see the AWS Batch for SageMaker AI Training jobs notebook examples in the amazon-sagemaker-examples GitHub repository.