Mapping AWS Batch service job status to SageMaker AI status

When you submit jobs to a SageMaker job queue using SubmitServiceJob, AWS Batch manages the job lifecycle and maps AWS Batch job states to equivalent SageMaker Training job states. Service jobs, such as SageMaker Training jobs, follow a different state lifecycle than traditional container jobs. While service jobs share most states with container jobs, they introduce the SCHEDULED state and exhibit different retry behaviors, particularly for handling insufficient capacity errors from the target service.
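
As a rough illustration of this lifecycle, the sketch below polls a service job until it reaches one of the terminal states named in the table (SUCCEEDED or FAILED) and records each distinct state it passes through. The function name `wait_for_service_job`, its parameters, and the `get_status` callable are hypothetical and not part of AWS Batch. In practice `get_status` would wrap whatever call you use to read the job's current AWS Batch status.

```python
import time
from typing import Callable, List

# Terminal AWS Batch states for a service job, per the status table.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_service_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until the job is terminal; return the distinct states observed.

    get_status is a hypothetical stand-in for a function that returns the
    job's current AWS Batch status string (for example "RUNNABLE").
    """
    seen: List[str] = []
    for _ in range(max_polls):
        status = get_status()
        # Record a state only when it differs from the previous observation.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
    raise TimeoutError(f"job still {seen[-1]} after {max_polls} polls")
```

Because a job can remain in RUNNABLE indefinitely while capacity is unavailable, the sketch bounds the wait with `max_polls` instead of looping forever.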

The following table shows each AWS Batch job state and the corresponding SageMaker AI primary status and secondary status:

| Batch Status | SageMaker AI Primary Status | SageMaker AI Secondary Status | Description |
| SUBMITTED | N/A | N/A | Job submitted to queue, waiting for scheduler evaluation. |
| RUNNABLE | N/A | N/A | Job is queued and ready for scheduling. Jobs in this state are started as soon as sufficient resources are available in the service environment. Jobs can remain in this state indefinitely when sufficient resources are unavailable. |
| SCHEDULED | InProgress | Pending | Service job successfully submitted to SageMaker AI. |
| STARTING | InProgress | Downloading | SageMaker Training job downloading data and images. Training job capacity has been acquired and job initialization begins. |
| RUNNING | InProgress | Training | SageMaker Training job executing algorithm. |
| RUNNING | InProgress | Uploading | SageMaker Training job uploading output artifacts after training completion. |
| SUCCEEDED | Completed | Completed | SageMaker Training job completed successfully. Output artifacts finished uploading. |
| FAILED | Failed | Failed | SageMaker Training job encountered an unrecoverable error. |
| FAILED | Stopped | Stopped | SageMaker Training job was manually stopped using StopTrainingJob. |
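
The table can also be read in the other direction, from a SageMaker AI status pair to the AWS Batch state. The following is a minimal sketch of that lookup. The names `SAGEMAKER_TO_BATCH` and `batch_status_for` are hypothetical, and the dictionary contains only the pairs listed in the table, so any other pair returns `None` instead of a guess.

```python
from typing import Dict, Optional, Tuple

# (primary status, secondary status) -> AWS Batch job state.
# Transcribed from the table; SUBMITTED and RUNNABLE are omitted because
# no SageMaker Training job exists yet in those states.
SAGEMAKER_TO_BATCH: Dict[Tuple[str, str], str] = {
    ("InProgress", "Pending"): "SCHEDULED",
    ("InProgress", "Downloading"): "STARTING",
    ("InProgress", "Training"): "RUNNING",
    ("InProgress", "Uploading"): "RUNNING",
    ("Completed", "Completed"): "SUCCEEDED",
    ("Failed", "Failed"): "FAILED",
    ("Stopped", "Stopped"): "FAILED",
}


def batch_status_for(primary: str, secondary: str) -> Optional[str]:
    """Return the AWS Batch state for a status pair, or None if unlisted."""
    return SAGEMAKER_TO_BATCH.get((primary, secondary))


if __name__ == "__main__":
    print(batch_status_for("InProgress", "Uploading"))  # RUNNING
    print(batch_status_for("Stopped", "Stopped"))       # FAILED
```

Note that the mapping is many-to-one: both the Training and Uploading secondary statuses appear as RUNNING, and both a failed and a manually stopped training job appear as FAILED, so the AWS Batch state alone does not distinguish them.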