

# Service environments for AWS Batch
<a name="service-environments"></a>

Service environments enable AWS Batch to integrate with SageMaker AI. A service environment contains the SageMaker AI specific configuration parameters required for AWS Batch to submit and manage SageMaker Training jobs while providing AWS Batch's queuing, scheduling, and priority management capabilities.

With service environments, data scientists and ML engineers can submit SageMaker Training jobs with priorities to service job queues. This integration eliminates the need for manual coordination of ML workloads, prevents accidental overspending, and improves resource utilization across your organization's machine learning workflows.

**Topics**
+ [What are service environments in AWS Batch](what-are-service-environments.md)
+ [Service environment states and lifecycle in AWS Batch](service-environment-states.md)
+ [Create a service environment in AWS Batch](create-service-environments.md)
+ [Update a service environment in AWS Batch](updating-service-environments.md)
+ [Delete a service environment in AWS Batch](deleting-service-environments.md)

# What are service environments in AWS Batch
<a name="what-are-service-environments"></a>

A service environment is an AWS Batch resource that contains the configuration parameters required to integrate AWS Batch with SageMaker AI. Service environments enable AWS Batch to submit and manage SageMaker Training jobs while providing AWS Batch's queuing, scheduling, and priority management capabilities.

Service environments address common challenges that data science teams face when managing machine learning workloads. Organizations often limit the number of instances available for training models to prevent accidental overspending, meet budget constraints, save costs with reserved instances, or use specific instance types for workloads. However, data scientists may want to run more workloads concurrently than is possible with their allocated instances, requiring manual coordination to decide which workloads run when.

This coordination challenge impacts organizations of all sizes, from teams with just a few data scientists to large-scale operations. As organizations grow, the complexity increases, requiring more time to manage workload coordination and often necessitating infrastructure administrator involvement. These manual efforts waste time and reduce instance efficiency, resulting in real costs for customers.

With service environments, data scientists and ML engineers can submit SageMaker Training jobs with priorities to configurable queues, ensuring workloads run automatically without intervention as soon as resources are available. This integration leverages AWS Batch's extensive queuing and scheduling capabilities, enabling customers to customize their queuing and scheduling policies to match their organization's goals.

## How service environments work with other AWS Batch components
<a name="service-environment-integration"></a>

Service environments integrate with other AWS Batch components to enable SageMaker Training job queuing:
+ **Job queues** - Service environments are associated with job queues so that the queues can process service jobs for SageMaker Training jobs.
+ **Service jobs** - When you submit a service job to a queue associated with a service environment, AWS Batch uses the environment's configuration to submit the corresponding SageMaker Training job.
+ **Scheduling policies** - Service environments work with AWS Batch scheduling policies to prioritize and manage the execution order of SageMaker Training jobs.

This integration allows you to leverage AWS Batch's mature queuing and scheduling capabilities while maintaining the full functionality and flexibility of SageMaker Training jobs.

## Best practices for service environments
<a name="service-environment-best-practices"></a>

Service environments provide capabilities for managing SageMaker Training jobs at scale. Following these best practices helps you optimize cost, performance, and operational efficiency while avoiding common configuration issues that can impact your machine learning workflows.

When planning service environment capacity, consider the specific quotas and limits that apply to SageMaker Training job queuing. Each service environment has a maximum capacity limit expressed in number of instances, which directly controls how many SageMaker Training jobs can run concurrently. Understanding these limits helps prevent resource contention and ensures predictable job execution times. 
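The relationship between the instance limit and concurrency can be illustrated with a little arithmetic. The following is a simplified sketch, not the AWS Batch scheduler: it assumes strict first-in, first-out admission, and the `admit_jobs` function and its inputs are hypothetical names introduced only for this example.

```python
def admit_jobs(max_capacity, requested_instances):
    """Admit queued jobs in order until the next job would exceed max_capacity.

    max_capacity: the service environment's instance limit.
    requested_instances: instances requested by each queued job, in queue order.
    Returns (indexes of admitted jobs, unused instance capacity).
    """
    admitted = []
    used = 0
    for index, needed in enumerate(requested_instances):
        if used + needed > max_capacity:
            break  # assumption: the job at the head of the queue blocks later jobs
        admitted.append(index)
        used += needed
    return admitted, max_capacity - used


# With a limit of 10 instances, two 4-instance jobs run and the third waits.
running, spare = admit_jobs(10, [4, 4, 4, 1])
```

In this toy model the 1-instance job also waits, even though 2 instances are free. Whether a real queue behaves this way depends on its scheduling policy.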

Optimal service environment performance depends on understanding the unique characteristics of SageMaker Training job scheduling. Unlike traditional containerized jobs, service jobs transition through a `SCHEDULED` state while SageMaker AI acquires and provisions the necessary training instances. This means job start times can vary significantly based on instance availability and regional capacity. 

**Important**  
Service environments have quotas that can limit how far you can scale SageMaker Training workloads. You can create up to 50 service environments per account, and each job queue supports only one associated service environment. Additionally, the service request payload for an individual job is limited to 10 KiB, and the `SubmitServiceJob` API is limited to 5 transactions per second per account. Account for these limits during capacity planning to avoid unexpected scaling constraints.
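A client can check the payload size and pace its submissions before calling the API. The following is a minimal sketch under stated assumptions: it measures the payload as UTF-8 encoded JSON, which might differ from how the service measures it, and `payload_size_ok` and `SubmitThrottle` are hypothetical helpers, not part of any AWS SDK.

```python
import json
import time

MAX_PAYLOAD_BYTES = 10 * 1024  # 10 KiB payload limit stated above
MAX_SUBMITS_PER_SECOND = 5     # SubmitServiceJob rate limit stated above


def payload_size_ok(payload):
    """Return True if the JSON-encoded payload fits within the 10 KiB limit."""
    encoded = json.dumps(payload).encode("utf-8")
    return len(encoded) <= MAX_PAYLOAD_BYTES


class SubmitThrottle:
    """Client-side pacing that spaces calls at least 1/rate seconds apart."""

    def __init__(self, rate=MAX_SUBMITS_PER_SECOND,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next submission is allowed, then reserve a slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Call `wait()` immediately before each submission. Because the limit applies per account, several clients that share an account would still need to coordinate with each other.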

Effective monitoring of service environments requires attention to both AWS Batch and SageMaker AI service metrics. [Job state transitions](service-job-status.md) provide valuable insights into system performance, particularly the time spent in `SCHEDULED` state, which indicates capacity availability patterns. Service environments maintain their own lifecycle states similar to compute environments, transitioning through `CREATING`, `VALID`, `INVALID`, and `DELETING` states that should be monitored for operational health. Organizations with mature monitoring practices typically track queue depth, job completion rates, and instance utilization patterns to optimize their service environment configurations over time.
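One way to track time spent in `SCHEDULED` is to record each job state change and sum the intervals. The following sketch assumes you already collect `(state, entered_at)` pairs in chronological order, for example from your own event log. The `time_in_state` function and the sample state names other than `SCHEDULED` are illustrative assumptions.

```python
from datetime import datetime, timedelta


def time_in_state(transitions, state):
    """Sum the time spent in `state`.

    transitions: chronologically ordered (state, entered_at) pairs. The time
    in a state is measured up to the moment the next state was entered, so
    the final entry contributes nothing.
    """
    total = timedelta(0)
    for (current, entered), (_, left) in zip(transitions, transitions[1:]):
        if current == state:
            total += left - entered
    return total


# Hypothetical history for one job.
start = datetime(2025, 1, 1, 12, 0, 0)
history = [
    ("RUNNABLE", start),
    ("SCHEDULED", start + timedelta(minutes=1)),
    ("RUNNING", start + timedelta(minutes=6)),
    ("SUCCEEDED", start + timedelta(minutes=30)),
]
scheduled_wait = time_in_state(history, "SCHEDULED")
```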

# Service environment states and lifecycle in AWS Batch
<a name="service-environment-states"></a>

Service environments maintain lifecycle states that indicate their current operational status and readiness to process SageMaker Training jobs. Understanding these states helps you monitor service environment health, troubleshoot configuration issues, and ensure reliable job processing. The state management system follows established patterns from compute environments while accommodating the unique requirements of SageMaker Training job integration.

Service environment states are managed automatically by AWS Batch based on configuration validation, resource availability, and operational health checks. Unlike compute environments that manage physical infrastructure, service environments focus on configuration validation and integration readiness with SageMaker AI services. The state transitions provide visibility into whether your service environment can successfully submit and manage SageMaker Training jobs.

## Service environment state definitions
<a name="service-environment-state-definitions"></a>

Service environments can be in one of four possible states that indicate their current operational status and readiness to process SageMaker Training jobs. Each state represents a specific phase in the service environment lifecycle, from initial creation through operational readiness to eventual deletion. The following table describes each state and its meaning:


| State | Description | 
| --- | --- | 
| CREATING |  The initial state when you create a service environment. During this state, AWS Batch validates the configuration parameters and establishes integration with SageMaker AI services. The service environment cannot process jobs, and any job queues associated with it will not accept service job submissions. The creation process typically completes within a few seconds for properly configured service environments.  | 
| VALID |  The operational state indicating that the service environment has passed all configuration validation checks and is ready to process SageMaker Training jobs. This state indicates that the service environment configuration is correct, all required permissions are in place, and AWS Batch can successfully submit jobs to SageMaker AI on your behalf. Service environments spend most of their operational lifecycle in this state.  | 
| INVALID |  A state indicating that the service environment has encountered a configuration or permissions issue that prevents it from processing SageMaker Training jobs. Job queues associated with invalid service environments cannot process new service job submissions until the underlying issues are resolved.  | 
| DELETING |  The state that occurs when you request deletion of a service environment. During this state, AWS Batch ensures that no active SageMaker Training jobs are associated with the environment and performs necessary cleanup operations. Service environments in this state cannot process new job submissions, and the deletion process completes once all associated resources are properly cleaned up.  | 

## Service environment state transitions
<a name="service-environment-state-transitions"></a>

Service environment state transitions occur automatically based on configuration changes, validation results, and operational health monitoring. The AWS Batch service continuously monitors service environment health and updates states accordingly. Understanding these transitions helps you anticipate when configuration changes will take effect and how to resolve issues that cause invalid states.

After successful creation and validation, service environments transition from `CREATING` to `VALID`. This transition confirms that all configuration parameters are correct, required IAM permissions are properly configured, and the service environment can successfully integrate with SageMaker AI services. Once in the `VALID` state, associated job queues can begin processing service job submissions.

Service environments transition from `VALID` to `INVALID` when configuration validation fails or when dependencies become unavailable. This can occur due to IAM role modifications, capacity limit changes that violate quotas, or external resource modifications that affect the service environment's ability to function. The status reason field provides specific details about what caused the invalid state.

Service environments can transition back to `VALID` from `INVALID` once the underlying issues are resolved. This might involve updating IAM permissions, correcting capacity configurations, or restoring access to required AWS resources. The transition typically occurs automatically once AWS Batch detects that the configuration issues have been addressed.
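The transitions described in this section can be summarized as a small lookup table, which can be useful when writing monitoring or alerting logic. This is a sketch of the documented behavior rather than an AWS API. The transitions into `DELETING` are an assumption, because the text doesn't say which states allow deletion.

```python
# Allowed transitions, keyed by the current state.
ALLOWED_TRANSITIONS = {
    "CREATING": {"VALID"},
    "VALID": {"INVALID", "DELETING"},    # DELETING edge is an assumption
    "INVALID": {"VALID", "DELETING"},    # DELETING edge is an assumption
    "DELETING": set(),
}


def can_transition(current, target):
    """Return True if the sketch allows moving from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def can_process_jobs(state):
    """Only a VALID service environment is ready to process service jobs."""
    return state == "VALID"
```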

# Create a service environment in AWS Batch
<a name="create-service-environments"></a>

Before you can run SageMaker Training jobs in AWS Batch, you need to create a service environment. You can create a service environment that contains the configuration parameters required for AWS Batch to integrate with SageMaker AI services and submit SageMaker Training jobs on your behalf.

## Prerequisites
<a name="create-service-environments-prerequisites"></a>

Before creating a service environment, ensure you have:
+ **IAM permissions** – Permissions to create and manage service environments. For more information, see [AWS Batch IAM policies, roles, and permissions](IAM_policies.md).

------
#### [ Create a service environment (AWS Console) ]

Use the AWS Batch console to create a service environment through the web interface.

**To create a service environment**

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Environments**.

1. Choose **Create environment**, and then select **Service environment**.

1. For **Service environment configuration**, choose SageMaker AI.

1. For **Name**, enter a unique name for your service environment. Valid characters are a-z, A-Z, 0-9, hyphens (-), and underscores (\_).

1. For **Max number of instances**, enter the maximum number of concurrent training instances.

1. (Optional) Add tags by choosing **Add tag** and entering key-value pairs.

1. Choose **Next**.

1. Review the details of the new service environment and choose **Create service environment**.
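If you generate service environment names in scripts, you can check them against the character rules from the **Name** step before you create the environment. The following is a minimal sketch. The `is_valid_name` helper is hypothetical, and it doesn't check name length or uniqueness.

```python
import re

# Letters, digits, hyphens, and underscores, as listed in the Name step.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name):
    """Return True if `name` is non-empty and uses only the allowed characters."""
    return NAME_PATTERN.fullmatch(name) is not None
```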

------
#### [ Create a service environment (AWS CLI) ]

Use the `create-service-environment` command to create a service environment with the AWS CLI.

**To create a service environment**

1. Create a service environment with the basic required parameters:

   ```
   aws batch create-service-environment \
       --service-environment-name my-sagemaker-service-env \
       --service-environment-type SAGEMAKER_TRAINING \
       --capacity-limits capacityUnit=NUM_INSTANCES,maxCapacity=10
   ```

1. (Optional) Create a service environment with tags:

   ```
   aws batch create-service-environment \
       --service-environment-name my-sagemaker-service-env \
       --service-environment-type SAGEMAKER_TRAINING \
       --capacity-limits capacityUnit=NUM_INSTANCES,maxCapacity=10 \
       --tags team=data-science,project=ml-training
   ```

1. Verify the service environment was created successfully:

   ```
   aws batch describe-service-environments \
       --service-environments my-sagemaker-service-env
   ```

The service environment appears in the Environments list with a `CREATING` state. When creation completes successfully, the state changes to `VALID`, and you can associate the service environment with a service job queue to start processing jobs.
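In automation, you might poll until the service environment becomes `VALID` instead of checking manually. The following sketch takes the describe call as a parameter so that it runs without AWS credentials. It assumes the call accepts a `serviceEnvironments` list and returns entries with a `status` field. Confirm both against the SDK that you use. The `wait_for_status` function is a hypothetical helper.

```python
import time


def wait_for_status(describe, name, target="VALID",
                    timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll `describe` until the environment reaches `target`.

    Raises RuntimeError if the environment becomes INVALID, and TimeoutError
    if `target` isn't reached after roughly `timeout` seconds of waiting.
    """
    waited = 0.0
    while True:
        response = describe(serviceEnvironments=[name])
        environments = response.get("serviceEnvironments", [])
        status = environments[0].get("status") if environments else None
        if status == target:
            return status
        if status == "INVALID":
            raise RuntimeError(f"Service environment {name} is INVALID")
        if waited >= timeout:
            raise TimeoutError(f"{name} did not reach {target} (last status: {status})")
        sleep(interval)
        waited += interval


# Assumed usage with boto3 (requires credentials; not run here):
#   import boto3
#   batch = boto3.client("batch")
#   wait_for_status(batch.describe_service_environments, "my-sagemaker-service-env")
```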

------

# Update a service environment in AWS Batch
<a name="updating-service-environments"></a>

You can update a service environment to modify its capacity limits, change its operational state, or update resource tags. Service environment updates allow you to adjust capacity as your SageMaker Training workload requirements change or modify operational settings without recreating the environment. Before updating a service environment, understand which parameters can be modified and the impact of changes on running jobs.

You can change the Capacity limits, State, or Tags of a service environment.

------
#### [ Update a service environment (AWS Console) ]

Use the AWS Batch console to update a service environment through the web interface.

**To update a service environment**

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Environments**.

1. Choose the **Service environment** tab.

1. Choose the service environment to update.

1. Choose **Actions**, then choose either:
   + **State** - Choose **Enable** or **Disable** to change the state.
   + **Capacity limit** - Modify the **Max number of instances**.

1. Choose **Save changes** to apply the changes.

The service environment updates immediately. Check the environment details to confirm the changes were applied successfully. If you disabled the service environment, associated job queues will stop processing new service job submissions until you re-enable it.

------
#### [ Update a service environment (AWS CLI) ]

Use the `update-service-environment` command to modify a service environment with the AWS CLI.

**To update service environment capacity limits**

1. Update the capacity limit for a service environment:

   ```
   aws batch update-service-environment \
       --service-environment my-sagemaker-service-env \
       --capacity-limits capacityUnit=NUM_INSTANCES,maxCapacity=20
   ```

1. Verify the update was applied successfully:

   ```
   aws batch describe-service-environments \
       --service-environments my-sagemaker-service-env
   ```

**To update service environment state**

1. Disable a service environment to stop processing new jobs:

   ```
   aws batch update-service-environment \
       --service-environment my-sagemaker-service-env \
       --state DISABLED
   ```

1. Re-enable a service environment to resume processing:

   ```
   aws batch update-service-environment \
       --service-environment my-sagemaker-service-env \
       --state ENABLED
   ```

Service environment updates take effect immediately. Monitor the service environment state to ensure updates complete successfully before submitting new jobs.
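If you script updates, you can build the request separately from sending it so that mistakes are caught before the call. The following sketch mirrors the CLI flags above as camelCase request keys, which is how the AWS CLI generally maps options to API parameters. Treat the exact request shape as an assumption to verify. The `build_update_request` function is a hypothetical helper.

```python
def build_update_request(name, max_instances=None, state=None):
    """Build keyword arguments for an update call on a service environment.

    At least one of max_instances or state must be given.
    """
    if max_instances is None and state is None:
        raise ValueError("Specify max_instances, state, or both")
    request = {"serviceEnvironment": name}
    if max_instances is not None:
        if max_instances < 0:
            raise ValueError("max_instances must not be negative")
        request["capacityLimits"] = [
            {"capacityUnit": "NUM_INSTANCES", "maxCapacity": max_instances}
        ]
    if state is not None:
        if state not in ("ENABLED", "DISABLED"):
            raise ValueError("state must be ENABLED or DISABLED")
        request["state"] = state
    return request
```

You could pass the result to an SDK client as keyword arguments, for example `client.update_service_environment(**request)` in boto3, assuming that method is available in your SDK version.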

------

# Delete a service environment in AWS Batch
<a name="deleting-service-environments"></a>

You can delete a service environment when it's no longer needed for your SageMaker Training jobs. Deleting a service environment removes the configuration and prevents further job submissions. Before deleting a service environment, ensure that no active SageMaker Training jobs depend on it and that no job queues are associated with the service environment.

**Important**  
Service environment deletion is irreversible. Once deleted, you cannot recover the service environment or its configuration. If you need similar functionality in the future, you must create a new service environment with the required settings. Consider disabling the service environment instead of deletion if you may need to reactivate it later.

**Note**  
Deleting all service environments in your account does not automatically remove the service-linked role created for AWS Batch and SageMaker AI integration. The service-linked role remains available for future service environment creation. If you want to remove the service-linked role, you must delete it separately using IAM after ensuring no service environments exist in your account.

## Deletion prerequisites
<a name="service-environment-deletion-prerequisites"></a>

Before you can delete a service environment, you must disassociate any service job queue and then disable the service environment.

**Before deleting a service environment:**
+ **Check active jobs** - Ensure no SageMaker Training jobs are currently running through the service environment.
+ **Review job queues** - Identify job queues associated with the service environment and either associate the job queue with a different service environment or disable and delete the job queue.

**Job queue management:** Job queues that were associated with a deleted service environment can still exist but cannot process service jobs. You should either delete unused job queues or associate them with a different service environment before deleting the original service environment.

------
#### [ Delete a service environment (AWS Console) ]

Use the AWS Batch console to delete a service environment through the web interface.

**To delete a service environment**

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Environments**.

1. Choose the **Service environment** tab and then choose a service environment.

1. If the service environment is enabled, choose **Actions** and then **Disable**.

1. Once the service environment is disabled, choose **Actions** and then **Delete**.

1. In the confirmation dialog, choose **Confirm**.

The service environment shows a `DELETING` state while deletion occurs. Once deletion completes, the service environment disappears from the Environments list.

------
#### [ Delete a service environment (AWS CLI) ]

Use the `delete-service-environment` command to remove a service environment with the AWS CLI.

**To delete a service environment**

1. Check for associated job queues with the service environment:

   ```
   aws batch describe-job-queues
   ```

   If any job queues are associated with the service environment, you can either [disassociate the job queue](https://docs.aws.amazon.com/batch/latest/APIReference/API_UpdateJobQueue.html) from the service environment and associate it with a different service environment, or delete the job queue.

1. Disable the service environment:

   ```
   aws batch update-service-environment \
       --service-environment my-sagemaker-service-env \
       --state DISABLED
   ```

1. Delete the service environment:

   ```
   aws batch delete-service-environment \
       --service-environment my-sagemaker-service-env
   ```

1. Monitor the deletion process:

   ```
   aws batch describe-service-environments \
       --service-environments my-sagemaker-service-env
   ```

The service environment transitions to `DELETING` state during the deletion process. Once deletion completes, the service environment is no longer listed in describe operations. Associated job queues remain but cannot process service jobs until associated with a different service environment.
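The output of `describe-job-queues` can be long, so you might filter it for queues that reference the service environment before you delete it. The following sketch works on the parsed JSON output. It assumes each job queue lists its environments under `serviceEnvironmentOrder`, with a `serviceEnvironment` value that is either the name or an ARN ending in `/` followed by the name. Verify these field names against your actual output. The `queues_using_environment` function is a hypothetical helper.

```python
def queues_using_environment(job_queues_response, environment_name):
    """Return the names of job queues that reference the service environment."""
    matches = []
    for queue in job_queues_response.get("jobQueues", []):
        for entry in queue.get("serviceEnvironmentOrder", []):
            reference = entry.get("serviceEnvironment", "")
            # Match either a bare name or an ARN whose resource part ends in the name.
            if reference == environment_name or reference.endswith("/" + environment_name):
                matches.append(queue["jobQueueName"])
                break
    return matches
```

If the result is empty, no queue in the response references the environment. Responses can be paginated, so check every page before you proceed with deletion.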

------