Automatic scaling of Amazon SageMaker AI models

Amazon SageMaker AI supports automatic scaling (auto scaling) for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using. For more information about using per-instance metrics for scaling decisions, see Amazon SageMaker AI enhanced metrics for inference endpoints and Enhanced metrics for Amazon SageMaker AI endpoints.

Topics

Auto scaling policy overview
Auto scaling prerequisites
Configure model auto scaling with the console
Register a model
Define a scaling policy
Apply a scaling policy
Instructions for editing a scaling policy
Temporarily turn off scaling policies
Delete a scaling policy
Check the status of a scaling activity by describing scaling activities
Scale an endpoint to zero instances
Load testing your auto scaling configuration
Use CloudFormation to create a scaling policy
Update endpoints that use auto scaling
Delete endpoints configured for auto scaling
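
The register, define, and apply steps listed above can be sketched as request payloads for the Application Auto Scaling API, which SageMaker AI uses for model auto scaling. This is a minimal sketch, not a definitive implementation: the helper function names, the endpoint name `my-endpoint`, the variant name `AllTraffic`, the policy name, the capacity limits, the target value, and the cooldowns are all illustrative assumptions. The block only builds dictionaries and makes no AWS calls.

```python
from typing import Any, Dict

# Constants used by Application Auto Scaling for SageMaker AI endpoint variants.
SERVICE_NAMESPACE = "sagemaker"
SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Build the resource ID that identifies one production variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_request(
    endpoint_name: str, variant_name: str, min_capacity: int, max_capacity: int
) -> Dict[str, Any]:
    """Arguments for registering a variant as a scalable target (hypothetical helper)."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    return {
        "ServiceNamespace": SERVICE_NAMESPACE,
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": SCALABLE_DIMENSION,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def target_tracking_policy_request(
    endpoint_name: str,
    variant_name: str,
    policy_name: str,
    target_invocations_per_instance: float,
    scale_in_cooldown: int = 300,   # seconds; illustrative default
    scale_out_cooldown: int = 300,  # seconds; illustrative default
) -> Dict[str, Any]:
    """Arguments for a target tracking scaling policy (hypothetical helper)."""
    return {
        "PolicyName": policy_name,
        "ServiceNamespace": SERVICE_NAMESPACE,
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": SCALABLE_DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": scale_in_cooldown,
            "ScaleOutCooldown": scale_out_cooldown,
        },
    }


# Placeholder names and values, chosen only for illustration.
target = scalable_target_request("my-endpoint", "AllTraffic", 1, 4)
policy = target_tracking_policy_request(
    "my-endpoint", "AllTraffic", "invocations-target-tracking", 70.0
)
```

With the AWS SDK for Python installed and credentials configured, these dictionaries could be passed to `boto3.client("application-autoscaling")` as `register_scalable_target(**target)` followed by `put_scaling_policy(**policy)`. Building the payloads separately from the calls makes it possible to review or unit test a scaling configuration before applying it to a live endpoint.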
