MLCOST06-BP03 Monitor endpoint usage and right-size the instance fleet - Machine Learning Lens

Use efficient compute resources to run models in production. Monitor your endpoint usage and right-size the instance fleet. Use automatic scaling (auto scaling) for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload.

Desired outcome: You have optimized SageMaker AI endpoints that automatically adjust to workload demands while maintaining performance and minimizing costs. Your model deployment uses appropriately sized instances that are neither over-provisioned nor under-provisioned, and you have continuous monitoring in place to inform scaling decisions.

Common anti-patterns:

  • Provisioning static endpoint configurations that remain unchanged regardless of workload fluctuations.

  • Over-provisioning instances "just to be safe" without analyzing actual resource utilization.

  • Ignoring endpoint metrics and failing to adjust resource allocation based on usage patterns.

  • Deploying resources across different Availability Zones without consideration for data transfer costs.

  • Using default instance types without evaluating performance requirements.

Benefits of establishing this best practice:

  • Reduced compute costs by reducing over-provisioned resources.

  • Improved performance during peak usage periods through automatic scaling.

  • Higher resource utilization through right-sizing.

  • Increased availability by distributing instances across Availability Zones.

  • Better understanding of model usage patterns to inform future optimizations.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Monitoring and optimizing your SageMaker AI endpoints is essential for maintaining cost-efficiency while providing high availability and performance. With CloudWatch monitoring and auto scaling in place, your deployments use only the resources they need, when they need them. Start by establishing baseline metrics for your endpoints to understand typical usage patterns and resource requirements. Then implement auto scaling policies based on these metrics to automatically adjust capacity in response to changing workloads.

For production environments, distribute your endpoint deployment across multiple Availability Zones to maintain high availability. Consider the placement of related resources, such as data storage solutions like FSx for Lustre, to minimize cross-AZ data transfer costs and optimize performance. Regularly reviewing your metrics and scaling configurations helps you continuously refine your deployment for optimal cost and performance.

Implementation steps

  1. Monitor Amazon SageMaker AI endpoints with Amazon CloudWatch. You can monitor Amazon SageMaker AI using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. Use metrics such as CPUUtilization, GPUUtilization, MemoryUtilization, and DiskUtilization to view your endpoint's resource utilization and make informed decisions about right-sizing your endpoint instances. Set up CloudWatch dashboards to visualize these metrics over time and identify patterns in resource usage.
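As a concrete starting point, the sketch below (Python with boto3; the endpoint and variant names are placeholders) builds a GetMetricStatistics query for a day of endpoint CPU data. Instance-level endpoint metrics are published in the `/aws/sagemaker/Endpoints` CloudWatch namespace, keyed by endpoint and variant name.

```python
import datetime

def utilization_query(endpoint_name, variant_name, hours=24):
    """Build a CloudWatch GetMetricStatistics request for endpoint CPU use.

    Instance-level endpoint metrics such as CPUUtilization and
    MemoryUtilization live in the /aws/sagemaker/Endpoints namespace.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - datetime.timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,                       # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }

# Against a live account (requires boto3 and AWS credentials):
#   import boto3
#   stats = boto3.client("cloudwatch").get_metric_statistics(
#       **utilization_query("my-endpoint", "AllTraffic"))
#   for p in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
#       print(p["Timestamp"], p["Average"], p["Maximum"])
```

Plotting the Average and Maximum series side by side in a dashboard makes over-provisioning (low averages) and saturation (peaks pinned near the limit) easy to spot.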

  2. Implement CloudWatch alarms for proactive monitoring. Configure alarms for key metrics that can indicate when an endpoint is under-provisioned or over-provisioned. For example, set up alarms that trigger when CPU utilization consistently exceeds 80% (indicating potential under-provisioning) or remains below 20% (indicating over-provisioning). These alarms can notify your team to take action or run automated responses through AWS Lambda functions.
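A hedged sketch of such an alarm pair follows; the endpoint name, variant name, and SNS topic ARN are placeholders, and the 80%/20% thresholds mirror the example above rather than an AWS-prescribed value.

```python
def cpu_alarm(endpoint_name, variant_name, sns_topic_arn,
              threshold=80.0, comparison="GreaterThanThreshold"):
    """Build a PutMetricAlarm request for sustained high (or low) CPU.

    Six 5-minute evaluation periods means the condition must hold for
    30 minutes before the alarm fires, filtering out short spikes.
    Note: on multi-core instances, CPUUtilization is summed across
    cores and can exceed 100%, so scale the threshold accordingly.
    """
    suffix = "high" if comparison.startswith("Greater") else "low"
    return {
        "AlarmName": f"{endpoint_name}-cpu-{suffix}",
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 6,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": [sns_topic_arn],  # SNS can fan out to email or Lambda
    }

# Under-provisioning (>80%) and over-provisioning (<20%) alarms:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**cpu_alarm("my-endpoint", "AllTraffic", TOPIC_ARN))
#   cw.put_metric_alarm(**cpu_alarm("my-endpoint", "AllTraffic", TOPIC_ARN,
#                                   threshold=20.0,
#                                   comparison="LessThanThreshold"))
```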

  3. Configure auto scaling for SageMaker AI endpoints. Amazon SageMaker AI supports auto scaling that monitors your workloads and dynamically adjusts capacity to maintain steady performance at the lowest possible cost. When workload increases, auto scaling brings more instances online. When workload decreases, auto scaling removes unnecessary instances, which can reduce compute costs. Define appropriate scaling policies based on your application's requirements, including minimum and maximum instance counts, target metrics, and scale-in and scale-out cooldown periods.
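SageMaker endpoint auto scaling is configured through the Application Auto Scaling API. The sketch below registers a variant as a scalable target and attaches a target-tracking policy on the predefined `SageMakerVariantInvocationsPerInstance` metric; the capacity limits, target value, and cooldowns are illustrative and should come from your own load testing.

```python
def scaling_target(endpoint_name, variant_name, min_capacity=1, max_capacity=4):
    """RegisterScalableTarget request for a SageMaker endpoint variant."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }

def scaling_policy(endpoint_name, variant_name, invocations_per_instance=70.0):
    """Target-tracking PutScalingPolicy request: scale so that each
    instance handles roughly the target number of invocations/minute."""
    return {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 600,   # scale in slowly to avoid thrashing
            "ScaleOutCooldown": 300,  # scale out faster to protect latency
        },
    }

# To apply (requires boto3 and credentials):
#   import boto3
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scaling_target("my-endpoint", "AllTraffic"))
#   aas.put_scaling_policy(**scaling_policy("my-endpoint", "AllTraffic"))
```

A longer scale-in cooldown than scale-out cooldown is a common choice: removing capacity too eagerly risks latency spikes if traffic returns.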

  4. Distribute instances across Availability Zones. SageMaker AI automatically attempts to distribute your instances across Availability Zones, so deploy multiple instances for each production endpoint to provide high availability. If you're using a VPC, configure at least two subnets in different Availability Zones to allow SageMaker AI to distribute your instances across those zones, providing resilience against zone failures.
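The two prerequisites above (two or more instances, subnets in two or more zones) can be enforced with small request-builder helpers like these; the variant name and validation rules are illustrative, not part of the SageMaker API itself.

```python
def model_vpc_config(security_group_ids, subnet_ids):
    """VpcConfig for CreateModel. Subnets in at least two Availability
    Zones let SageMaker spread endpoint instances across zones."""
    if len(subnet_ids) < 2:
        raise ValueError("supply subnets in at least two Availability Zones")
    return {"SecurityGroupIds": security_group_ids, "Subnets": subnet_ids}

def production_variant(model_name, instance_type, instance_count=2):
    """A ProductionVariant for CreateEndpointConfig. Multi-AZ placement
    only helps once there are at least two instances to distribute."""
    if instance_count < 2:
        raise ValueError("use at least two instances for production endpoints")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
    }
```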

  5. Optimize resource placement for data access. When using Amazon FSx for Lustre as an input data source for SageMaker AI, deploy FSx for Lustre and SageMaker AI in the same Availability Zone to avoid cross-AZ data transfer costs. This configuration removes the initial Amazon S3 download step, accelerating ML training jobs while minimizing costs. Consider similar placement strategies for other related resources to optimize performance and cost.
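Because an FSx for Lustre file system lives in a single subnet, the subnet you pass to CreateFileSystem fixes its Availability Zone; placing SageMaker jobs in that same zone avoids the cross-AZ path. A sketch of the request, with illustrative capacity and deployment-type values:

```python
def lustre_file_system(subnet_id, storage_capacity_gib=1200, s3_import_path=None):
    """CreateFileSystem request for FSx for Lustre.

    The single subnet determines the AZ; run SageMaker training in the
    same AZ to avoid cross-AZ data transfer charges. ImportPath links
    the file system to an S3 bucket so data is lazy-loaded from S3.
    """
    req = {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": storage_capacity_gib,
        "SubnetIds": [subnet_id],  # one subnet -> one AZ
        "LustreConfiguration": {"DeploymentType": "SCRATCH_2"},
    }
    if s3_import_path:
        req["LustreConfiguration"]["ImportPath"] = s3_import_path
    return req

# To create it (requires boto3 and credentials):
#   import boto3
#   boto3.client("fsx").create_file_system(
#       **lustre_file_system("subnet-0abc", s3_import_path="s3://my-training-data"))
```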

  6. Regularly review and adjust instance types. Periodically evaluate whether your selected instance types are appropriate for your workload. SageMaker AI offers a variety of instance types optimized for different workload characteristics. Analyze your CloudWatch metrics to determine if you could achieve better price-performance by switching to a different instance family, such as compute-optimized, memory-optimized, or GPU instances.
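A periodic review can be partly automated with a simple heuristic over the collected metrics. The thresholds below are illustrative (they echo the 80%/20% alarm bands earlier in this section), not an AWS recommendation:

```python
def right_size_hint(avg_cpu, peak_cpu, low=20.0, high=80.0):
    """Rough right-sizing hint from average and peak CPU utilization (%).

    Returns "downsize" when even peaks stay idle, "upsize" when the
    average shows sustained pressure, and "keep" otherwise. Apply the
    same idea to memory and GPU metrics before changing instance type.
    """
    if peak_cpu < low:
        return "downsize"   # consistently idle: smaller or fewer instances
    if avg_cpu > high:
        return "upsize"     # sustained pressure: larger or more instances
    return "keep"
```

Feeding this the Average and Maximum datapoints retrieved from CloudWatch gives a first-pass shortlist of endpoints worth a manual price-performance review across instance families.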

  7. Use inference optimization techniques. Implement model optimization techniques such as Amazon SageMaker AI Neo to automatically optimize models for your target hardware, improving performance and potentially allowing you to use smaller instance types. Consider techniques like model compression, quantization, and batching to improve inference efficiency and throughput.
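For the Neo path, compilation is requested through CreateCompilationJob. The sketch below builds such a request; the framework, input shape, target device, and all names/ARNs/URIs are placeholder assumptions to be replaced with your model's actual values.

```python
def neo_compilation_job(job_name, role_arn, model_s3_uri, output_s3_uri):
    """Sketch of a CreateCompilationJob request for SageMaker Neo.

    Illustrative values: a PyTorch model with a single 1x3x224x224
    input tensor, compiled for a c5-class target.
    """
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,                          # IAM role for the job
        "InputConfig": {
            "S3Uri": model_s3_uri,                    # packaged model artifact
            "DataInputConfig": '{"input0": [1, 3, 224, 224]}',
            "Framework": "PYTORCH",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,        # compiled model output
            "TargetDevice": "ml_c5",
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }

# To submit (requires boto3 and credentials):
#   import boto3
#   boto3.client("sagemaker").create_compilation_job(
#       **neo_compilation_job("resnet-neo", ROLE_ARN,
#                             "s3://my-bucket/model.tar.gz",
#                             "s3://my-bucket/compiled/"))
```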

  8. Use SageMaker AI Inference Recommender. Inference Recommender benchmarks your model across candidate instance types, with support for multi-model endpoints, to produce instance selection and cost optimization recommendations.

  9. Implement specialized instance types for generative AI models. For large language models and other generative AI workloads, use specialized instances like AWS Inferentia or AWS Trainium, which are designed specifically for machine learning inference and training. These instances can provide significant cost savings compared to general-purpose GPU instances when running transformer-based models. Consider Amazon Bedrock for fully managed generative AI capabilities with built-in scaling.

Resources

Related documents: