GENREL05-BP01 Load-balance inference requests across all regions of availability

Inference to a foundation model may be available over a local or large area of availability. Verify that you have resources available across that area to service inference requests reliably regardless of where they are coming from.

Desired outcome: When implemented, this best practice improves the reliability of your generative AI workload by creating a highly available environment for serving inference requests.

Benefits of establishing this best practice: Scale horizontally to increase aggregate workload availability - Load-balanced inference requests across horizontally scaled infrastructure enable inference requests to be serviced evenly across a region of availability.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Use load balancing and multi-Region deployment strategies to distribute inference requests across multiple AWS Regions and Availability Zones. This helps maintain consistent performance and availability in the face of regional disruptions or network issues. Consider using Amazon Bedrock's cross-Region inference profiles to route requests to the nearest available endpoint. For self-hosted models on Amazon SageMaker AI, implement a multi-AZ deployment with an Amazon SageMaker AI Inference Endpoint configured for auto-scaling to automatically distribute and scale traffic across Regions.

This strategy provides improved reliability, reduced risk of single points of failure, and better geographic coverage for global users. Potential trade-offs include increased network latency and operational complexity.

Implementation steps

Configure Amazon Bedrock cross-Region inference profiles or deploy self-hosted models on Amazon SageMaker AI Inference Endpoints across multiple Availability Zones.
Set up an Amazon SageMaker AI Inference Endpoint with auto-scaling enabled to distribute traffic based on health and latency.
Implement health checks and automated failover to maintain availability.
Monitor performance metrics like latency, error rates, and throughput across Regions.

Resources

Related best practices:

Related documents:

Supported Regions and models for inference profiles

Related examples:

Getting Started with cross-Region inference in Amazon Bedrock

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Distributed availability

GENREL05-BP02 Replicate embedding data across all Regions of availability