Distributed availability

GENREL05: How do you distribute inference workloads over multiple regions of availability?

Generative AI applications range from simple prompt-response workflows against a single foundation model to advanced multi-agent orchestration. Every component of a generative AI workload must be able to serve the workload's region of availability. That region may be a single well-defined zone, or it may span large geographic areas. Architecting for this variability is a complex problem.

Best practices

GENREL05-BP01 Load-balance inference requests across all regions of availability
GENREL05-BP02 Replicate embedding data across all regions of availability
GENREL05-BP03 Verify that agent capabilities are available across all regions of availability
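
As a rough illustration of the first practice, the sketch below rotates inference requests across a list of regions and falls over to the next region when a call fails. It is a minimal sketch under stated assumptions, not an AWS API: `RegionBalancer`, `AllRegionsFailed`, `fake_invoke`, and the region names are all hypothetical, and the `call` argument stands in for whatever client code actually invokes a model in a given region.

```python
from typing import Callable, Dict, List


class AllRegionsFailed(Exception):
    """Raised when every configured region failed to serve the request."""


class RegionBalancer:
    """Hypothetical round-robin balancer with failover across regions."""

    def __init__(self, regions: List[str]) -> None:
        if not regions:
            raise ValueError("at least one region is required")
        self._regions = list(regions)
        self._next = 0  # index of the region that receives the next request

    def invoke(self, call: Callable[[str], str]) -> str:
        errors: Dict[str, Exception] = {}
        count = len(self._regions)
        start = self._next
        # Advance the pointer so the following request starts elsewhere.
        self._next = (self._next + 1) % count
        # Try the chosen region first, then each remaining region in order.
        for offset in range(count):
            region = self._regions[(start + offset) % count]
            try:
                return call(region)
            except Exception as exc:  # assumption: any error means "try the next region"
                errors[region] = exc
        raise AllRegionsFailed(errors)


def fake_invoke(region: str) -> str:
    """Stand-in for a real inference call; 'region-b' is simulated as down."""
    if region == "region-b":
        raise RuntimeError("unavailable")
    return f"served by {region}"


balancer = RegionBalancer(["region-a", "region-b", "region-c"])
results = [balancer.invoke(fake_invoke) for _ in range(4)]
```

In this sketch the second request is assigned to the unavailable region and is served by the next one in the list instead. A production design would more likely rely on managed routing and health checks than on client-side rotation, and the same availability concern applies to the embedding data and agent capabilities covered by the other two practices.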
