
GENCOST02-BP02 Optimize resource consumption to minimize hosting costs

Hosting a foundation model for inference involves many choices, each of which affects cost. Each of these cost dimensions can be optimized to reduce spend while still meeting performance goals.

Desired outcome: When implemented, this best practice balances cost against performance for self-hosted foundation models, resulting in the lowest-cost hosting configuration that still meets the workload's performance goals.

Benefits of establishing this best practice: Hosting infrastructure is right-sized to the workload's demand profile, reducing cost without compromising performance goals.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Self-hosted model infrastructure should be optimized for the model used and the workload's usage pattern. Right-size the inference endpoint to the smallest instance that allows you to meet performance goals. For workloads with predictable usage patterns, it may be appropriate to shut down the hosting instance outside of peak hours and restart it when demand returns. You may also consider purchasing Amazon EC2 Reserved Instances or Savings Plans to further reduce the cost of a hosted model endpoint. Before committing to a compute reservation, use Amazon SageMaker AI Inference Recommender to evaluate whether you are using the ideal inference endpoint type, generation, and size.
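As a starting point, you can launch a Default Inference Recommender job with the AWS SDK for Python (boto3). The following is a minimal sketch, assuming a model already registered in SageMaker AI under the hypothetical name my-llm-model and a hypothetical execution role ARN; substitute your own resources.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical model and role; replace with your own resources.
MODEL_NAME = "my-llm-model"
ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# A Default job load-tests candidate instance types and returns
# cost and performance recommendations for each.
sagemaker.create_inference_recommendations_job(
    JobName="llm-endpoint-right-sizing",
    JobType="Default",
    RoleArn=ROLE_ARN,
    InputConfig={"ModelName": MODEL_NAME},
)

# When the job completes, the recommendations list each candidate's
# instance type alongside measured latency, throughput, and cost.
job = sagemaker.describe_inference_recommendations_job(
    JobName="llm-endpoint-right-sizing"
)
print(job["Status"])
```

The resulting recommendations can then be compared against your latency and throughput targets before committing to a reservation.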

In SageMaker AI HyperPod, with either Amazon EKS or Slurm orchestration, use the system's advanced task governance capabilities and flexible training plans to dynamically allocate compute resources based on priority and demand, reducing costs through improved utilization.

For EKS-based HyperPod, implement managed Kubernetes orchestration with HyperPod task governance. Configure automated scaling policies, priority classes, and node selectors so that production workloads use cost-effective committed capacity while development tasks use On-Demand or Spot Instances when appropriate. Use the usage reporting feature for granular visibility into GPU, CPU, and Neuron core consumption at both the team and task level, enabling transparent cost attribution and removing guesswork from resource allocation.
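The sketch below illustrates the underlying Kubernetes primitives using the official Python client: a high-value priority class for production inference, and a development pod pinned to Spot capacity through a node selector. The class names, namespace, and container image are hypothetical, and HyperPod task governance layers its own scheduling policies on top of these primitives.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Hypothetical priority class: production traffic preempts dev workloads.
priority_class = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="prod-inference"),
    value=100000,
    description="Production inference on committed capacity",
)
client.SchedulingV1Api().create_priority_class(body=priority_class)

# Hypothetical dev pod steered onto Spot nodes via a node selector.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dev-experiment", namespace="dev"),
    spec=client.V1PodSpec(
        priority_class_name="dev-preemptible",  # assumed to already exist
        node_selector={"eks.amazonaws.com/capacityType": "SPOT"},
        containers=[
            client.V1Container(name="train", image="my-training-image:latest")
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="dev", body=pod)
```

The eks.amazonaws.com/capacityType label is set automatically on EKS managed node groups, which makes it a convenient selector for steering preemptible work onto Spot nodes.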

For Slurm-based HyperPod, use Slurm's native job scheduling and resource management features combined with HyperPod's auto-resume functionality to minimize wasted compute cycles during hardware failures, potentially reducing total training time in large clusters. Both orchestration options benefit from right-sizing strategies built on SageMaker AI HyperPod recipes, which provide pre-configured, benchmarked training stacks optimized for specific model architectures such as Llama and Mistral, delivering strong performance while minimizing resource waste.
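As one illustration of the Slurm-side levers, the following sketch submits a batch job with a hard time limit and requeue-on-failure, so that failed or runaway jobs release capacity rather than holding it. The script contents, partition name, and training command are hypothetical, and HyperPod's auto-resume behavior is configured separately from these generic Slurm directives.

```python
import subprocess
import textwrap

# Hypothetical batch script; partition, node count, and command will differ.
SBATCH_SCRIPT = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=llm-finetune
    #SBATCH --partition=gpu
    #SBATCH --nodes=4
    #SBATCH --time=12:00:00      # hard limit so stuck jobs release capacity
    #SBATCH --requeue            # requeue on node failure instead of dying
    srun python train.py --config recipe.yaml
""")

with open("train.sbatch", "w") as f:
    f.write(SBATCH_SCRIPT)

# Submit the job and print the assigned job ID.
result = subprocess.run(
    ["sbatch", "train.sbatch"], capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```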

Additionally, establish flexible training plans that set timeline and budget constraints, and allow HyperPod to automatically find the best combination of capacity blocks and create cost-optimized execution plans that avoid overspending on overprovisioned servers for training jobs.
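With a recent version of boto3 that includes the flexible training plan APIs, you can search for capacity offerings that fit a time window and then reserve one. This is a sketch under that assumption; the instance type, counts, dates, and plan name are placeholder values.

```python
import boto3
from datetime import datetime, timezone

sagemaker = boto3.client("sagemaker")

# Search for capacity offerings that fit the desired window and resources.
offerings = sagemaker.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",  # placeholder instance type
    InstanceCount=4,
    StartTimeAfter=datetime(2025, 7, 1, tzinfo=timezone.utc),
    EndTimeBefore=datetime(2025, 8, 1, tzinfo=timezone.utc),
    DurationHours=240,
    TargetResources=["hyperpod-cluster"],
)

# Reserve the first matching offering as a named training plan.
best = offerings["TrainingPlanOfferings"][0]
plan = sagemaker.create_training_plan(
    TrainingPlanName="llama-finetune-july",
    TrainingPlanOfferingId=best["TrainingPlanOfferingId"],
)
print(plan["TrainingPlanArn"])
```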

Inference workloads can be optimized using advanced techniques such as quantization or LoRA adaptation. These capabilities are available for certain models in Amazon Bedrock and for self-hosted models on Amazon SageMaker AI. They can further reduce resource consumption per inference request, lowering hosting and serving costs.
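For self-hosted models, a common open-source pattern combines 4-bit quantization (via the bitsandbytes integration in Hugging Face Transformers) with LoRA adapters (via the peft library) so that a large model fits on a smaller, cheaper instance. This is a minimal sketch; the model ID is illustrative, and neither library is specific to SageMaker AI.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # illustrative model choice

# Load the base model in 4-bit precision to cut GPU memory roughly 4x,
# often allowing a smaller (cheaper) instance type to host it.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters instead of fine-tuning all weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters
```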

Implementation steps

  1. Identify the nature of the demand for this workload.

  2. Deploy the selected foundation model on acceptable infrastructure, even if it is initially over-provisioned.

  3. Establish an inference or demand profile for the hosted workload (see the sketch after this list).

  4. Optimize the hosting infrastructure in accordance with the workload's demands, selecting the most cost-effective infrastructure that meets performance requirements.
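One way to build the demand profile in step 3 is to pull hourly invocation metrics for the endpoint from Amazon CloudWatch; a minimal sketch follows, with a hypothetical endpoint name.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Hourly invocation counts over the last two weeks reveal daily and
# weekly demand cycles that inform right-sizing and scheduled shutdowns.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # hypothetical
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), int(point["Sum"]))
```

A couple of weeks of hourly sums is usually enough to justify scheduled shutdowns or a smaller instance when sustained demand is low.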
