GENCOST02-BP02 Optimize resource consumption to minimize hosting costs
Hosting a foundation model for inference requires myriad choices, all of which affect cost. Each of these cost dimensions can be optimized to reduce spend while still meeting performance goals.
Desired outcome: When implemented, this best practice establishes a clear relationship between cost and performance for self-hosted foundation model hosting, so that hosting infrastructure can be optimized against explicit performance goals.
Benefits of establishing this best practice:
- Measure overall efficiency - Understand the inference and hosting costs associated with the performance requirements of the foundation model.
- Stop spending money on undifferentiated heavy lifting - It is often beneficial to opt for a managed or serverless hosting paradigm, because the total cost of ownership for foundation model hosting is difficult to calculate.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
Self-hosted model infrastructure should be optimized based on the model used and the workload's usage pattern. Consider right-sizing the inference endpoint to the smallest instance available that allows you to meet performance goals. In some scenarios, it may be appropriate to shut down the hosting instance outside of active hours and restart it when demand returns; this is particularly useful for workloads with predictable usage patterns. For steady, long-running workloads, you may also consider purchasing Amazon EC2 Reserved Instances.
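The trade-off between a scheduled shutdown and a Reserved Instance commitment can be estimated with simple arithmetic. The sketch below compares the two approaches for an endpoint with predictable active hours; the hourly rates are hypothetical placeholders, not actual AWS pricing.

```python
# Hedged sketch: compare scheduled On-Demand usage against a 1-year
# Reserved Instance for a GPU hosting instance. Rates are illustrative
# assumptions only, not real AWS prices.

def monthly_cost_on_demand(hourly_rate: float, active_hours_per_day: float) -> float:
    """On-Demand cost when the endpoint runs only during active hours."""
    return hourly_rate * active_hours_per_day * 30

def monthly_cost_reserved(effective_hourly_rate: float) -> float:
    """A Reserved Instance bills for every hour, at a discounted rate."""
    return effective_hourly_rate * 24 * 30

on_demand_rate = 5.00   # hypothetical $/hour On-Demand
reserved_rate = 3.00    # hypothetical effective $/hour with a 1-year commitment

for hours in (8, 12, 16, 24):
    od = monthly_cost_on_demand(on_demand_rate, hours)
    ri = monthly_cost_reserved(reserved_rate)
    better = "scheduled On-Demand" if od < ri else "Reserved Instance"
    print(f"{hours:>2} active h/day: On-Demand ${od:,.0f} vs RI ${ri:,.0f} -> {better}")
```

With these example rates, the break-even is 14.4 active hours per day: below that, shutting the endpoint down off-hours is cheaper; above it, the Reserved Instance wins.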
In SageMaker AI HyperPod with both Amazon EKS and Slurm orchestration, use the system's advanced task governance capabilities and flexible training plans to dynamically allocate compute resources based on priority and demand, reducing costs through improved utilization.
For EKS-based HyperPod, implement the managed Kubernetes orchestration with HyperPod task governance. Configure automated scaling policies, priority classes, and node selectors to verify that your production workloads use cost-effective committed capacity while development tasks use On-Demand or Spot Instances when appropriate. Use the usage reporting feature to provide granular visibility into GPU, CPU, and Neuron Core consumption at both team and task levels, enabling transparent cost attribution and reducing guesswork in resource allocation.
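The cost-attribution idea behind usage reporting can be sketched as a simple aggregation: multiply each team's per-resource hours by a rate card and sum. The record shape and rates below are illustrative assumptions, not the actual HyperPod usage-report schema.

```python
# Hedged sketch: allocate cluster cost by team from per-task usage
# records. Field names and rates are hypothetical, for illustration only.
from collections import defaultdict

usage = [
    {"team": "research", "resource": "gpu", "hours": 120.0},
    {"team": "research", "resource": "cpu", "hours": 300.0},
    {"team": "platform", "resource": "gpu", "hours": 40.0},
]

rates = {"gpu": 4.00, "cpu": 0.10}  # hypothetical $/resource-hour

def cost_by_team(records, rate_card):
    """Sum each team's spend across all resource types."""
    totals = defaultdict(float)
    for r in records:
        totals[r["team"]] += r["hours"] * rate_card[r["resource"]]
    return dict(totals)

print(cost_by_team(usage, rates))
```

Grouping by task instead of team is the same aggregation with a different key, which is how granular attribution at both levels falls out of one report.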
For Slurm-based HyperPod, use Slurm's native job scheduling and resource management features combined with HyperPod's auto-resume functionality to minimize wasted compute cycles during hardware failures, potentially reducing total training time in large clusters. Both systems benefit from right-sizing strategies through SageMaker AI HyperPod Recipes, which provide pre-configured, benchmarked training stacks optimized for specific model architectures like Llama and Mistral, delivering optimized performance while minimizing resource waste.
Additionally, establish flexible training plans that set timeline and budget constraints, and allow HyperPod to automatically find the best combination of capacity blocks and create cost-optimized execution plans that avoid the overspend caused by overprovisioning servers for training jobs.
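The planning logic can be sketched as a constrained selection: among available capacity options, pick the cheapest one that completes the required GPU-hours before the deadline. The block fields and numbers below are illustrative assumptions; the real flexible training plan API works differently.

```python
# Hedged sketch: choose the cheapest capacity option that finishes a
# fixed amount of training work before a deadline. All fields and rates
# are hypothetical placeholders for illustration.

def pick_plan(blocks, gpu_hours_needed, deadline_days):
    """Return the name of the cheapest feasible block, or None."""
    feasible = []
    for b in blocks:
        days_to_finish = gpu_hours_needed / (b["gpus"] * 24)
        if days_to_finish <= min(deadline_days, b["available_days"]):
            cost = gpu_hours_needed * b["rate"]  # work is fixed; cost scales with rate
            feasible.append((cost, b["name"]))
    return min(feasible)[1] if feasible else None

blocks = [
    {"name": "block-8gpu",  "gpus": 8,  "rate": 4.0, "available_days": 30},
    {"name": "block-32gpu", "gpus": 32, "rate": 4.5, "available_days": 30},
]
print(pick_plan(blocks, gpu_hours_needed=3840, deadline_days=14))
```

With a 14-day deadline the 8-GPU block cannot finish in time, so the larger, slightly pricier block is selected; relax the deadline to 30 days and the cheaper 8-GPU block wins instead. That is the budget-versus-timeline trade-off the plan automates.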
Inference workloads can be optimized using advanced techniques such as quantization or LoRA adaptation. These capabilities are available for certain models in Amazon Bedrock and for self-hosted models on Amazon SageMaker AI. These advanced inference techniques can further optimize resource consumption for inference, thus reducing hosting and inference serving costs.
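One reason quantization reduces hosting cost is that weight memory shrinks roughly in proportion to bits per parameter, which can move a model onto a smaller, cheaper instance. The sketch below estimates weight memory only; activation and KV-cache overheads, which also matter for instance selection, are deliberately ignored.

```python
# Hedged sketch: estimate model weight memory at different precisions to
# judge whether a smaller (cheaper) instance could host the model.
# Ignores activation and KV-cache memory for simplicity.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB for a model of the given size and precision."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp16", "int8", "int4"):
    print(f"7B model at {p}: ~{weight_memory_gb(7, p):.1f} GB of weights")
```

A 7B-parameter model drops from roughly 14 GB of weights at fp16 to about 3.5 GB at int4, which is the difference between needing a large accelerator and fitting comfortably on a smaller one.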
Implementation steps
- Identify the nature of the demand for this workload.
- Deploy the selected foundation model on acceptable infrastructure, even if it is initially over-provisioned.
- Establish an inference or demand profile for the hosted workload.
- Optimize the hosting infrastructure in accordance with the workload's demands, and select the most cost-optimized infrastructure that meets performance requirements.
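The final step above can be sketched as a selection over benchmark results: once the demand profile yields latency and throughput requirements, choose the cheapest instance whose measured numbers satisfy both. The instance names, prices, and benchmark figures below are invented for illustration.

```python
# Hedged sketch: right-size by picking the cheapest instance whose
# measured performance meets the workload's requirements. All numbers
# are hypothetical placeholders, not real instance benchmarks.

benchmarks = [
    {"name": "small-gpu",  "hourly": 1.50, "p99_ms": 900, "rps": 5},
    {"name": "medium-gpu", "hourly": 4.00, "p99_ms": 400, "rps": 20},
    {"name": "large-gpu",  "hourly": 12.0, "p99_ms": 250, "rps": 60},
]

def right_size(candidates, max_p99_ms, min_rps):
    """Cheapest instance meeting both latency and throughput, or None."""
    ok = [b for b in candidates
          if b["p99_ms"] <= max_p99_ms and b["rps"] >= min_rps]
    return min(ok, key=lambda b: b["hourly"])["name"] if ok else None

print(right_size(benchmarks, max_p99_ms=500, min_rps=10))
```

If no candidate qualifies, the function returns None, signaling that the performance target, not cost, is the binding constraint and the requirements or model choice need revisiting.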
Resources
Related best practices:
Related videos and documents:
Related examples:
- Track, allocate and manage your generative AI cost and usage with Amazon Bedrock
- SageMaker AI Inference Recommender for HuggingFace BERT Sentiment Analysis
- Introducing Amazon SageMaker AI HyperPod to train foundation models at scale
- Best practices for Amazon SageMaker AI HyperPod task governance
- Get started with Amazon SageMaker AI HyperPod task governance
- Usage reporting for cost attribution in SageMaker AI HyperPod