Deploying models on Amazon SageMaker HyperPod
Amazon SageMaker HyperPod now extends beyond training to deliver a comprehensive inference platform that combines the flexibility of Kubernetes with the operational excellence of AWS managed services. Deploy, scale, and optimize your machine learning models with enterprise-grade reliability using the same HyperPod compute throughout the entire model lifecycle.
Amazon SageMaker HyperPod offers flexible deployment interfaces that allow you to deploy models through kubectl, the Python SDK, the Amazon SageMaker Studio UI, or the HyperPod CLI. The service provides autoscaling that dynamically adjusts resource allocation based on demand. Additionally, it includes comprehensive observability and monitoring features that track critical metrics such as time-to-first-token, latency, and GPU utilization to help you optimize performance.
Note
When deploying on GPU-enabled instances, you can use GPU partitioning with Multi-Instance GPU (MIG) technology to run multiple inference workloads on a single GPU. This allows for better GPU utilization and cost optimization. For more information about configuring GPU partitioning, see Using GPU partitions in Amazon SageMaker HyperPod.
Unified infrastructure for training and inference
Maximize your GPU utilization by seamlessly transitioning compute resources between training and inference workloads. This reduces the total cost of ownership while maintaining operational continuity.
Enterprise-ready deployment options
Deploy models from multiple sources, including open-weights and gated models from Amazon SageMaker JumpStart and custom models from Amazon S3 and Amazon FSx, with support for both single-node and multi-node inference architectures.
Managed tiered key-value (KV) caching and intelligent routing
KV caching saves the precomputed key-value vectors from previously processed tokens, so they don't need to be recalculated when generating the next token. Through a two-tier caching architecture, you can configure an L1 cache that uses CPU memory for low-latency local reuse, and an L2 cache that uses Redis to enable scalable, node-level cache sharing.
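The two-tier lookup described above can be sketched as follows. This is a minimal illustrative model, not the HyperPod implementation: the real L2 backend is Redis (or managed tiered storage), and caching is configured declaratively rather than in application code. Here a plain dict stands in for the shared L2 store.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: a small L1 (local CPU memory)
    backed by a larger shared L2 (a dict standing in for Redis)."""

    def __init__(self, l1_capacity=2):
        self.l1 = OrderedDict()   # fast, local, LRU-evicted
        self.l2 = {}              # shared across nodes (Redis stand-in)
        self.l1_capacity = l1_capacity

    def put(self, prefix, kv_vectors):
        self.l2[prefix] = kv_vectors          # write through to shared L2
        self.l1[prefix] = kv_vectors
        self.l1.move_to_end(prefix)
        if len(self.l1) > self.l1_capacity:   # evict least-recently-used from L1
            self.l1.popitem(last=False)

    def get(self, prefix):
        if prefix in self.l1:                 # L1 hit: lowest latency
            self.l1.move_to_end(prefix)
            return self.l1[prefix], "L1"
        if prefix in self.l2:                 # L2 hit: promote back into L1
            self.put(prefix, self.l2[prefix])
            return self.l2[prefix], "L2"
        return None, "miss"                   # recompute the KV vectors

cache = TieredKVCache(l1_capacity=2)
cache.put("You are a helpful", [0.1, 0.2])
cache.put("Summarize this", [0.3, 0.4])
cache.put("Translate to French", [0.5, 0.6])  # evicts the oldest L1 entry

_, tier = cache.get("You are a helpful")
print(tier)  # prints "L2": evicted from L1, still served from shared L2
```

An entry that falls out of the small L1 is still recoverable from L2, which is the design point of the tiered architecture: node-local speed when possible, cluster-wide reuse otherwise.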
Intelligent routing analyzes incoming requests and directs them to the inference instance most likely to have relevant cached key-value pairs. The system examines the request and then routes it based on one of the following routing strategies:
prefixaware — Subsequent requests with the same prompt prefix are routed to the same instance.
kvaware — Incoming requests are routed to the instance with the highest KV cache hit rate.
session — Requests from the same user session are routed to the same instance.
roundrobin — Requests are distributed evenly without considering the state of the KV cache.
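The four strategies can be sketched as a single dispatch function. This is a conceptual sketch, not the HyperPod router: the instance names are hypothetical, and consistent hashing is used here merely to approximate the "same prefix (or session) goes to the same instance" behavior.

```python
import hashlib
from itertools import cycle

INSTANCES = ["instance-0", "instance-1", "instance-2"]  # hypothetical fleet
_rr = cycle(INSTANCES)

def _stable_pick(key: str) -> str:
    """Deterministically map a string key onto one instance."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return INSTANCES[int(digest, 16) % len(INSTANCES)]

def route(request, strategy, kv_hit_rates=None):
    if strategy == "prefixaware":
        # same prompt prefix -> same instance, so cached KV pairs get reused
        return _stable_pick(request["prompt"][:32])
    if strategy == "kvaware":
        # pick the instance reporting the best KV cache hit rate
        return max(kv_hit_rates, key=kv_hit_rates.get)
    if strategy == "session":
        # all requests in a user session stick to one instance
        return _stable_pick(request["session_id"])
    if strategy == "roundrobin":
        # even distribution, cache state ignored
        return next(_rr)
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, two requests that share the same long system prompt route to the same instance under prefixaware, while roundrobin cycles through the fleet regardless of prompt content.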
For more information on how to enable this feature, see Configure KV caching and intelligent routing for improved performance.
Built-in tiered storage support for the L2 KV cache
Building on the existing KV cache infrastructure, HyperPod integrates SageMaker managed tiered storage as an additional L2 backend option alongside Redis. This gives customers a more scalable, higher-performance option for cache offloading, which is particularly beneficial for high-throughput LLM inference workloads, while maintaining compatibility with existing vLLM model servers and routing capabilities.
Note
Data encryption: KV cache data (attention keys and values) is stored unencrypted at rest to optimize inference latency and improve performance. For workloads with strict encryption-at-rest requirements, consider application-layer encryption of prompts and responses, or disable caching.
Data isolation: When using managed tiered storage as the L2 cache backend, multiple inference deployments within a cluster share cache storage with no isolation. L2 KV cache data (attention keys and values) from different deployments is not separated. For workloads requiring data isolation (multi-tenant scenarios, different data classification levels), deploy to separate clusters or use dedicated Redis instances.
Multi-instance type deployment with automatic failover
HyperPod Inference supports multi-instance type deployment to improve deployment reliability and resource utilization.
Specify a prioritized list of instance types in your deployment configuration, and the system automatically selects from available alternatives when your preferred instance type lacks capacity.
The Kubernetes scheduler uses preferredDuringSchedulingIgnoredDuringExecution node affinity to evaluate instance types in priority order, placing workloads on the highest-priority available instance type while ensuring deployment even when preferred resources are unavailable.
This capability prevents deployment failures due to capacity constraints while maintaining your cost and performance preferences, ensuring continuous service availability even during cluster capacity fluctuations.
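The prioritized instance-type list maps onto standard Kubernetes node affinity. The following pod-spec fragment, expressed as a Python dict, sketches how this could look; the affinity field names and the node label are standard Kubernetes, while the weights and instance types are hypothetical examples:

```python
# Soft (preferred) node affinity: higher weight = higher priority.
# The scheduler tries ml.p5.48xlarge first, falls back to ml.p4d.24xlarge,
# and can still place the pod elsewhere if neither type has capacity.
affinity = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,  # most preferred instance type
                "preference": {
                    "matchExpressions": [{
                        "key": "node.kubernetes.io/instance-type",
                        "operator": "In",
                        "values": ["ml.p5.48xlarge"],
                    }]
                },
            },
            {
                "weight": 50,   # fallback when the preferred type lacks capacity
                "preference": {
                    "matchExpressions": [{
                        "key": "node.kubernetes.io/instance-type",
                        "operator": "In",
                        "values": ["ml.p4d.24xlarge"],
                    }]
                },
            },
        ]
    }
}
```

Because the affinity is preferred rather than required, capacity shortages degrade placement priority instead of failing the deployment outright.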
Custom node affinity for granular scheduling control
HyperPod Inference supports custom node affinity to control workload placement beyond instance type selection.
Specify node selection criteria such as availability zone distribution, capacity type filtering (on-demand vs. spot), or custom node labels through the nodeAffinity field.
The system supports mandatory placement constraints using requiredDuringSchedulingIgnoredDuringExecution and optional preferences through preferredDuringSchedulingIgnoredDuringExecution, providing full control over pod scheduling decisions while maintaining deployment flexibility.
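A sketch of combining both constraint types, again as a Python dict fragment. The zone label is the standard Kubernetes topology label; the capacity-type label and its values follow the EKS convention, and the zones shown are illustrative:

```python
# Hard constraint (required): on-demand nodes only, in two specific zones.
# Soft preference (preferred): favor us-west-2a when capacity allows.
affinity = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": "topology.kubernetes.io/zone",
                     "operator": "In",
                     "values": ["us-west-2a", "us-west-2b"]},
                    {"key": "eks.amazonaws.com/capacityType",
                     "operator": "In",
                     "values": ["ON_DEMAND"]},   # exclude spot capacity
                ]
            }]
        },
        "preferredDuringSchedulingIgnoredDuringExecution": [{
            "weight": 100,
            "preference": {
                "matchExpressions": [{
                    "key": "topology.kubernetes.io/zone",
                    "operator": "In",
                    "values": ["us-west-2a"],
                }]
            },
        }],
    }
}
```

Pods that cannot satisfy the required terms stay pending, while the preferred term only biases placement, so the two together express "must be on-demand in these zones, ideally in us-west-2a."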
Note
We collect certain routine operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model inference workload. These metrics relate to deployment operations, resource management, and endpoint registration.