
Train and deploy models with HyperPod CLI and SDK

Amazon SageMaker HyperPod helps you train and deploy machine learning (ML) models at scale. The HyperPod CLI is a command-line interface that abstracts infrastructure details so that you can submit, monitor, and manage ML training jobs. It is aimed at data scientists and ML engineers who want to focus on model development rather than infrastructure management.

This topic is written for first-time users and covers three scenarios: training a PyTorch model, deploying a custom model using trained artifacts, and deploying a JumpStart model. You can complete each scenario with either the HyperPod CLI or the SDK. The handshake between training and inference, in which trained model artifacts are passed on to deployment, helps you manage those artifacts.

Prerequisites

Before you begin using Amazon SageMaker HyperPod, make sure you have:

An AWS account with access to Amazon SageMaker HyperPod
Python 3.9, 3.10, or 3.11 installed
AWS CLI configured with appropriate credentials
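
The local parts of this list can be checked with a short preflight script. This is a minimal sketch, not part of the HyperPod tooling: the function names `is_supported_python` and `aws_cli_on_path` are hypothetical, and the second check only confirms that an `aws` executable is on the PATH. It does not verify credentials or HyperPod access.

```python
import shutil
import sys

# Python minor versions listed in the prerequisites above.
SUPPORTED_PYTHON = {(3, 9), (3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if (major, minor) is one of the listed versions.

    Hypothetical helper; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) in SUPPORTED_PYTHON


def aws_cli_on_path():
    """Return True if an `aws` executable can be found on PATH.

    This says nothing about whether credentials are configured.
    """
    return shutil.which("aws") is not None


if __name__ == "__main__":
    print("Python version supported:", is_supported_python())
    print("AWS CLI found on PATH:", aws_cli_on_path())
```

If either check prints `False`, fix that prerequisite before you install the package.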

Install the HyperPod CLI and SDK

Install the required package to access the CLI and SDK:


```
pip install sagemaker-hyperpod
```

This command sets up the tools needed to interact with HyperPod clusters.
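
To confirm that the package is present in the active Python environment, you can query its installed version with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper name, and the only HyperPod-specific value it uses is the distribution name from the `pip install` command above.

```python
from importlib import metadata


def installed_version(distribution="sagemaker-hyperpod"):
    """Return the installed version string, or None if not installed.

    Hypothetical helper; it only inspects package metadata and does
    not import or run any HyperPod code.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("sagemaker-hyperpod is not installed in this environment")
    else:
        print("sagemaker-hyperpod version:", version)
```

A result of `None` usually means that `pip` installed the package into a different environment than the one running the script.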

Configure your cluster context

HyperPod operates on clusters optimized for machine learning. Start by listing available clusters to select one for your tasks.

List all available clusters:
```
hyp list-cluster
```

Choose and set your active cluster:


```
hyp set-cluster-context your-eks-cluster-name
```

Verify the configuration:
```
hyp get-cluster-context
```

Note

All subsequent commands target the cluster you've set as your context.
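
If you want to script the three steps above, one option is to wrap the same commands with Python's `subprocess` module. This is a minimal sketch under stated assumptions: `context_commands` and `configure_context` are hypothetical helper names, and the commands are copied in the form shown in this section. If your installed CLI version expects a different argument form, adjust the lists accordingly.

```python
import shutil
import subprocess


def context_commands(cluster_name):
    """Build the three commands from this section, in order.

    Hypothetical helper; the argument forms mirror the commands
    shown above and are not checked against the CLI itself.
    """
    return [
        ["hyp", "list-cluster"],
        ["hyp", "set-cluster-context", cluster_name],
        ["hyp", "get-cluster-context"],
    ]


def configure_context(cluster_name):
    """Run each command in turn and stop at the first failure."""
    if shutil.which("hyp") is None:
        raise RuntimeError(
            "hyp not found on PATH; install sagemaker-hyperpod first"
        )
    for command in context_commands(cluster_name):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed step is not silently skipped.
        subprocess.run(command, check=True)
```

Calling `configure_context("your-eks-cluster-name")` runs the list, set, and verify steps in sequence. Passing the commands as argument lists rather than a single shell string avoids quoting problems with cluster names.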

Choose your scenario

For detailed instructions, see the topic for each scenario:

Topics

Train a PyTorch model
Deploy a custom model using trained artifacts
Deploy a JumpStart model
