Update your AMI version in your SageMaker HyperPod cluster
Amazon SageMaker HyperPod Amazon Machine Images (AMIs) are specialized machine images for distributed machine learning workloads and high-performance computing. Each AMI comes pre-loaded with drivers, machine learning frameworks, training libraries, and performance monitoring tools. By updating the AMI version in your cluster, you can use the latest versions of these components and packages for your training jobs and workflows.
When you update the AMI version in your cluster, you can apply the update immediately, schedule a one-time update, or use a cron expression to create a recurring schedule. You can update all of the instances in an instance group at once or update them in batches. If you update in batches, you set the percentage or number of instances that SageMaker AI updates at a time, and the interval that SageMaker AI waits between batches.
If you choose to update in batches, you can also include a list of alarms and metrics. During the wait interval, SageMaker AI observes these metrics. If any metric exceeds its threshold, the corresponding alarm goes into the ALARM state, and SageMaker AI rolls back the AMI update. To use automatic rollbacks, your IAM execution role must have the cloudwatch:DescribeAlarms permission.
Note
Updating your cluster in batches is available only for HyperPod clusters integrated with Amazon EKS. Also, if you’re creating multiple schedules, we recommend that you have a time buffer in between schedules. If schedules overlap, updates might fail.
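The batch behavior described above can be sketched in Python. This is an illustration only, not how SageMaker AI implements it: `plan_batches` and `rolling_update` are hypothetical helpers, rounding a percentage up to a whole instance is an assumption, and `alarm_in_alarm_state` is a hypothetical callback that stands in for checking the configured alarms during the wait interval (the wait itself is omitted).

```python
import math
from typing import Callable, Dict, List, Optional


def plan_batches(instance_ids: List[str],
                 percentage: Optional[int] = None,
                 count: Optional[int] = None) -> List[List[str]]:
    """Split instances into update batches by percentage or by fixed count."""
    if (percentage is None) == (count is None):
        raise ValueError("Specify exactly one of percentage or count")
    if percentage is not None:
        if not 1 <= percentage <= 100:
            raise ValueError("percentage must be between 1 and 100")
        # Assumption: round up so that every batch has at least one instance.
        size = max(1, math.ceil(len(instance_ids) * percentage / 100))
    else:
        if count < 1:
            raise ValueError("count must be at least 1")
        size = count
    return [instance_ids[i:i + size] for i in range(0, len(instance_ids), size)]


def rolling_update(batches: List[List[str]],
                   alarm_in_alarm_state: Callable[[], bool]) -> Dict[str, object]:
    """Process batches in order and stop as soon as an alarm fires.

    The real wait interval between batches is left out; the callback is
    consulted once after each batch instead.
    """
    processed: List[str] = []
    for batch in batches:
        processed.extend(batch)
        if alarm_in_alarm_state():
            return {"status": "rolled_back", "processed": processed}
    return {"status": "completed", "processed": processed}
```

With ten instances and a percentage of 25, this sketch produces batches of three, three, three, and one instance.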
For more information about each AMI release for your HyperPod cluster, see Amazon SageMaker HyperPod AMI. For more information about general HyperPod releases, see Amazon SageMaker HyperPod release notes.
You can use the SageMaker AI API or CLI operations to update your cluster or see scheduled updates for a specific cluster. If you're using the AWS console, follow these steps:
Note
Updating your AMI with the AWS console is available only for clusters integrated with Amazon EKS. If you have a Slurm cluster, you must use the SageMaker AI API or CLI operations.
- Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.
- On the left, expand HyperPod Clusters, and choose Cluster Management.
- Choose the cluster that you want to update, then choose Details, and then choose Update AMI.
To create and manage update schedules programmatically, use the following API operations:
- CreateCluster – create a cluster while specifying an update schedule
- UpdateCluster – update a cluster to add an update schedule
- UpdateClusterSoftware – update the platform software of a cluster
- DescribeCluster – see an update schedule you created for a cluster
- DescribeClusterNode and ListClusterNodes – see when the cluster was last updated
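As a rough illustration of the last item, the following Python sketch collects the last update time for each node through ListClusterNodes. It assumes a boto3 SageMaker client whose `list_cluster_nodes` call accepts `ClusterName` and `NextToken` and returns `ClusterNodeSummaries`; the `LastSoftwareUpdateTime` field name is an assumption to verify against the API reference, which is why the code reads it with `.get()`. The helper names are hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_cluster_nodes(client: Any, cluster_name: str) -> Iterator[Dict[str, Any]]:
    """Yield node summaries for a cluster, following NextToken pagination."""
    kwargs = {"ClusterName": cluster_name}
    while True:
        page = client.list_cluster_nodes(**kwargs)
        for node in page.get("ClusterNodeSummaries", []):
            yield node
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def last_update_times(client: Any, cluster_name: str) -> Dict[str, Any]:
    """Map each instance ID to its last software update time, if reported.

    Assumption: the summary exposes the time as LastSoftwareUpdateTime.
    A node without that field maps to None.
    """
    return {
        node["InstanceId"]: node.get("LastSoftwareUpdateTime")
        for node in iter_cluster_nodes(client, cluster_name)
    }
```

With boto3 installed and credentials configured, you would pass `boto3.client("sagemaker")` as `client`. Taking the client as a parameter keeps the helpers easy to exercise with a stub.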
Required permissions
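Depending on how you configured your Pod Disruption Budget, your cluster might need the following permissions, which allow listing pods and creating pod evictions: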
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: hyperpod-patching
    rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["list"]
    - apiGroups: [""]
      resources: ["pods/eviction"]
      verbs: ["create"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: hyperpod-patching
    subjects:
    - kind: User
      name: hyperpod-service-linked-role
    roleRef:
      kind: ClusterRole
      name: hyperpod-patching
      apiGroup: rbac.authorization.k8s.io
Use the following commands to apply the permissions.
    git clone https://github.com/aws/sagemaker-hyperpod-cli.git
    cd sagemaker-hyperpod-cli/helm_chart
    helm upgrade hyperpod-dependencies HyperPodHelmChart --namespace kube-system --install
Cron expressions
To configure a one-time update at a certain time or a recurring schedule, use cron expressions. A cron expression has six fields separated by white space. All six fields are required.

    cron(Minutes Hours Day-of-month Month Day-of-week Year)
| Fields | Values | Wildcards |
|---|---|---|
| Minutes | 00 – 59 | N/A |
| Hours | 00 – 23 | N/A |
| Day-of-month | 01 – 31 | ? |
| Month | 01 – 12 | * / |
| Day-of-week | 1 – 7 or MON-SUN | ? # L |
| Year | Current year – 2099 | * |
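The table can be turned into a quick pre-flight check before you submit a schedule. The following Python sketch is a hypothetical helper, not part of any AWS SDK: it checks only the field count, value ranges, and wildcards listed in the table, and it makes no claim about what the service accepts beyond that. Limiting `#N` to 1 through 5 is an assumption based on a month having at most five of any weekday.

```python
import re
from datetime import date
from typing import List, Optional

DAY_NAMES = "MON|TUE|WED|THU|FRI|SAT|SUN"


def _in_range(value: str, low: int, high: int) -> bool:
    """True if value is a plain decimal number within [low, high]."""
    return re.fullmatch(r"[0-9]+", value) is not None and low <= int(value) <= high


def validate_cron(expression: str, current_year: Optional[int] = None) -> List[str]:
    """Return the problems found; an empty list means none were found."""
    if current_year is None:
        current_year = date.today().year
    match = re.fullmatch(r"cron\((.*)\)", expression.strip())
    if match is None:
        return ["expression must have the form cron(...)"]
    fields = match.group(1).split()
    if len(fields) != 6:
        return [f"expected 6 fields, found {len(fields)}"]
    minutes, hours, day_of_month, month, day_of_week, year = fields
    problems = []
    if not _in_range(minutes, 0, 59):
        problems.append("Minutes must be 00-59")
    if not _in_range(hours, 0, 23):
        problems.append("Hours must be 00-23")
    if day_of_month != "?" and not _in_range(day_of_month, 1, 31):
        problems.append("Day-of-month must be 01-31 or ?")
    if not (month == "*" or re.fullmatch(r"\*/[0-9]+", month)
            or _in_range(month, 1, 12)):
        problems.append("Month must be 01-12, *, or */N")
    if not (day_of_week == "?"
            or re.fullmatch(rf"([1-7]|{DAY_NAMES})(#[1-5]|L)?", day_of_week)):
        problems.append("Day-of-week must be 1-7 or MON-SUN, with optional #N or L, or ?")
    if not (year == "*" or _in_range(year, current_year, 2099)):
        problems.append(f"Year must be {current_year}-2099 or *")
    return problems
```

The `current_year` parameter exists so that the year check can be pinned in tests; by default it uses today's date.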
Wildcards
- The * (asterisk) wildcard includes all values in the field. In the Hours field, * would include every hour.
- The / (forward slash) wildcard specifies increments. In the Month field, you could enter */3 to specify every 3rd month.
- The ? (question mark) wildcard specifies one or another. In the Day-of-month field you could enter 7, and if you didn't care what day of the week the seventh was, you could enter ? in the Day-of-week field.
- The L wildcard in the Day-of-week field specifies the last day of the month or week. For example, 5L means the last Friday of the month.
- The # wildcard in the Day-of-week field specifies a certain instance of the specified day of the week within a month. For example, 3#2 would be the second Wednesday of the month: the 3 refers to Wednesday because it is the third day of the week when 1 is Monday (as in the 5L example), and the 2 refers to the second day of that type within the month.
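To make the # and L wildcards concrete, the following Python sketch resolves a Day-of-week token to a calendar date for a given month. The function `resolve_day_of_week` is a hypothetical helper, and it assumes that 1 is Monday and 7 is Sunday, which matches the 5L example (last Friday).

```python
import calendar
from datetime import date


def resolve_day_of_week(token: str, year: int, month: int) -> date:
    """Resolve 'D#N' (Nth weekday) or 'DL' (last weekday) to a date.

    Assumption: 1 is Monday and 7 is Sunday.
    Only the numeric forms with # or L are handled.
    """
    if token.endswith("L"):
        weekday, nth = int(token[:-1]), None
    elif "#" in token:
        day_part, _, nth_part = token.partition("#")
        weekday, nth = int(day_part), int(nth_part)
    else:
        raise ValueError("token must look like 3#2 or 5L")
    if not 1 <= weekday <= 7:
        raise ValueError("weekday must be between 1 and 7")
    # date.weekday() uses 0 for Monday, so shift the 1-7 numbering down by one.
    days = [d for d in calendar.Calendar().itermonthdates(year, month)
            if d.month == month and d.weekday() == weekday - 1]
    if nth is None:
        return days[-1]
    if not 1 <= nth <= len(days):
        raise ValueError("that weekday does not occur that many times this month")
    return days[nth - 1]
```

For December 2024, `resolve_day_of_week("5L", 2024, 12)` returns December 27, the last Friday of that month.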
You can use cron expressions for the following scenarios:
- A one-time schedule that runs at a certain time and day. You can use the ? wildcard to denote that day-of-month or day-of-week doesn't matter.

      cron(30 14 ? 12 MON 2024)
      cron(30 14 15 12 ? 2024)

- A weekly schedule that runs at a certain time and day. The following example creates a schedule that runs at 12:00 PM every Monday, regardless of day-of-month.

      cron(00 12 ? * 1 *)

- A monthly schedule that runs every month regardless of the day-of-week. The following schedule runs at 12:30 PM on the 15th of every month.

      cron(30 12 15 * ? *)

- A monthly schedule that uses day-of-week.

      cron(30 12 ? * MON *)

- A schedule that runs every Nth month, which uses the / wildcard. The following two examples run every 3 months, the first with day-of-month and the second with day-of-week.

      cron(30 12 15 */3 ? *)
      cron(30 12 ? */3 MON *)

- A schedule that runs on a certain instance of the specified day of the week. The following example creates a schedule that runs at 12:30 PM on the second Monday of every month.

      cron(30 12 ? * 1#2 *)

- A schedule that runs on the last instance of the specified day of the week. The following schedule runs at 12:30 PM on the last Monday of every month.

      cron(30 12 ? * 1L *)
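If you generate schedules in code, small builder functions can reduce formatting mistakes such as a missing field. The following Python sketch reproduces several of the expressions above. All function names are hypothetical, and zero-padding the minutes, hours, day-of-month, and month fields is an assumption based on the two-digit values shown in the fields table.

```python
from datetime import datetime
from typing import Union

Field = Union[int, str]


def _pad(value: Field) -> str:
    """Zero-pad integers to two digits; pass wildcards through unchanged."""
    return f"{value:02d}" if isinstance(value, int) else str(value)


def cron(minute: Field, hour: Field, day_of_month: Field = "?",
         month: Field = "*", day_of_week: Field = "?", year: Field = "*") -> str:
    """Format cron(Minutes Hours Day-of-month Month Day-of-week Year)."""
    fields = [_pad(minute), _pad(hour), _pad(day_of_month), _pad(month),
              str(day_of_week), str(year)]
    return "cron(" + " ".join(fields) + ")"


def one_time(when: datetime) -> str:
    """A single run at the given date and time, using day-of-month."""
    return cron(when.minute, when.hour, when.day, when.month, "?", when.year)


def weekly(hour: int, minute: int, day_of_week: Field) -> str:
    """Every week on the given day, regardless of day-of-month."""
    return cron(minute, hour, day_of_week=day_of_week)


def monthly_on_day(hour: int, minute: int, day: int) -> str:
    """Every month on the given day-of-month."""
    return cron(minute, hour, day_of_month=day)


def every_nth_month(hour: int, minute: int, n: int, day: int) -> str:
    """Every nth month on the given day-of-month, using the / wildcard."""
    return cron(minute, hour, day_of_month=day, month=f"*/{n}")
```

For example, `weekly(12, 0, 1)` builds the weekly Monday expression and `every_nth_month(12, 30, 3, 15)` builds the first of the two every-3-months expressions.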