本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。
使用 Karpenter 自動擴展來建立和設定 HyperPod 叢集
在下列步驟中,您將建立已啟用持續佈建的 SageMaker HyperPod 叢集,並將其設定為使用 Karpenter 型自動擴展。
建立 HyperPod 叢集
-
載入您的環境組態,並從 CloudFormation 堆疊中擷取值。
source .env SUBNET1=$(cfn-output $VPC_STACK_NAME PrivateSubnet1) SUBNET2=$(cfn-output $VPC_STACK_NAME PrivateSubnet2) SUBNET3=$(cfn-output $VPC_STACK_NAME PrivateSubnet3) SECURITY_GROUP=$(cfn-output $VPC_STACK_NAME NoIngressSecurityGroup) EKS_CLUSTER_ARN=$(cfn-output $EKS_STACK_NAME ClusterArn) EXECUTION_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ExecutionRole) SERVICE_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ServiceRole) BUCKET_NAME=$(cfn-output $SAGEMAKER_STACK_NAME Bucket) HP_CLUSTER_NAME="hyperpod-eks-test-$(date +%s)" EKS_CLUSTER_NAME=$(cfn-output $EKS_STACK_NAME ClusterName) HP_CLUSTER_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ClusterRole) -
將節點初始化指令碼上傳至您的 Amazon S3 儲存貯體。
aws s3 cp lifecyclescripts/on_create_noop.sh s3://$BUCKET_NAME -
使用您的環境變數建立叢集組態檔案。
cat > cluster_config.json << EOF { "ClusterName": "$HP_CLUSTER_NAME", "InstanceGroups": [ { "InstanceCount": 1, "InstanceGroupName": "system", "InstanceType": "ml.c5.xlarge", "LifeCycleConfig": { "SourceS3Uri": "s3://$BUCKET_NAME", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "$EXECUTION_ROLE" }, { "InstanceCount": 0, "InstanceGroupName": "auto-c5-az1", "InstanceType": "ml.c5.xlarge", "LifeCycleConfig": { "SourceS3Uri": "s3://$BUCKET_NAME", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "$EXECUTION_ROLE" }, { "InstanceCount": 0, "InstanceGroupName": "auto-c5-4xaz2", "InstanceType": "ml.c5.4xlarge", "LifeCycleConfig": { "SourceS3Uri": "s3://$BUCKET_NAME", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "$EXECUTION_ROLE", "OverrideVpcConfig": { "SecurityGroupIds": [ "$SECURITY_GROUP" ], "Subnets": [ "$SUBNET2" ] } }, { "InstanceCount": 0, "InstanceGroupName": "auto-g5-az3", "InstanceType": "ml.g5.xlarge", "LifeCycleConfig": { "SourceS3Uri": "s3://$BUCKET_NAME", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "$EXECUTION_ROLE", "OverrideVpcConfig": { "SecurityGroupIds": [ "$SECURITY_GROUP" ], "Subnets": [ "$SUBNET3" ] } } ], "VpcConfig": { "SecurityGroupIds": [ "$SECURITY_GROUP" ], "Subnets": [ "$SUBNET1" ] }, "Orchestrator": { "Eks": { "ClusterArn": "$EKS_CLUSTER_ARN" } }, "ClusterRole": "$HP_CLUSTER_ROLE", "AutoScaling": { "Mode": "Enable", "AutoScalerType": "Karpenter" }, "NodeProvisioningMode": "Continuous" } EOF -
執行下列命令以建立您的 HyperPod 叢集。
aws sagemaker create-cluster --cli-input-json file://./cluster_config.json -
叢集建立程序約需 20 分鐘的時間。監控叢集狀態,直到 ClusterStatus 和 AutoScaling.Status 都顯示 InService。
-
儲存叢集 ARN 以進行後續操作。
HP_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME \ --output text --query ClusterArn)
啟用 Karpenter 自動擴展
-
執行下列命令,在具有連續節點佈建模式的任何預先存在叢集上啟用 Karpenter 型自動擴展。
aws sagemaker update-cluster \ --cluster-name $HP_CLUSTER_NAME \ --auto-scaling Mode=Enable,AutoScalerType=Karpenter \ --cluster-role $HP_CLUSTER_ROLE -
驗證是否已成功啟用 Karpenter:
aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --query 'AutoScaling' -
預期的輸出結果:
{ "Mode": "Enable", "AutoScalerType": "Karpenter", "Status": "InService" }
等待 Status 顯示 InService,再繼續設定 NodeClass 和 NodePool。