本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。
設定 HyperPod 叢集以進行模型部署
本指南說明如何在 Amazon SageMaker HyperPod 叢集上啟用推論功能。您將設定機器學習工程師部署和管理推論端點所需的基礎設施、許可和運算子。
注意
若要建立已預先安裝推論運算子的叢集,請參閱 建立 EKS 協作的 SageMaker HyperPod 叢集。若要在現有叢集上安裝推論運算子,請繼續執行下列程序。
您可以使用 SageMaker AI 主控台安裝推論運算子,以獲得簡化的體驗,或使用 AWS CLI 進行更多控制。本指南涵蓋兩種安裝方法。
方法 1:透過 SageMaker AI 主控台安裝 HyperPod 推論附加元件 (建議)
SageMaker AI 主控台提供兩種安裝選項的最簡化體驗:
-
快速安裝:使用最佳化預設值自動建立所有必要的資源,包括 IAM 角色、Amazon S3 儲存貯體和相依性附加元件。將建立新的 Studio 網域,其中包含將 JumpStart 模型部署至相關叢集所需的許可。此選項非常適合以最少的組態決策快速入門。
-
自訂安裝:提供彈性來指定現有資源或自訂組態,同時維持一鍵式體驗。客戶可以選擇根據組織需求重複使用現有的 IAM 角色、Amazon S3 儲存貯體或相依性附加元件。
先決條件
-
具有 Amazon EKS 協同運作的現有 HyperPod 叢集
-
Amazon EKS 叢集管理的 IAM 許可
-
為叢集存取設定的 kubectl
安裝步驟
-
導覽至 SageMaker AI 主控台,然後前往 HyperPod 叢集 → 叢集管理。
-
選取您要安裝推論運算子的叢集。
-
導覽至推論索引標籤。針對自動設定選取快速安裝,或針對組態彈性選取自訂安裝。
-
如果選擇自訂安裝,請視需要指定現有的資源或自訂設定。
-
按一下安裝以開始自動安裝程序。
-
透過 主控台或執行下列命令來驗證安裝狀態:
kubectl get pods -n hyperpod-inference-systemaws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION
成功安裝附加元件後,您可以使用模型部署文件部署模型,或導覽至 驗證推論運算子是否運作中。
方法 2:使用 AWS CLI 安裝推論運算子
AWS CLI 安裝方法提供更多對安裝程序的控制,並適用於自動化和進階組態。
先決條件
推論運算子可讓您在 Amazon EKS 叢集上部署和管理機器學習推論端點。在安裝之前,請確定您的叢集具有所需的安全組態和支援基礎設施。完成下列步驟以設定 IAM 角色、安裝 AWS Load Balancer控制器、設定 Amazon S3 和 Amazon FSx CSI 驅動程式,以及部署 KEDA 和 cert-manager:
注意
或者,您可以使用 CloudFormation 範本來自動化先決條件設定。如需詳細資訊,請參閱使用 CloudFormation 範本建立先決條件堆疊。
連接至您的叢集並設定環境變數
在繼續之前,請確認您的 AWS 登入資料已正確設定並具有必要的許可。使用具有管理員權限的 IAM 主體和對 Amazon EKS 叢集的叢集管理員存取權來執行下列步驟。請確定您已使用 建立 HyperPod 叢集使用 Amazon EKS 協同運作建立 SageMaker HyperPod 叢集。安裝 helm、eksctl 和 kubectl 命令列公用程式。
對於 Amazon EKS 叢集的 Kubernetes 管理存取權,請開啟 Amazon EKS 主控台,然後選取您的叢集。在存取索引標籤中,選取 IAM 存取項目。如果您的 IAM 主體沒有項目,請選取建立存取項目。選取所需的 IAM 主體,並將 AmazonEKSClusterAdminPolicy 與其建立關聯。
-
設定 kubectl 以連線至 Amazon EKS 叢集協調的新建立 HyperPod 叢集。指定區域和 HyperPod 叢集名稱。
export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name> export REGION=<region> # S3 bucket where tls certificates will be uploaded export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>" # Bucket should have prefix: hyperpod-tls-* export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text | \ cut -d'/' -f2) aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION注意
如果使用的自訂儲存貯體名稱不是以 開頭
hyperpod-tls-,請將下列政策連接至您的執行角色:{ "Version": "2012-10-17", "Statement": [ { "Sid": "TLSBucketDeleteObjectsPermission", "Effect": "Allow", "Action": ["s3:DeleteObject"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"], "Condition": { "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" } } }, { "Sid": "TLSBucketGetObjectAccess", "Effect": "Allow", "Action": ["s3:GetObject"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"] }, { "Sid": "TLSBucketPutObjectAccess", "Effect": "Allow", "Action": ["s3:PutObject", "s3:PutObjectTagging"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"], "Condition": { "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" } } } ] } -
設定預設 env 變數。
HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system" -
從叢集 ARN 擷取 Amazon EKS 叢集名稱、更新本機 kubeconfig,並透過列出跨命名空間的所有 Pod 來驗證連線。
kubectl get pods --all-namespaces -
(選用) 安裝 NVIDIA 裝置外掛程式,以在叢集上啟用 GPU 支援。
# Install nvidia device plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml # Verify that GPUs are visible to k8s kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
為推論運算子設定 IAM 角色
-
收集必要的 AWS 資源識別符和 ARNs,以設定 Amazon EKS、SageMaker AI 和 IAM 元件之間的服務整合。
%%bash -x export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text) export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text) -
將 IAM OIDCidentity 提供者與您的叢集建立關聯。
eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve -
建立 HyperPod 推論運算子 IAM 角色所需的信任政策。這些政策可在 Amazon EKS、SageMaker AI 和其他 AWS 服務之間啟用安全的跨服務通訊。
%%bash -x # Create trust policy JSON cat << EOF > trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ] }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager" } } } ] } EOF -
建立推論運算子的執行角色。
aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess -
建立推論運算子資源的命名空間
kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE
建立 ALB 控制器角色
-
建立信任政策和許可政策。
# Create trust policy cat <<EOF > /tmp/alb-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json # Create the role aws iam create-role \ --role-name alb-role \ --assume-role-policy-document file:///tmp/alb-trust-policy.json # Create the policy ALB_POLICY_ARN=$(aws iam create-policy \ --policy-name $ALBController_IAM_POLICY_NAME \ --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name alb-role \ --policy-arn $ALB_POLICY_ARN -
將標籤 (
kubernetes.io.role/elb) 套用至 Amazon EKS 叢集中的所有子網路 (公有和私有)。export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text) # Add Tags aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | \ xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1 # Verify Tags are added aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text -
建立 Amazon S3 VPC 端點。
aws ec2 create-vpc-endpoint \ --region ${REGION} \ --vpc-id ${VPC_ID} \ --vpc-endpoint-type Gateway \ --service-name "com.amazonaws.${REGION}.s3" \ --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
建立 KEDA 運算子角色
-
建立信任政策和許可政策。
# Create trust policy cat <<EOF > /tmp/keda-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy cat <<EOF > /tmp/keda-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "aps:QueryMetrics", "aps:GetLabels", "aps:GetSeries", "aps:GetMetricMetadata" ], "Resource": "*" } ] } EOF # Create the role aws iam create-role \ --role-name keda-operator-role \ --assume-role-policy-document file:///tmp/keda-trust-policy.json # Create the policy KEDA_POLICY_ARN=$(aws iam create-policy \ --policy-name KedaOperatorPolicy \ --policy-document file:///tmp/keda-policy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name keda-operator-role \ --policy-arn $KEDA_POLICY_ARN -
如果您使用的是門控模型,請建立 IAM 角色來存取門控模型。
-
建立 IAM 政策。
%%bash -s $REGION JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}" cat <<EOF > /tmp/trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } }, { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EOF -
建立 IAM 角色。
# Create the role using existing trust policy aws iam create-role \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --assume-role-policy-document file:///tmp/trust-policy.json aws iam attach-role-policy \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccessJUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0] !echo $JUMPSTART_GATED_ROLE_ARN
-
安裝相依性 EKS 附加元件
安裝推論運算子之前,您必須在叢集上安裝下列必要的 EKS 附加元件。如果缺少任何這些相依性,推論運算子將無法安裝 。每個附加元件都有與推論附加元件相容的最低版本需求。
重要
在嘗試安裝推論運算子之前,先安裝所有相依性附加元件。缺少相依性會導致安裝失敗,並顯示特定錯誤訊息。
必要的附加元件
-
Amazon S3 掛載點 CSI 驅動程式 (最低版本:v1.14.1-eksbuild.1)
在推論工作負載中將 S3 儲存貯體掛載為持久性磁碟區時需要。
aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name aws-mountpoint-s3-csi-driver \ --region $REGION \ --service-account-role-arn $S3_CSI_ROLE_ARN如需包含必要 IAM 許可的詳細安裝說明,請參閱 Amazon S3 CSI 驅動程式掛載點。
-
Amazon FSx CSI 驅動程式 (最低版本:v1.6.0-eksbuild.1)
為高效能模型儲存體掛載 FSx 檔案系統時需要。
aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name aws-fsx-csi-driver \ --region $REGION \ --service-account-role-arn $FSX_CSI_ROLE_ARN如需包含必要 IAM 許可的詳細安裝說明,請參閱 Amazon FSx for Lustre CSI 驅動程式。
-
Metrics Server (最低版本:v0.7.2-eksbuild.4)
自動調整規模功能和資源指標集合的必要項目。
aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name metrics-server \ --region $REGION如需詳細安裝指示,請參閱指標伺服器。
-
Cert Manager (最低版本:v1.18.2-eksbuild.2)
安全推論端點的 TLS 憑證管理需要。
aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name cert-manager \ --region $REGION如需詳細安裝說明,請參閱 cert-manager。
驗證附加元件安裝
安裝必要的附加元件後,請確認它們執行正確:
# Check add-on status aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION # Verify pods are running kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)" kubectl get pods -n cert-manager
所有附加元件都應該顯示「ACTIVE」狀態,而且所有 Pod 都應該處於「執行中」狀態,才能繼續安裝推論運算子。
注意
如果您使用快速設定或自訂設定選項建立 HyperPod 叢集,則可能已安裝 FSx CSI 驅動程式和 Cert Manager。使用上述命令驗證其是否存在。
使用 EKS 附加元件安裝推論運算子
EKS 附加元件安裝方法提供具有自動更新和整合相依性驗證的受管體驗。這是安裝推論運算子的建議方法。
安裝推論運算子附加元件
-
收集所有必要ARNs 並建立組態檔案,以準備附加元件組態:
# Gather required ARNs export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text) export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text) export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text) export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text) # Verify all ARNs are set correctly echo "Execution Role ARN: $EXECUTION_ROLE_ARN" echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN" echo "KEDA Role ARN: $KEDA_ROLE_ARN" echo "ALB Role ARN: $ALB_ROLE_ARN" echo "TLS S3 Bucket: $BUCKET_NAME" -
使用所有必要設定建立附加元件組態檔案:
cat > addon-config.json << EOF { "executionRoleArn": "$EXECUTION_ROLE_ARN", "tlsCertificateS3Bucket": "$BUCKET_NAME", "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN", "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN", "alb": { "serviceAccount": { "create": true, "roleArn": "$ALB_ROLE_ARN" } }, "keda": { "auth": { "aws": { "irsa": { "roleArn": "$KEDA_ROLE_ARN" } } } } } EOF # Verify the configuration file cat addon-config.json -
安裝推論運算子附加元件 (最低版本:v1.0.0-eksbuild.1):
aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name amazon-sagemaker-hyperpod-inference \ --configuration-values file://addon-config.json \ --region $REGION -
監控安裝進度並驗證是否成功完成:
# Check installation status (repeat until status shows "ACTIVE") aws eks describe-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name amazon-sagemaker-hyperpod-inference \ --region $REGION \ --query "addon.{Status:status,Health:health}" \ --output table # Verify pods are running kubectl get pods -n hyperpod-inference-system # Check operator logs for any issues kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
如需安裝問題的詳細疑難排解,請參閱 HyperPod 推論故障診斷。
若要驗證推論運算子是否正常運作,請繼續 驗證推論運算子是否運作中。
使用 CloudFormation 範本建立先決條件堆疊
除了手動設定先決條件外,您也可以使用 CloudFormation 範本自動建立推論運算子所需的 IAM 角色和政策。
-
設定輸入變數。將預留位置值取代為您自己的值:
#!/bin/bash set -e # ===== INPUT VARIABLES ===== HP_CLUSTER_NAME="my-hyperpod-cluster" # Replace with your HyperPod cluster name REGION="us-east-1" # Replace with your AWS region PREFIX="my-prefix" # Replace with your resource prefix SHORT_PREFIX="12a34d56" # Replace with your short prefix (maximum 8 characters) CREATE_DOMAIN="true" # Set to "false" if you don't need a SageMaker Studio domain STACK_NAME="hyperpod-inference-prerequisites" # Replace with your stack name TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml" -
衍生叢集和網路資訊:
# ===== DERIVE EKS CLUSTER NAME ===== EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}') echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME" # ===== GET VPC AND OIDC ===== VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text) echo "VPC_ID=$VPC_ID" OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||') echo "OIDC_PROVIDER=$OIDC_PROVIDER" # ===== GET PRIVATE ROUTE TABLES ===== ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text) EKS_PRIVATE_ROUTE_TABLES="" for rtb in $ALL_ROUTE_TABLES; do HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null) if [ -z "$HAS_IGW" ]; then EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb" fi done echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES" # ===== CHECK S3 VPC ENDPOINT ===== S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text) CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false") echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK" # ===== GET HYPERPOD DETAILS ===== HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text) echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN" # ===== GET DEFAULT VPC FOR DOMAIN ===== DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text) echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID" DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text) echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS" # ===== GET INSTANCE GROUPS ===== INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')") echo "INSTANCE_GROUPS=$INSTANCE_GROUPS" -
建立參數檔案並部署堆疊:
# ===== CREATE PARAMETERS JSON ===== cat > /tmp/cfn-params.json << EOF [ {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"}, {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"}, {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"}, {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"}, {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"}, {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"}, {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"}, {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"}, {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"}, {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"}, {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"}, {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"}, {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"}, {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"} ] EOF echo -e "\n===== CREATING CLOUDFORMATION STACK =====" aws cloudformation create-stack \ --region $REGION \ --stack-name $STACK_NAME \ --template-url $TEMPLATE_URL \ --parameters file:///tmp/cfn-params.json \ --capabilities CAPABILITY_NAMED_IAM -
監控堆疊建立狀態:
aws cloudformation describe-stacks \ --stack-name $STACK_NAME \ --region $REGION \ --query 'Stacks[0].StackStatus' -
成功建立堆疊後,請擷取輸出值以用於推論運算子安裝:
aws cloudformation describe-stacks \ --stack-name $STACK_NAME \ --region $REGION \ --query 'Stacks[0].Outputs'
建立 CloudFormation 堆疊之後,請繼續使用 EKS 附加元件安裝推論運算子安裝推論運算子。
方法 3:Helm Chart 安裝
如果您需要更多控制安裝組態,或您的區域無法使用 EKS 附加元件,請使用此方法。
先決條件
在繼續之前,請確認您的 AWS 登入資料已正確設定並具有必要的許可。IAM 主體需要執行下列步驟,該主體具有 Amazon EKS 叢集的管理員權限和叢集管理員存取權。驗證您是否已使用 使用 Amazon EKS 協同運作建立 SageMaker HyperPod 叢集 建立 HyperPod 叢集。驗證您是否已安裝 helm、eksctl 和 kubectl 命令列公用程式。
如需 Amazon EKS 叢集的 Kubernetes 管理存取權,請前往 Amazon EKS 主控台,然後選取您正在使用的叢集。查看存取索引標籤,然後選取 IAM 存取項目。如果您的 IAM 主體沒有項目,請選取建立存取項目。然後選取所需的 IAM 主體,並將 AmazonEKSClusterAdminPolicy 與其建立關聯。
-
設定 kubectl 以連線至 Amazon EKS 叢集協調的新建立 HyperPod 叢集。指定區域和 HyperPod 叢集名稱。
export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name> export REGION=<region> # S3 bucket where tls certificates will be uploaded BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text | \ cut -d'/' -f2) aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION -
設定預設 env 變數。
LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME" LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME" S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME" S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME" PRESIGNED_URL_ACCESS_POLICY_NAME="PresignedUrlAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ACCESS_POLICY_NAME="HyperpodInferenceAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller" HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system" JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME" FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME" -
從叢集 ARN 擷取 Amazon EKS 叢集名稱、更新本機 kubeconfig,並透過列出跨命名空間的所有 Pod 來驗證連線。
kubectl get pods --all-namespaces -
(選用) 安裝 NVIDIA 裝置外掛程式,以在叢集上啟用 GPU 支援。
#Install nvidia device plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml # Verify that GPUs are visible to k8s kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
準備您的環境以進行推論運算子安裝
-
收集必要的 AWS 資源識別符和 ARNs,以設定 Amazon EKS、SageMaker AI 和 IAM 元件之間的服務整合。
%%bash -x export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text) export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text) -
將 IAM OIDCidentity 提供者與您的叢集建立關聯。
eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve -
建立 HyperPod 推論運算子 IAM 角色所需的信任政策和許可政策 JSON 文件。這些政策可在 Amazon EKS、SageMaker AI 和其他 AWS 服務之間啟用安全的跨服務通訊。
bash # Create trust policy JSON cat << EOF > trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ] }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager" } } } ] } EOF # Create permission policy JSON cat << EOF > permission-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "S3Access", "Effect": "Allow", "Action": [ "s3:Get*", "s3:List*", "s3:Describe*", "s3:PutObject" ], "Resource": [ "*" ] }, { "Sid": "ECRAccess", "Effect": "Allow", "Action": [ "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview", "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings" ], "Resource": [ "*" ] }, { "Sid": "EC2Access", "Effect": "Allow", "Action": [ "ec2:AssignPrivateIpAddresses", "ec2:AttachNetworkInterface", "ec2:CreateNetworkInterface", "ec2:DeleteNetworkInterface", "ec2:DescribeInstances", "ec2:DescribeTags", "ec2:DescribeNetworkInterfaces", "ec2:DescribeInstanceTypes", "ec2:DescribeSubnets", "ec2:DetachNetworkInterface", "ec2:ModifyNetworkInterfaceAttribute", "ec2:UnassignPrivateIpAddresses", "ec2:CreateTags", "ec2:DescribeInstances", "ec2:DescribeInstanceTypes", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVolumes", "ec2:DescribeVolumesModifications", "ec2:DescribeVpcs", "ec2:CreateVpcEndpointServiceConfiguration", "ec2:DeleteVpcEndpointServiceConfigurations", "ec2:DescribeVpcEndpointServiceConfigurations", "ec2:ModifyVpcEndpointServicePermissions" ], "Resource": [ "*" ] }, { "Sid": "EKSAuthAccess", "Effect": "Allow", "Action": [ "eks-auth:AssumeRoleForPodIdentity" ], "Resource": [ "*" ] }, { "Sid": "EKSAccess", "Effect": "Allow", "Action": [ "eks:AssociateAccessPolicy", "eks:Describe*", "eks:List*", "eks:AccessKubernetesApi" ], "Resource": [ "*" ] }, { "Sid": "ApiGatewayAccess", "Effect": "Allow", "Action": [ "apigateway:POST", "apigateway:GET", "apigateway:PUT", "apigateway:PATCH", "apigateway:DELETE", "apigateway:UpdateRestApiPolicy" ], "Resource": [ "arn:aws:apigateway:*::/vpclinks", "arn:aws:apigateway:*::/vpclinks/*", "arn:aws:apigateway:*::/restapis", "arn:aws:apigateway:*::/restapis/*" ] }, { "Sid": "ElasticLoadBalancingAccess", "Effect": "Allow", "Action": [ "elasticloadbalancing:CreateLoadBalancer", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeLoadBalancerAttributes", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeListenerCertificates", "elasticloadbalancing:DescribeSSLPolicies", "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTargetGroups", "elasticloadbalancing:DescribeTargetGroupAttributes", "elasticloadbalancing:DescribeTargetHealth", "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTrustStores", "elasticloadbalancing:DescribeListenerAttributes" ], "Resource": [ "*" ] }, { "Sid": "SageMakerAccess", "Effect": "Allow", "Action": [ "sagemaker:*" ], "Resource": [ "*" ] }, { "Sid": "AllowPassRoleToSageMaker", "Effect": "Allow", "Action": [ "iam:PassRole" ], "Resource": "arn:aws:iam::*:role/*", "Condition": { "StringEquals": { "iam:PassedToService": "sagemaker.amazonaws.com" } } }, { "Sid": "AcmAccess", "Effect": "Allow", "Action": [ "acm:ImportCertificate", "acm:DeleteCertificate" ], "Resource": [ "*" ] } ] } EOF -
建立推論運算子的執行角色。
aws iam create-policy --policy-name $HYPERPOD_INFERENCE_ACCESS_POLICY_NAME --policy-document file://permission-policy.json export policy_arn="arn:aws:iam::${ACCOUNT_ID}:policy/$HYPERPOD_INFERENCE_ACCESS_POLICY_NAME"aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json aws iam put-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-name InferenceOperatorInlinePolicy --policy-document file://permission-policy.json -
下載並建立 AWS Load Balancer控制器所需的 IAM 政策,以在 EKS 叢集中管理 Application Load Balancer 和 Network Load Balancer。
%%bash -x export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json -
建立將 Kubernetes 服務帳戶與 IAM 政策連結的 IAM 服務帳戶,讓 AWS Load Balancer控制器能夠透過 IRSA (服務帳戶的 IAM 角色) 取得必要的 AWS 許可。
%%bash -x export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME" # Create IAM service account with gathered values eksctl create iamserviceaccount \ --approve \ --override-existing-serviceaccounts \ --name=aws-load-balancer-controller \ --namespace=kube-system \ --cluster=$EKS_CLUSTER_NAME \ --attach-policy-arn=$ALB_POLICY_ARN \ --region=$REGION # Print the values for verification echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Region: $REGION" echo "Policy ARN: $ALB_POLICY_ARN" -
將標籤 (
kubernetes.io.role/elb) 套用至 Amazon EKS 叢集中的所有子網路 (公有和私有)。export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text) # Add Tags aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | \ xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1 # Verify Tags are added aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text -
為 KEDA 和 Cert Manager 建立命名空間。
kubectl create namespace keda kubectl create namespace cert-manager -
建立 Amazon S3 VPC 端點。
aws ec2 create-vpc-endpoint \ --vpc-id ${VPC_ID} \ --vpc-endpoint-type Gateway \ --service-name "com.amazonaws.${REGION}.s3" \ --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ') -
設定 S3 儲存體存取:
-
建立 IAM 政策,授予使用適用於 Amazon S3 的掛載點所需的 S3 許可,讓檔案系統可以從叢集內存取 S3 儲存貯體。
%%bash -x export S3_CSI_BUCKET_NAME=“<bucketname_for_mounting_through_filesystem>” cat <<EOF> s3accesspolicy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "MountpointAccess", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::${S3_CSI_BUCKET_NAME}", "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*" ] } ] } EOF aws iam create-policy \ --policy-name S3MountpointAccessPolicy \ --policy-document file://s3accesspolicy.json cat <<EOF> s3accesstrustpolicy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringEquals": { "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:${s3-csi-driver-sa}" } } } ] } EOF aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy" -
(選用) 為 Amazon S3 CSI 驅動程式建立 IAM 服務帳戶。Amazon S3 CSI 驅動程式需要具有適當許可的 IAM 服務帳戶,才能將 S3 儲存貯體掛載為 Amazon EKS 叢集中的持久性磁碟區。此步驟會使用必要的 S3 存取政策建立必要的 IAM 角色和 Kubernetes 服務帳戶。
%%bash -x export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION" export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' | tr -d '"') eksctl create iamserviceaccount \ --name s3-csi-driver-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn $S3_CSI_POLICY_ARN \ --approve \ --role-name $S3_CSI_ROLE_NAME \ --region $REGION kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite -
(選用) 安裝 Amazon S3 CSI 驅動程式附加元件。此驅動程式可讓您的 Pod 將 S3 儲存貯體掛載為持久性磁碟區,讓您可從 Kubernetes 工作負載內直接存取 S3 儲存體。
%%bash -x export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text) eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force -
(選用) 為 S3 儲存體建立持久性磁碟區宣告 (PVC)。此 PVC 可讓您的 Pod 請求和使用 S3 儲存體,好像其是傳統檔案系統一樣。
%%bash -x cat <<EOF> pvc_s3.yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: s3-claim spec: accessModes: - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany storageClassName: "" # required for static provisioning resources: requests: storage: 1200Gi # ignored, required volumeName: s3-pv EOF kubectl apply -f pvc_s3.yaml
-
-
(選用) 設定 FSx 儲存體存取。為 Amazon FSx CSI 驅動程式建立 IAM 服務帳戶。FSx CSI 驅動程式將使用此服務帳戶,代表您的叢集與 Amazon FSx 服務互動。
%%bash -x eksctl create iamserviceaccount \ --name fsx-csi-controller-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \ --approve \ --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \ --region $REGION
建立 KEDA 運算子角色
-
建立信任政策和許可政策。
# Create trust policy cat <<EOF > /tmp/keda-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy cat <<EOF > /tmp/keda-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "aps:QueryMetrics", "aps:GetLabels", "aps:GetSeries", "aps:GetMetricMetadata" ], "Resource": "*" } ] } EOF # Create the role aws iam create-role \ --role-name keda-operator-role \ --assume-role-policy-document file:///tmp/keda-trust-policy.json # Create the policy KEDA_POLICY_ARN=$(aws iam create-policy \ --policy-name KedaOperatorPolicy \ --policy-document file:///tmp/keda-policy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name keda-operator-role \ --policy-arn $KEDA_POLICY_ARN -
如果您使用的是門控模型,請建立 IAM 角色來存取門控模型。
-
建立 IAM 政策。
%%bash -s $REGION cat <<EOF> /tmp/presignedurl-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "CreatePresignedUrlAccess", "Effect": "Allow", "Action": [ "sagemaker:CreateHubContentPresignedUrls" ], "Resource": [ "arn:aws:sagemaker:$1:aws:hub/SageMakerPublicHub", "arn:aws:sagemaker:$1:aws:hub-content/SageMakerPublicHub/*/*" ] } ] } EOF aws iam create-policy --policy-name PresignedUrlAccessPolicy --policy-document file:///tmp/presignedurl-policy.json JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}" cat <<EOF > /tmp/trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-controller-manager", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } }, { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EOF -
建立 IAM 角色。
# Create the role using existing trust policy aws iam create-role \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --assume-role-policy-document file:///tmp/trust-policy.json # Attach the existing PresignedUrlAccessPolicy to the role aws iam attach-role-policy \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/PresignedUrlAccessPolicyJUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0] !echo $JUMPSTART_GATED_ROLE_ARN -
將
SageMakerFullAccess政策新增至執行角色。aws iam attach-role-policy --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --policy-arn=arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
-
安裝推論運算子
-
安裝 HyperPod 推論運算子。此步驟會收集必要的 AWS 資源識別碼,並使用適當的組態參數產生 Helm 安裝命令。
從 https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart
存取 Helm Chart。 git clone https://github.com/aws/sagemaker-hyperpod-cli cd sagemaker-hyperpod-cli cd helm_chart/HyperPodHelmChart helm dependencies update charts/inference-operator%%bash -x HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text) echo $HYPERPOD_INFERENCE_ROLE_ARN S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text) echo $S3_CSI_ROLE_ARN HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn") # Verify values echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN" echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN" # Run the the HyperPod inference operator installation. helm install hyperpod-inference-operator charts/inference-operator \ -n kube-system \ --set region=$REGION \ --set eksClusterName=$EKS_CLUSTER_NAME \ --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \ --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \ --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \ --set s3.node.serviceAccount.create=false \ --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \ --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \ --set alb.region=$REGION \ --set alb.clusterName=$EKS_CLUSTER_NAME \ --set alb.vpcId=$VPC_ID # For JumpStart Gated Model usage, Add # --set jumpstartGatedModelDownloadRoleArn=$UMPSTART_GATED_ROLE_ARN -
設定服務帳戶註釋進行 IAM 整合。此註釋可讓運算子的服務帳戶取得必要的 IAM 許可,以管理推論端點並與 AWS 服務互動。
%%bash -x EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///') # Annotate service account kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \ -n hyperpod-inference-system \ eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \ --overwrite
驗證推論運算子是否運作中
請依照下列步驟,透過部署和測試簡單的模型,確認您的推論運算子安裝是否正常運作。
部署測試模型以驗證運算子
-
建立模型部署組態檔案。這會建立 Kubernetes 資訊清單檔案,為 HyperPod 推論運算子定義 JumpStart 模型部署。
cat <<EOF>> simple_model_install.yaml --- apiVersion: inference.sagemaker.aws.amazon.com/v1 kind: JumpStartModel metadata: name: testing-deployment-bert namespace: default spec: model: modelId: "huggingface-eqa-bert-base-cased" sageMakerEndpoint: name: "hp-inf-ep-for-testing" server: instanceType: "ml.c5.2xlarge" environmentVariables: - name: SAMPLE_ENV_VAR value: "sample_value" maxDeployTimeInSeconds: 1800 EOF -
部署模型並清除組態檔案。
kubectl create -f simple_model_install.yaml rm -f simple_model_install.yaml -
驗證服務帳戶組態,以確保操作員可以取得 AWS 許可。
# Get the service account details kubectl get serviceaccount -n hyperpod-inference-system # Check if the service account has the AWS annotations kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
設定部署設定 (如果使用 Studio UI)
-
在部署設定下檢閱建議的執行個體類型。
-
如果修改執行個體類型,請確保與您的 HyperPod 叢集相容。如果無法使用相容的執行個體,請聯絡您的管理員。
-
對於已啟用 MIG 的 GPU 分割執行個體,請從可用的 MIG 設定檔中選取適當的 GPU 分割區,以最佳化 GPU 使用率。如需詳細資訊,請參閱在 Amazon SageMaker HyperPod 中使用 GPU 分割區。
-
如果使用任務控管,請設定模型部署先佔功能的優先順序設定。
-
輸入管理員提供的命名空間。如有需要,請聯絡您的管理員以取得正確的命名空間。
(選用) 透過 SageMaker AI Studio Classic 中的 JumpStart UI 設定使用者存取
如需為 Studio Classic 使用者設定 SageMaker HyperPod 存取,以及為資料科學家使用者設定精細 Kubernetes RBAC 許可的更多背景,請參閱在 Studio 中設定 Amazon EKS 叢集和設定 Kubernetes 角色型存取控制。
-
識別資料科學家使用者將用來從 SageMaker AI Studio Classic 管理模型並將其部署至 SageMaker HyperPod 的 IAM 角色。這通常是 Studio Classic 使用者的使用者設定檔執行角色或網域執行角色。
%%bash -x export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>" export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy" export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text) export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace" -
連接啟用模型部署存取的身分政策。
%%bash -x # Create access policy cat << EOF > hyperpod-deployment-ui-access-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "DescribeHyerpodClusterPermissions", "Effect": "Allow", "Action": [ "sagemaker:DescribeCluster" ], "Resource": "$HYPERPOD_CLUSTER_ARN" }, { "Sid": "UseEksClusterPermissions", "Effect": "Allow", "Action": [ "eks:DescribeCluster", "eks:AccessKubernetesApi", "eks:MutateViaKubernetesApi", "eks:DescribeAddon" ], "Resource": "$EKS_CLUSTER_ARN" }, { "Sid": "ListPermission", "Effect": "Allow", "Action": [ "sagemaker:ListClusters", "sagemaker:ListEndpoints" ], "Resource": "*" }, { "Sid": "SageMakerEndpointAccess", "Effect": "Allow", "Action": [ "sagemaker:DescribeEndpoint", "sagemaker:InvokeEndpoint" ], "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*" } ] } EOF aws iam put-role-policy --role-name DATASCIENTIST_ROLE_NAME --policy-name HyperPodDeploymentUIAccessInlinePolicy --policy-document file://hyperpod-deployment-ui-access-policy.json -
建立一個 EKS 存取項目,讓使用者將其對應至 kubernetes 群組。
%%bash -x aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \ --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \ --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]' -
為使用者建立 Kubernetes RBAC 政策。
%%bash -x cat << EOF > cluster_level_config.yaml kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: hyperpod-scientist-user-cluster-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["list"] - apiGroups: [""] resources: ["nodes"] verbs: ["list"] - apiGroups: [""] resources: ["namespaces"] verbs: ["list"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: hyperpod-scientist-user-cluster-role-binding subjects: - kind: Group name: hyperpod-scientist-user-cluster-level apiGroup: rbac.authorization.k8s.io roleRef: kind: ClusterRole name: hyperpod-scientist-user-cluster-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f cluster_level_config.yaml cat << EOF > namespace_level_role.yaml kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["create", "get"] - apiGroups: [""] resources: ["nodes"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/exec"] verbs: ["get", "create"] - apiGroups: ["kubeflow.org"] resources: ["pytorchjobs", "pytorchjobs/status"] verbs: ["get", "list", "create", "delete", "update", "describe"] - apiGroups: [""] resources: ["configmaps"] verbs: ["create", "update", "get", "list", "delete"] - apiGroups: [""] resources: ["secrets"] verbs: ["create", "get", "list", "delete"] - apiGroups: [ "inference.sagemaker.aws.amazon.com" ] resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ] verbs: [ "get", "list", "create", "delete", "update", "describe" ] - apiGroups: [ "autoscaling" ] resources: [ "horizontalpodautoscalers" ] verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role-binding subjects: - kind: Group name: hyperpod-scientist-user-namespace-level apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: hyperpod-scientist-user-namespace-level-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f namespace_level_role.yaml