Setting up your HyperPod clusters for model deployment - Amazon SageMaker AI


Setting up your HyperPod clusters for model deployment

This guide explains how to enable inference capabilities on Amazon SageMaker HyperPod clusters. You will set up the infrastructure, permissions, and operators that machine learning engineers need to deploy and manage inference endpoints.

Note

To create a cluster with the inference operator preinstalled, see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration. To install the inference operator on an existing cluster, continue with the following procedures.

You can install the inference operator through the SageMaker AI console for a streamlined experience, or with the AWS CLI for more control. This guide covers both installation methods.

Method 1: Install the HyperPod inference add-on through the SageMaker AI console (recommended)

The SageMaker AI console provides the most streamlined experience, with two installation options:

  • Quick install: Automatically creates all required resources with optimized defaults, including IAM roles, Amazon S3 buckets, and dependency add-ons. A new Studio domain is created with the permissions needed to deploy JumpStart models to the associated cluster. This option is ideal for getting started quickly with minimal configuration decisions.

  • Custom install: Provides the flexibility to specify existing resources or custom configurations while maintaining a one-click experience. You can reuse existing IAM roles, Amazon S3 buckets, or dependency add-ons to meet your organization's requirements.

Prerequisites

  • An existing HyperPod cluster with Amazon EKS orchestration

  • IAM permissions for Amazon EKS cluster administration

  • kubectl configured for cluster access

Installation steps

  1. Navigate to the SageMaker AI console and go to HyperPod clusters, then Cluster management.

  2. Select the cluster where you want to install the inference operator.

  3. Navigate to the Inference tab. Select Quick install for automatic setup, or Custom install for configuration flexibility.

  4. If you chose Custom install, specify existing resources or custom settings as needed.

  5. Choose Install to start the automated installation process.

  6. Verify the installation status through the console or by running the following commands:

    kubectl get pods -n hyperpod-inference-system
    aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION

After the add-on installs successfully, you can deploy models by following the model deployment documentation, or proceed to Verifying that the inference operator is operational.

Method 2: Install the inference operator with the AWS CLI

The AWS CLI installation method provides more control over the installation process and is suitable for automation and advanced configurations.

Prerequisites

The inference operator lets you deploy and manage machine learning inference endpoints on your Amazon EKS cluster. Before installing it, make sure your cluster has the required security configuration and supporting infrastructure. Complete the following steps to set up IAM roles, install the AWS Load Balancer Controller, configure the Amazon S3 and Amazon FSx CSI drivers, and deploy KEDA and cert-manager:

Note

Alternatively, you can use a CloudFormation template to automate the prerequisite setup. For more information, see Creating the prerequisite stack with a CloudFormation template.

Connect to your cluster and set environment variables

Before proceeding, verify that your AWS credentials are configured correctly and have the necessary permissions. Perform the following steps with an IAM principal that has administrator privileges and cluster administrator access to the Amazon EKS cluster. Make sure you have created a SageMaker HyperPod cluster with Amazon EKS orchestration by following Creating a HyperPod cluster with Amazon EKS orchestration. Install the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, open the Amazon EKS console and select your cluster. On the Access tab, select IAM access entries. If your IAM principal has no entry, select Create access entry. Select the desired IAM principal and associate the AmazonEKSClusterAdminPolicy with it.

  1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by the Amazon EKS cluster. Specify the Region and the HyperPod cluster name.

    export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
    export REGION=<region>

    # S3 bucket where TLS certificates will be uploaded
    # Bucket should have prefix: hyperpod-tls-*
    export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>"

    export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
      --query 'Orchestrator.Eks.ClusterArn' --output text | \
      cut -d'/' -f2)

    aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
    Note

    If you use a custom bucket name that does not begin with hyperpod-tls-, attach the following policy to your execution role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "TLSBucketDeleteObjectsPermission",
          "Effect": "Allow",
          "Action": ["s3:DeleteObject"],
          "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
          "Condition": {
            "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" }
          }
        },
        {
          "Sid": "TLSBucketGetObjectAccess",
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"]
        },
        {
          "Sid": "TLSBucketPutObjectAccess",
          "Effect": "Allow",
          "Action": ["s3:PutObject", "s3:PutObjectTagging"],
          "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
          "Condition": {
            "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" }
          }
        }
      ]
    }
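    The `cut -d'/' -f2` step above extracts the EKS cluster name from the orchestrator ARN, which always has the form `arn:aws:eks:REGION:ACCOUNT:cluster/NAME`. A quick illustration with a made-up sample ARN (the account ID and cluster name are hypothetical):

    ```shell
    # Hypothetical sample ARN showing how the cluster name is derived
    ARN="arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
    NAME=$(echo "$ARN" | cut -d'/' -f2)
    echo "$NAME"   # prints: my-eks-cluster
    ```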
  2. Set default environment variables.

    HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME"
    HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system"
  3. Retrieve the Amazon EKS cluster name from the cluster ARN, update your local kubeconfig, and verify connectivity by listing all pods across namespaces.

    kubectl get pods --all-namespaces
  4. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

    # Install the NVIDIA device plugin
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

    # Verify that GPUs are visible to Kubernetes
    kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu

Set up an IAM role for the inference operator

  1. Gather the required AWS resource identifiers and ARNs to configure the service integration between the Amazon EKS, SageMaker AI, and IAM components.

    export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
    export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
    export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
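    The OIDC_ID extraction relies on the issuer URL's fixed shape, `https://oidc.eks.REGION.amazonaws.com/id/ID`, where the fifth `/`-separated field is the provider ID. A sketch with a sample issuer value (the ID shown is hypothetical):

    ```shell
    # Hypothetical issuer URL; field 5 after splitting on '/' is the OIDC ID
    ISSUER="https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
    OIDC_ID=$(echo "$ISSUER" | cut -d '/' -f 5)
    echo "$OIDC_ID"   # prints: EXAMPLED539D4633E53DE1B71EXAMPLE
    ```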
  2. Associate an IAM OIDC identity provider with your cluster.

    eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
  3. Create the trust policy required for the HyperPod inference operator IAM role. This policy enables secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

    # Create trust policy JSON
    cat << EOF > trust-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": [ "sagemaker.amazonaws.com" ] },
          "Action": "sts:AssumeRole"
        },
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringLike": {
              "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
              "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
            }
          }
        }
      ]
    }
    EOF
  4. Create the execution role for the inference operator.

    aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
    aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess

  5. Create the namespace for the inference operator resources.

    kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE

Create the ALB controller role

  1. Create the trust policy and permissions policy.

    # Create trust policy
    cat <<EOF > /tmp/alb-trust-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringLike": {
              "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller",
              "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
            }
          }
        }
      ]
    }
    EOF

    # Download the permissions policy
    export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
    curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json

    # Create the role
    aws iam create-role \
      --role-name alb-role \
      --assume-role-policy-document file:///tmp/alb-trust-policy.json

    # Create the policy
    ALB_POLICY_ARN=$(aws iam create-policy \
      --policy-name $ALBController_IAM_POLICY_NAME \
      --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \
      --query 'Policy.Arn' \
      --output text)

    # Attach the policy to the role
    aws iam attach-role-policy \
      --role-name alb-role \
      --policy-arn $ALB_POLICY_ARN

  2. Apply the kubernetes.io/role/elb tag to all subnets (public and private) in your Amazon EKS cluster.

    export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)

    # Add tags
    aws ec2 describe-subnets \
      --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
      --query 'Subnets[*].SubnetId' --output text | \
      tr '\t' '\n' | \
      xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1

    # Verify the tags were added
    aws ec2 describe-subnets \
      --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
      --query 'Subnets[*].SubnetId' --output text | \
      tr '\t' '\n' | \
      xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
  3. Create an Amazon S3 VPC endpoint.

    aws ec2 create-vpc-endpoint \
      --region ${REGION} \
      --vpc-id ${VPC_ID} \
      --vpc-endpoint-type Gateway \
      --service-name "com.amazonaws.${REGION}.s3" \
      --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')

Create the KEDA operator role

  1. Create the trust policy and permissions policy.

    # Create trust policy
    cat <<EOF > /tmp/keda-trust-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringLike": {
              "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator",
              "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
            }
          }
        }
      ]
    }
    EOF

    # Create permissions policy
    cat <<EOF > /tmp/keda-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricData",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "aps:QueryMetrics",
            "aps:GetLabels",
            "aps:GetSeries",
            "aps:GetMetricMetadata"
          ],
          "Resource": "*"
        }
      ]
    }
    EOF

    # Create the role
    aws iam create-role \
      --role-name keda-operator-role \
      --assume-role-policy-document file:///tmp/keda-trust-policy.json

    # Create the policy
    KEDA_POLICY_ARN=$(aws iam create-policy \
      --policy-name KedaOperatorPolicy \
      --policy-document file:///tmp/keda-policy.json \
      --query 'Policy.Arn' \
      --output text)

    # Attach the policy to the role
    aws iam attach-role-policy \
      --role-name keda-operator-role \
      --policy-arn $KEDA_POLICY_ARN
  2. If you are using gated models, create an IAM role to access them.

    1. Create the IAM trust policy.

      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"

      cat <<EOF > /tmp/trust-policy.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
              "StringLike": {
                "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
              }
            }
          },
          {
            "Effect": "Allow",
            "Principal": { "Service": "sagemaker.amazonaws.com" },
            "Action": "sts:AssumeRole"
          }
        ]
      }
      EOF

    2. Create the IAM role.

      # Create the role using the trust policy above
      aws iam create-role \
        --role-name $JUMPSTART_GATED_ROLE_NAME \
        --assume-role-policy-document file:///tmp/trust-policy.json

      aws iam attach-role-policy \
        --role-name $JUMPSTART_GATED_ROLE_NAME \
        --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess

      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN

Install the dependency EKS add-ons

Before installing the inference operator, you must install the following required EKS add-ons on your cluster. If any of these dependencies are missing, the inference operator will fail to install. Each add-on has a minimum version requirement for compatibility with the inference add-on.

Important

Install all dependency add-ons before attempting to install the inference operator. Missing dependencies cause the installation to fail with specific error messages.

Required add-ons

  1. Mountpoint for Amazon S3 CSI driver (minimum version: v1.14.1-eksbuild.1)

    Required for mounting S3 buckets as persistent volumes in inference workloads.

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name aws-mountpoint-s3-csi-driver \
      --region $REGION \
      --service-account-role-arn $S3_CSI_ROLE_ARN

    For detailed installation instructions, including the required IAM permissions, see Mountpoint for Amazon S3 CSI driver.

  2. Amazon FSx CSI driver (minimum version: v1.6.0-eksbuild.1)

    Required for mounting FSx file systems for high-performance model storage.

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name aws-fsx-csi-driver \
      --region $REGION \
      --service-account-role-arn $FSX_CSI_ROLE_ARN

    For detailed installation instructions, including the required IAM permissions, see Amazon FSx for Lustre CSI driver.

  3. Metrics Server (minimum version: v0.7.2-eksbuild.4)

    Required for autoscaling functionality and resource metrics collection.

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name metrics-server \
      --region $REGION

    For detailed installation instructions, see Metrics Server.

  4. Cert Manager (minimum version: v1.18.2-eksbuild.2)

    Required for TLS certificate management for secure inference endpoints.

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name cert-manager \
      --region $REGION

    For detailed installation instructions, see cert-manager.

Verify the add-on installation

After installing the required add-ons, verify that they are running correctly:

# Check add-on status
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager

All add-ons should show an "ACTIVE" status, and all pods should be in the "Running" state, before you continue with the inference operator installation.

Note

If you created your HyperPod cluster with the quick setup or custom setup options, the FSx CSI driver and Cert Manager might already be installed. Use the preceding commands to verify their presence.

Install the inference operator with the EKS add-on

The EKS add-on installation method provides a managed experience with automatic updates and integrated dependency validation. This is the recommended way to install the inference operator.

Install the inference operator add-on
  1. Prepare the add-on configuration by gathering all required ARNs and creating the configuration file:

    # Gather required ARNs
    export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
    export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text)
    export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text)
    export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text)

    # Verify all ARNs are set correctly
    echo "Execution Role ARN: $EXECUTION_ROLE_ARN"
    echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN"
    echo "KEDA Role ARN: $KEDA_ROLE_ARN"
    echo "ALB Role ARN: $ALB_ROLE_ARN"
    echo "TLS S3 Bucket: $BUCKET_NAME"
  2. Create the add-on configuration file with all required settings:

    cat > addon-config.json << EOF
    {
      "executionRoleArn": "$EXECUTION_ROLE_ARN",
      "tlsCertificateS3Bucket": "$BUCKET_NAME",
      "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
      "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN",
      "alb": {
        "serviceAccount": {
          "create": true,
          "roleArn": "$ALB_ROLE_ARN"
        }
      },
      "keda": {
        "auth": {
          "aws": {
            "irsa": {
              "roleArn": "$KEDA_ROLE_ARN"
            }
          }
        }
      }
    }
    EOF

    # Verify the configuration file
    cat addon-config.json
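    Before running create-addon, it can help to confirm that the generated file is well-formed JSON; an unset variable or stray quote would otherwise surface later as a less obvious add-on error. A minimal sketch of that check, shown with hypothetical sample values standing in for the real ARNs:

    ```shell
    # Hypothetical sample config standing in for the generated addon-config.json
    cat > addon-config-sample.json << 'EOF'
    {
      "executionRoleArn": "arn:aws:iam::123456789012:role/ExampleInferenceRole",
      "tlsCertificateS3Bucket": "hyperpod-tls-example",
      "hyperpodClusterArn": "arn:aws:sagemaker:us-west-2:123456789012:cluster/example"
    }
    EOF

    # Exits non-zero (and prints the parse error) if the file is not valid JSON
    python3 -m json.tool addon-config-sample.json > /dev/null && echo "valid JSON"
    ```

    Run the same `python3 -m json.tool addon-config.json` check against your real file before proceeding.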
  3. Install the inference operator add-on (minimum version: v1.0.0-eksbuild.1):

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --configuration-values file://addon-config.json \
      --region $REGION
  4. Monitor the installation progress and verify that it completes successfully:

    # Check installation status (repeat until status shows "ACTIVE")
    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health}" \
      --output table

    # Verify pods are running
    kubectl get pods -n hyperpod-inference-system

    # Check operator logs for any issues
    kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50

For detailed troubleshooting of installation issues, see HyperPod inference troubleshooting.

To verify that the inference operator is working correctly, continue to Verifying that the inference operator is operational.

Creating the prerequisite stack with a CloudFormation template

Instead of setting up the prerequisites manually, you can use a CloudFormation template to automatically create the IAM roles and policies that the inference operator requires.

  1. Set the input variables. Replace the placeholder values with your own values:

    #!/bin/bash
    set -e

    # ===== INPUT VARIABLES =====
    HP_CLUSTER_NAME="my-hyperpod-cluster"          # Replace with your HyperPod cluster name
    REGION="us-east-1"                             # Replace with your AWS region
    PREFIX="my-prefix"                             # Replace with your resource prefix
    SHORT_PREFIX="12a34d56"                        # Replace with your short prefix (maximum 8 characters)
    CREATE_DOMAIN="true"                           # Set to "false" if you don't need a SageMaker Studio domain
    STACK_NAME="hyperpod-inference-prerequisites"  # Replace with your stack name
    TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml"
  2. Derive the cluster and network information:

    # ===== DERIVE EKS CLUSTER NAME =====
    EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}')
    echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME"

    # ===== GET VPC AND OIDC =====
    VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text)
    echo "VPC_ID=$VPC_ID"
    OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||')
    echo "OIDC_PROVIDER=$OIDC_PROVIDER"

    # ===== GET PRIVATE ROUTE TABLES =====
    ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text)
    EKS_PRIVATE_ROUTE_TABLES=""
    for rtb in $ALL_ROUTE_TABLES; do
      HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null)
      if [ -z "$HAS_IGW" ]; then
        EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb"
      fi
    done
    echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES"

    # ===== CHECK S3 VPC ENDPOINT =====
    S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text)
    CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false")
    echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK"

    # ===== GET HYPERPOD DETAILS =====
    HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text)
    echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN"

    # ===== GET DEFAULT VPC FOR DOMAIN =====
    DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text)
    echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID"
    DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text)
    echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS"

    # ===== GET INSTANCE GROUPS =====
    INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')")
    echo "INSTANCE_GROUPS=$INSTANCE_GROUPS"
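    The INSTANCE_GROUPS step pipes the instance group names through python3 to rebuild them as a bracketed, quoted list for embedding in the CloudFormation parameter JSON. With a hypothetical two-group cluster, and without the extra layer of backslash escaping the full script needs, the transformation looks like:

    ```shell
    # Hypothetical describe-cluster output piped through the same kind of transform
    GROUPS_JSON=$(echo '["worker-group-1","worker-group-2"]' | \
      python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join(['\"' + g + '\"' for g in groups]) + ']')")
    echo "$GROUPS_JSON"   # prints: ["worker-group-1","worker-group-2"]
    ```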
  3. Create the parameters file and deploy the stack:

    # ===== CREATE PARAMETERS JSON =====
    cat > /tmp/cfn-params.json << EOF
    [
      {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"},
      {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"},
      {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"},
      {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"},
      {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"},
      {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"},
      {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"},
      {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"},
      {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"},
      {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"},
      {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"},
      {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"},
      {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"},
      {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"}
    ]
    EOF

    echo -e "\n===== CREATING CLOUDFORMATION STACK ====="
    aws cloudformation create-stack \
      --region $REGION \
      --stack-name $STACK_NAME \
      --template-url $TEMPLATE_URL \
      --parameters file:///tmp/cfn-params.json \
      --capabilities CAPABILITY_NAMED_IAM
  4. Monitor the stack creation status:

    aws cloudformation describe-stacks \
      --stack-name $STACK_NAME \
      --region $REGION \
      --query 'Stacks[0].StackStatus'
  5. After the stack is created successfully, retrieve the output values to use during the inference operator installation:

    aws cloudformation describe-stacks \
      --stack-name $STACK_NAME \
      --region $REGION \
      --query 'Stacks[0].Outputs'

After the CloudFormation stack is created, continue to Install the inference operator with the EKS add-on to install the inference operator.

Method 3: Helm chart installation

Use this method if you need more control over the installation configuration, or if the EKS add-on is not available in your Region.

Prerequisites

Before proceeding, verify that your AWS credentials are configured correctly and have the necessary permissions. Perform the following steps with an IAM principal that has administrator privileges and cluster administrator access to the Amazon EKS cluster. Verify that you have created a HyperPod cluster by following Creating a SageMaker HyperPod cluster with Amazon EKS orchestration. Verify that the helm, eksctl, and kubectl command line utilities are installed.

For Kubernetes administrative access to the Amazon EKS cluster, go to the Amazon EKS console and select the cluster you are working with. Check the Access tab and select IAM access entries. If your IAM principal has no entry, select Create access entry. Then select the desired IAM principal and associate the AmazonEKSClusterAdminPolicy with it.

  1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by the Amazon EKS cluster. Specify the Region and the HyperPod cluster name.

    export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
    export REGION=<region>

    # S3 bucket where TLS certificates will be uploaded
    # This should be the bucket name, not a URI
    BUCKET_NAME="<Enter name of your s3 bucket>"

    export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
      --query 'Orchestrator.Eks.ClusterArn' --output text | \
      cut -d'/' -f2)

    aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
  2. Set default environment variables.

    LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME"
    LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME"
    S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME"
    S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME"
    KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME"
    KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME"
    PRESIGNED_URL_ACCESS_POLICY_NAME="PresignedUrlAccessPolicy-$HYPERPOD_CLUSTER_NAME"
    HYPERPOD_INFERENCE_ACCESS_POLICY_NAME="HyperpodInferenceAccessPolicy-$HYPERPOD_CLUSTER_NAME"
    HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME"
    HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller"
    HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system"
    JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME"
    FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
  3. Retrieve the Amazon EKS cluster name from the cluster ARN, update your local kubeconfig, and verify connectivity by listing all pods across namespaces.

    kubectl get pods --all-namespaces
  4. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

    # Install the NVIDIA device plugin
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

    # Verify that GPUs are visible to Kubernetes
    kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu

Prepare your environment for the inference operator installation

  1. Gather the required AWS resource identifiers and ARNs to configure the service integration between the Amazon EKS, SageMaker AI, and IAM components.

    export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
    export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
    export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
  2. Associate an IAM OIDC identity provider with your cluster.

    eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
  3. Create the trust policy and permission policy JSON documents required for the HyperPod inference operator IAM role. These policies enable secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

    # Create trust policy JSON
    cat << EOF > trust-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": [ "sagemaker.amazonaws.com" ] },
          "Action": "sts:AssumeRole"
        },
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringLike": {
              "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
              "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
            }
          }
        }
      ]
    }
    EOF

    # Create permission policy JSON
    cat << EOF > permission-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "S3Access",
          "Effect": "Allow",
          "Action": [ "s3:Get*", "s3:List*", "s3:Describe*", "s3:PutObject" ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "ECRAccess",
          "Effect": "Allow",
          "Action": [
            "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer",
            "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages",
            "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview",
            "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings"
          ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "EC2Access",
          "Effect": "Allow",
          "Action": [
            "ec2:AssignPrivateIpAddresses", "ec2:AttachNetworkInterface", "ec2:CreateNetworkInterface",
            "ec2:DeleteNetworkInterface", "ec2:DescribeInstances", "ec2:DescribeTags",
            "ec2:DescribeNetworkInterfaces", "ec2:DescribeInstanceTypes", "ec2:DescribeSubnets",
            "ec2:DetachNetworkInterface", "ec2:ModifyNetworkInterfaceAttribute", "ec2:UnassignPrivateIpAddresses",
            "ec2:CreateTags", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups",
            "ec2:DescribeVolumes", "ec2:DescribeVolumesModifications", "ec2:DescribeVpcs",
            "ec2:CreateVpcEndpointServiceConfiguration", "ec2:DeleteVpcEndpointServiceConfigurations",
            "ec2:DescribeVpcEndpointServiceConfigurations", "ec2:ModifyVpcEndpointServicePermissions"
          ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "EKSAuthAccess",
          "Effect": "Allow",
          "Action": [ "eks-auth:AssumeRoleForPodIdentity" ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "EKSAccess",
          "Effect": "Allow",
          "Action": [ "eks:AssociateAccessPolicy", "eks:Describe*", "eks:List*", "eks:AccessKubernetesApi" ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "ApiGatewayAccess",
          "Effect": "Allow",
          "Action": [
            "apigateway:POST", "apigateway:GET", "apigateway:PUT",
            "apigateway:PATCH", "apigateway:DELETE", "apigateway:UpdateRestApiPolicy"
          ],
          "Resource": [
            "arn:aws:apigateway:*::/vpclinks",
            "arn:aws:apigateway:*::/vpclinks/*",
            "arn:aws:apigateway:*::/restapis",
            "arn:aws:apigateway:*::/restapis/*"
          ]
        },
        {
          "Sid": "ElasticLoadBalancingAccess",
          "Effect": "Allow",
          "Action": [
            "elasticloadbalancing:CreateLoadBalancer", "elasticloadbalancing:DescribeLoadBalancers",
            "elasticloadbalancing:DescribeLoadBalancerAttributes", "elasticloadbalancing:DescribeListeners",
            "elasticloadbalancing:DescribeListenerCertificates", "elasticloadbalancing:DescribeSSLPolicies",
            "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTargetGroups",
            "elasticloadbalancing:DescribeTargetGroupAttributes", "elasticloadbalancing:DescribeTargetHealth",
            "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTrustStores",
            "elasticloadbalancing:DescribeListenerAttributes"
          ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "SageMakerAccess",
          "Effect": "Allow",
          "Action": [ "sagemaker:*" ],
          "Resource": [ "*" ]
        },
        {
          "Sid": "AllowPassRoleToSageMaker",
          "Effect": "Allow",
          "Action": [ "iam:PassRole" ],
          "Resource": "arn:aws:iam::*:role/*",
          "Condition": {
            "StringEquals": { "iam:PassedToService": "sagemaker.amazonaws.com" }
          }
        },
        {
          "Sid": "AcmAccess",
          "Effect": "Allow",
          "Action": [ "acm:ImportCertificate", "acm:DeleteCertificate" ],
          "Resource": [ "*" ]
        }
      ]
    }
    EOF
  4. Create the execution role for the inference operator.

    aws iam create-policy --policy-name $HYPERPOD_INFERENCE_ACCESS_POLICY_NAME --policy-document file://permission-policy.json
    export policy_arn="arn:aws:iam::${ACCOUNT_ID}:policy/$HYPERPOD_INFERENCE_ACCESS_POLICY_NAME"

    aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
    aws iam put-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-name InferenceOperatorInlinePolicy --policy-document file://permission-policy.json
  5. Download and create the IAM policy that the AWS Load Balancer Controller needs to manage Application Load Balancers and Network Load Balancers in the EKS cluster.

    export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
    curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
    aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
  6. Create an IAM service account that links the Kubernetes service account with the IAM policy, enabling the AWS Load Balancer Controller to obtain the required AWS permissions through IRSA (IAM roles for service accounts).

    export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME"

    # Create the IAM service account with the gathered values
    eksctl create iamserviceaccount \
      --approve \
      --override-existing-serviceaccounts \
      --name=aws-load-balancer-controller \
      --namespace=kube-system \
      --cluster=$EKS_CLUSTER_NAME \
      --attach-policy-arn=$ALB_POLICY_ARN \
      --region=$REGION

    # Print the values for verification
    echo "Cluster Name: $EKS_CLUSTER_NAME"
    echo "Region: $REGION"
    echo "Policy ARN: $ALB_POLICY_ARN"
  7. Apply the kubernetes.io/role/elb tag to all subnets (public and private) in your Amazon EKS cluster.

    export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)

    # Add tags
    aws ec2 describe-subnets \
      --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
      --query 'Subnets[*].SubnetId' --output text | \
      tr '\t' '\n' | \
      xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1

    # Verify the tags were added
    aws ec2 describe-subnets \
      --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
      --query 'Subnets[*].SubnetId' --output text | \
      tr '\t' '\n' | \
      xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
  8. Create namespaces for KEDA and Cert Manager.

    kubectl create namespace keda
    kubectl create namespace cert-manager
  9. Create an Amazon S3 VPC endpoint.

    aws ec2 create-vpc-endpoint \
      --vpc-id ${VPC_ID} \
      --vpc-endpoint-type Gateway \
      --service-name "com.amazonaws.${REGION}.s3" \
      --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
  10. Set up S3 storage access:

    1. Create an IAM policy that grants the S3 permissions required to use Mountpoint for Amazon S3, so that the file system can access S3 buckets from within the cluster.

      export S3_CSI_BUCKET_NAME="<bucketname_for_mounting_through_filesystem>"

      cat <<EOF> s3accesspolicy.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "MountpointAccess",
            "Effect": "Allow",
            "Action": [
              "s3:ListBucket",
              "s3:GetObject",
              "s3:PutObject",
              "s3:AbortMultipartUpload",
              "s3:DeleteObject"
            ],
            "Resource": [
              "arn:aws:s3:::${S3_CSI_BUCKET_NAME}",
              "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*"
            ]
          }
        ]
      }
      EOF

      aws iam create-policy \
        --policy-name S3MountpointAccessPolicy \
        --policy-document file://s3accesspolicy.json

      cat <<EOF> s3accesstrustpolicy.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
              "StringEquals": {
                "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:s3-csi-driver-sa"
              }
            }
          }
        ]
      }
      EOF

      aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json
      aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
    2. (選用) 為 Amazon S3 CSI 驅動程式建立 IAM 服務帳戶。Amazon S3 CSI 驅動程式需要具有適當許可的 IAM 服務帳戶,才能將 S3 儲存貯體掛載為 Amazon EKS 叢集中的持久性磁碟區。此步驟會使用必要的 S3 存取政策建立必要的 IAM 角色和 Kubernetes 服務帳戶。

      %%bash -x
      export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION"
      export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' | tr -d '"')

      eksctl create iamserviceaccount \
          --name s3-csi-driver-sa \
          --namespace kube-system \
          --cluster $EKS_CLUSTER_NAME \
          --attach-policy-arn $S3_CSI_POLICY_ARN \
          --approve \
          --role-name $S3_CSI_ROLE_NAME \
          --region $REGION

      kubectl label serviceaccount s3-csi-driver-sa \
          app.kubernetes.io/component=csi-driver \
          app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver \
          app.kubernetes.io/managed-by=EKS \
          app.kubernetes.io/name=aws-mountpoint-s3-csi-driver \
          -n kube-system --overwrite
    3. (Optional) Install the Amazon S3 CSI driver add-on. This driver lets your pods mount S3 buckets as persistent volumes, giving your Kubernetes workloads direct access to S3 storage.

      %%bash -x
      export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text)

      eksctl create addon --name aws-mountpoint-s3-csi-driver \
          --cluster $EKS_CLUSTER_NAME \
          --service-account-role-arn $S3_CSI_ROLE_ARN --force
    4. (Optional) Create a persistent volume claim (PVC) for S3 storage. The PVC lets your pods request and consume S3 storage as if it were a conventional file system.

      %%bash -x
      cat <<EOF> pvc_s3.yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: s3-claim
      spec:
        accessModes:
          - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
        storageClassName: "" # required for static provisioning
        resources:
          requests:
            storage: 1200Gi # ignored, required
        volumeName: s3-pv
      EOF
      kubectl apply -f pvc_s3.yaml
  11. (Optional) Set up FSx storage access. Create an IAM service account for the Amazon FSx CSI driver. The FSx CSI driver uses this service account to interact with the Amazon FSx service on behalf of your cluster.

    %%bash -x
    eksctl create iamserviceaccount \
        --name fsx-csi-controller-sa \
        --namespace kube-system \
        --cluster $EKS_CLUSTER_NAME \
        --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
        --approve \
        --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \
        --region $REGION

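The PVC in step 10 binds by name to a PersistentVolume called s3-pv, which is not shown in this section. The following is a minimal sketch of a static PV for the Mountpoint for Amazon S3 CSI driver; DOC-EXAMPLE-BUCKET is a placeholder bucket name and volumeHandle is an arbitrary unique identifier, so adjust both to your environment before applying.

```shell
# Hypothetical static PV backing the s3-claim PVC from step 10.
# DOC-EXAMPLE-BUCKET is a placeholder; s3.csi.aws.com is the Mountpoint
# for Amazon S3 CSI driver name. Verify both against your environment.
S3_CSI_BUCKET_NAME="${S3_CSI_BUCKET_NAME:-DOC-EXAMPLE-BUCKET}"
cat <<EOF> pv_s3.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # ignored by the driver, but required
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-delete
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-csi-driver-volume # must be unique per PV
    volumeAttributes:
      bucketName: ${S3_CSI_BUCKET_NAME}
EOF
# Apply with: kubectl apply -f pv_s3.yaml
```

Once this PV exists, the s3-claim PVC can bind to it and pods that mount the claim see the bucket contents as a file system.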
Create the KEDA operator role

  1. Create the trust policy and permissions policy.

    # Create trust policy
    cat <<EOF > /tmp/keda-trust-policy.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringLike": {
                        "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator",
                        "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                    }
                }
            }
        ]
    }
    EOF

    # Create permissions policy
    cat <<EOF > /tmp/keda-policy.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetMetricData",
                    "cloudwatch:GetMetricStatistics",
                    "cloudwatch:ListMetrics"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "aps:QueryMetrics",
                    "aps:GetLabels",
                    "aps:GetSeries",
                    "aps:GetMetricMetadata"
                ],
                "Resource": "*"
            }
        ]
    }
    EOF

    # Create the role
    aws iam create-role \
        --role-name keda-operator-role \
        --assume-role-policy-document file:///tmp/keda-trust-policy.json

    # Create the policy
    KEDA_POLICY_ARN=$(aws iam create-policy \
        --policy-name KedaOperatorPolicy \
        --policy-document file:///tmp/keda-policy.json \
        --query 'Policy.Arn' \
        --output text)

    # Attach the policy to the role
    aws iam attach-role-policy \
        --role-name keda-operator-role \
        --policy-arn $KEDA_POLICY_ARN
  2. If you are using gated models, create an IAM role that can access them.

    1. Create the IAM policy.

      %%bash -s $REGION
      cat <<EOF> /tmp/presignedurl-policy.json
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "CreatePresignedUrlAccess",
                  "Effect": "Allow",
                  "Action": [
                      "sagemaker:CreateHubContentPresignedUrls"
                  ],
                  "Resource": [
                      "arn:aws:sagemaker:$1:aws:hub/SageMakerPublicHub",
                      "arn:aws:sagemaker:$1:aws:hub-content/SageMakerPublicHub/*/*"
                  ]
              }
          ]
      }
      EOF
      aws iam create-policy --policy-name PresignedUrlAccessPolicy --policy-document file:///tmp/presignedurl-policy.json

      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"

      cat <<EOF > /tmp/trust-policy.json
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
                  },
                  "Action": "sts:AssumeRoleWithWebIdentity",
                  "Condition": {
                      "StringLike": {
                          "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-controller-manager",
                          "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "sagemaker.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      EOF
    2. Create the IAM role.

      # Create the role using existing trust policy
      aws iam create-role \
          --role-name $JUMPSTART_GATED_ROLE_NAME \
          --assume-role-policy-document file:///tmp/trust-policy.json

      # Attach the existing PresignedUrlAccessPolicy to the role
      aws iam attach-role-policy \
          --role-name $JUMPSTART_GATED_ROLE_NAME \
          --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/PresignedUrlAccessPolicy
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
    3. Attach the AmazonSageMakerFullAccess policy to the execution role.

      aws iam attach-role-policy --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --policy-arn=arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Install the inference operator

  1. Install the HyperPod inference operator. This step gathers the required AWS resource identifiers and generates the Helm installation command with the appropriate configuration parameters.

    Access the Helm chart at https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart.

    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    cd helm_chart/HyperPodHelmChart
    helm dependencies update charts/inference-operator
    %%bash -x
    HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
    echo $HYPERPOD_INFERENCE_ROLE_ARN

    S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text)
    echo $S3_CSI_ROLE_ARN

    HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn")

    # Verify values
    echo "Cluster Name: $EKS_CLUSTER_NAME"
    echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN"
    echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN"

    # Run the HyperPod inference operator installation.
    helm install hyperpod-inference-operator charts/inference-operator \
        -n kube-system \
        --set region=$REGION \
        --set eksClusterName=$EKS_CLUSTER_NAME \
        --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \
        --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \
        --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \
        --set s3.node.serviceAccount.create=false \
        --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \
        --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \
        --set alb.region=$REGION \
        --set alb.clusterName=$EKS_CLUSTER_NAME \
        --set alb.vpcId=$VPC_ID
        # For JumpStart gated model usage, add:
        # --set jumpstartGatedModelDownloadRoleArn=$JUMPSTART_GATED_ROLE_ARN
  2. Configure the service account annotation for IAM integration. This annotation lets the operator's service account obtain the IAM permissions it needs to manage inference endpoints and interact with AWS services.

    %%bash -x
    EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///')

    # Annotate service account
    kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \
        -n hyperpod-inference-system \
        eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \
        --overwrite

Verify that the inference operator is working

Follow these steps to confirm that your inference operator installation works correctly by deploying and testing a simple model.

Deploy a test model to validate the operator
  1. Create the model deployment configuration file. This creates a Kubernetes manifest that defines a JumpStart model deployment for the HyperPod inference operator.

    cat <<EOF> simple_model_install.yaml
    ---
    apiVersion: inference.sagemaker.aws.amazon.com/v1
    kind: JumpStartModel
    metadata:
      name: testing-deployment-bert
      namespace: default
    spec:
      model:
        modelId: "huggingface-eqa-bert-base-cased"
      sageMakerEndpoint:
        name: "hp-inf-ep-for-testing"
      server:
        instanceType: "ml.c5.2xlarge"
      environmentVariables:
        - name: SAMPLE_ENV_VAR
          value: "sample_value"
      maxDeployTimeInSeconds: 1800
    EOF
  2. Deploy the model and clean up the configuration file.

    kubectl create -f simple_model_install.yaml
    rm -f simple_model_install.yaml
  3. Verify the service account configuration to make sure the operator can obtain AWS permissions.

    # Get the service account details
    kubectl get serviceaccount -n hyperpod-inference-system

    # Check if the service account has the AWS annotations
    kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
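Once the test endpoint reaches InService, you can also exercise it end to end. The sketch below builds a request body for the BERT test model and invokes the endpoint named in the manifest (hp-inf-ep-for-testing) through the SageMaker runtime. The question/context payload shape is an assumption for extractive question-answering models, and the live AWS calls only run when you set RUN_LIVE=1, so the snippet is safe to dry-run.

```shell
# Endpoint name comes from the sageMakerEndpoint field in the test manifest above.
ENDPOINT_NAME="hp-inf-ep-for-testing"

# Hypothetical request body; extractive QA models generally expect a
# question plus a context passage (verify against the model's documentation).
cat <<EOF> payload.json
{
  "question": "Where is the endpoint hosted?",
  "context": "The test model is hosted on a SageMaker HyperPod cluster."
}
EOF

# Guarded live calls: only run against AWS when explicitly requested.
if [ "${RUN_LIVE:-0}" = "1" ]; then
    # Check that the endpoint is in service before invoking it.
    aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" \
        --query 'EndpointStatus' --output text

    # Send the payload and print the model response.
    aws sagemaker-runtime invoke-endpoint \
        --endpoint-name "$ENDPOINT_NAME" \
        --content-type application/json \
        --body fileb://payload.json \
        response.json
    cat response.json
fi
```

If the describe call reports Creating, wait for the deployment to finish (up to the maxDeployTimeInSeconds value from the manifest) before invoking.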
Configure deployment settings (if using the Studio UI)
  1. Review the recommended instance types under Deployment settings.

  2. If you modify the instance type, make sure it is compatible with your HyperPod cluster. If no compatible instance is available, contact your administrator.

  3. For MIG-enabled GPU partitioned instances, select an appropriate GPU partition from the available MIG profiles to optimize GPU utilization. For more information, see Using GPU partitions in Amazon SageMaker HyperPod.

  4. If you use task governance, configure the priority settings for model deployment preemption.

  5. Enter the namespace provided by your administrator. If needed, contact your administrator for the correct namespace.

(Optional) Configure user access through the JumpStart UI in SageMaker AI Studio Classic

For more background on configuring SageMaker HyperPod access for Studio Classic users and setting up fine-grained Kubernetes RBAC permissions for data scientist users, see Set up Amazon EKS clusters in Studio and Set up Kubernetes role-based access control.

  1. Identify the IAM role that data scientist users will use to manage models and deploy them to SageMaker HyperPod from SageMaker AI Studio Classic. This is typically the Studio Classic user's user profile execution role or the domain execution role.

    %%bash -x
    export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>"
    export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy"
    export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
        --query 'Orchestrator.Eks.ClusterArn' --output text)
    export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
  2. Attach an identity policy that enables model deployment access.

    %%bash -x
    # Create access policy
    cat << EOF > hyperpod-deployment-ui-access-policy.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DescribeHyperpodClusterPermissions",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DescribeCluster"
                ],
                "Resource": "$HYPERPOD_CLUSTER_ARN"
            },
            {
                "Sid": "UseEksClusterPermissions",
                "Effect": "Allow",
                "Action": [
                    "eks:DescribeCluster",
                    "eks:AccessKubernetesApi",
                    "eks:MutateViaKubernetesApi",
                    "eks:DescribeAddon"
                ],
                "Resource": "$EKS_CLUSTER_ARN"
            },
            {
                "Sid": "ListPermission",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:ListClusters",
                    "sagemaker:ListEndpoints"
                ],
                "Resource": "*"
            },
            {
                "Sid": "SageMakerEndpointAccess",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DescribeEndpoint",
                    "sagemaker:InvokeEndpoint"
                ],
                "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*"
            }
        ]
    }
    EOF

    aws iam put-role-policy --role-name $DATASCIENTIST_ROLE_NAME \
        --policy-name HyperPodDeploymentUIAccessInlinePolicy \
        --policy-document file://hyperpod-deployment-ui-access-policy.json
  3. Create an EKS access entry that maps the user to the Kubernetes groups.

    %%bash -x
    aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \
        --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \
        --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
  4. Create Kubernetes RBAC policies for the user.

    %%bash -x
    cat << EOF > cluster_level_config.yaml
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: hyperpod-scientist-user-cluster-role
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["list"]
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["list"]
      - apiGroups: [""]
        resources: ["namespaces"]
        verbs: ["list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: hyperpod-scientist-user-cluster-role-binding
    subjects:
      - kind: Group
        name: hyperpod-scientist-user-cluster-level
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: hyperpod-scientist-user-cluster-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    kubectl apply -f cluster_level_config.yaml

    cat << EOF > namespace_level_role.yaml
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
      name: hyperpod-scientist-user-namespace-level-role
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "get"]
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list"]
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get", "list"]
      - apiGroups: [""]
        resources: ["pods/exec"]
        verbs: ["get", "create"]
      - apiGroups: ["kubeflow.org"]
        resources: ["pytorchjobs", "pytorchjobs/status"]
        verbs: ["get", "list", "create", "delete", "update", "describe"]
      - apiGroups: [""]
        resources: ["configmaps"]
        verbs: ["create", "update", "get", "list", "delete"]
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["create", "get", "list", "delete"]
      - apiGroups: ["inference.sagemaker.aws.amazon.com"]
        resources: ["inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel"]
        verbs: ["get", "list", "create", "delete", "update", "describe"]
      - apiGroups: ["autoscaling"]
        resources: ["horizontalpodautoscalers"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
      name: hyperpod-scientist-user-namespace-level-role-binding
    subjects:
      - kind: Group
        name: hyperpod-scientist-user-namespace-level
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: hyperpod-scientist-user-namespace-level-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    kubectl apply -f namespace_level_role.yaml
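A quick way to confirm that the bindings above behave as intended is kubectl's impersonation support: `kubectl auth can-i` evaluates a request as if it came from one of the scientist groups. The sketch below writes the checks to a reviewable script and only queries the API server when RUN_LIVE=1 is set; the impersonated user name (scientist-user) is arbitrary, because authorization here is decided by the group.

```shell
# Namespace defaults to the value used earlier in this section.
NAMESPACE="${DATASCIENTIST_HYPERPOD_NAMESPACE:-team-namespace}"

# Write the impersonation checks to a script so they can be reviewed first.
cat <<EOF> check_rbac.sh
#!/bin/sh
# Namespace-level group: expected answer "yes" (may create pods in \$1).
kubectl auth can-i create pods -n "\$1" \
    --as scientist-user --as-group hyperpod-scientist-user-namespace-level
# Cluster-level group: expected answer "no" (list-only permissions).
kubectl auth can-i create pods -n "\$1" \
    --as scientist-user --as-group hyperpod-scientist-user-cluster-level
# Cluster-level group: expected answer "yes" (may list namespaces).
kubectl auth can-i list namespaces \
    --as scientist-user --as-group hyperpod-scientist-user-cluster-level
EOF
chmod +x check_rbac.sh

# Run only against a live cluster.
if [ "${RUN_LIVE:-0}" = "1" ]; then
    ./check_rbac.sh "$NAMESPACE"
fi
```

Note that `kubectl auth can-i` exits nonzero when the answer is "no", so avoid running the script under `set -e`.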