設定 HyperPod 叢集以進行模型部署 - Amazon SageMaker AI

本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。

設定 HyperPod 叢集以進行模型部署

本指南為您提供在 Amazon SageMaker HyperPod 叢集上啟用推論功能的全面設定指南。下列步驟可協助您設定所需的基礎設施、許可和運算子,以支援機器學習工程師部署和管理推論端點。

注意

若要建立已預先安裝推論運算子的叢集,請參閱 建立 EKS 協作的 SageMaker HyperPod 叢集。若要在現有叢集上安裝推論運算子,請繼續執行下列程序。

先決條件

在繼續之前,請確認您的 AWS 登入資料已正確設定並具有必要的許可。IAM 主體需要執行下列步驟,具有 Amazon EKS 叢集的管理員權限和叢集管理員存取權。驗證您是否已使用 使用 Amazon EKS 協同運作建立 SageMaker HyperPod 叢集 建立 HyperPod 叢集。驗證您是否已安裝 helm、eksctl 和 kubectl 命令列公用程式。

如需 Amazon EKS 叢集的 Kubernetes 管理存取權,請前往 Amazon EKS 主控台,然後選取您正在使用的叢集。查看存取索引標籤,然後選取 IAM 存取項目。如果您的 IAM 主體沒有項目,請選取建立存取項目。然後選取所需的 IAM 主體,並將 AmazonEKSClusterAdminPolicy 與其建立關聯。

  1. 設定 kubectl 以連線至 Amazon EKS 叢集協調的新建立 HyperPod 叢集。指定區域和 HyperPod 叢集名稱。

    export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name> export REGION=<region> # S3 bucket where tls certificates will be uploaded BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text | \ cut -d'/' -f2) aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
  2. 設定預設 env 變數。

    LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME" LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME" S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME" S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME" PRESIGNED_URL_ACCESS_POLICY_NAME="PresignedUrlAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ACCESS_POLICY_NAME="HyperpodInferenceAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller" HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system" JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME" FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
  3. 從叢集 ARN 擷取 Amazon EKS 叢集名稱、更新本機 kubeconfig,並透過列出跨命名空間的所有 Pod 來驗證連線。

    kubectl get pods --all-namespaces
  4. (選用) 安裝 NVIDIA 裝置外掛程式,以在叢集上啟用 GPU 支援。

    #Install nvidia device plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml # Verify that GPUs are visible to k8s kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu

準備您的環境以進行推論運算子安裝

現在已設定您的 HyperPod 叢集,下一步就是安裝推論運算子。推論運算子是一種 Kubernetes 運算子,可讓您在 Amazon EKS 叢集上部署和管理機器學習推論端點。

完成下一個關鍵準備步驟,以確保您的 Amazon EKS 叢集具有適當的安全組態和支援基礎設施元件。這包括為跨服務身分驗證設定 IAM 角色和安全性政策、為輸入管理安裝 AWS Load Balancer控制器、為持久性儲存存取設定 Amazon S3 和 Amazon FSx CSI 驅動程式,以及為自動擴展和憑證管理功能部署 KEDA 和 cert-manager。

  1. 收集必要的 AWS 資源識別符和 ARNs,以設定 Amazon EKS、SageMaker AI 和 IAM 元件之間的服務整合。

    %%bash -x export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text) export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
  2. 將 IAM OIDCidentity 提供者與您的叢集建立關聯。

    eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
  3. 建立 HyperPod 推論運算子 IAM 角色所需的信任政策和許可政策 JSON 文件。這些政策可在 Amazon EKS、SageMaker AI 和其他 AWS 服務之間啟用安全的跨服務通訊。

    bash # Create trust policy JSON cat << EOF > trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ] }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:*:*" } } } ] } EOF # Create permission policy JSON cat << EOF > permission-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "S3Access", "Effect": "Allow", "Action": [ "s3:Get*", "s3:List*", "s3:Describe*", "s3:PutObject" ], "Resource": [ "*" ] }, { "Sid": "ECRAccess", "Effect": "Allow", "Action": [ "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview", "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings" ], "Resource": [ "*" ] }, { "Sid": "EC2Access", "Effect": "Allow", "Action": [ "ec2:AssignPrivateIpAddresses", "ec2:AttachNetworkInterface", "ec2:CreateNetworkInterface", "ec2:DeleteNetworkInterface", "ec2:DescribeInstances", "ec2:DescribeTags", "ec2:DescribeNetworkInterfaces", "ec2:DescribeInstanceTypes", "ec2:DescribeSubnets", "ec2:DetachNetworkInterface", "ec2:ModifyNetworkInterfaceAttribute", "ec2:UnassignPrivateIpAddresses", "ec2:CreateTags", "ec2:DescribeInstances", "ec2:DescribeInstanceTypes", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVolumes", "ec2:DescribeVolumesModifications", "ec2:DescribeVpcs", "ec2:CreateVpcEndpointServiceConfiguration", "ec2:DeleteVpcEndpointServiceConfigurations", "ec2:DescribeVpcEndpointServiceConfigurations", "ec2:ModifyVpcEndpointServicePermissions" ], "Resource": [ "*" ] }, { "Sid": "EKSAuthAccess", "Effect": "Allow", "Action": [ "eks-auth:AssumeRoleForPodIdentity" ], "Resource": [ "*" ] }, { "Sid": "EKSAccess", "Effect": "Allow", "Action": [ "eks:AssociateAccessPolicy", "eks:Describe*", "eks:List*", "eks:AccessKubernetesApi" ], "Resource": [ "*" ] }, { "Sid": "ApiGatewayAccess", "Effect": "Allow", "Action": [ "apigateway:POST", "apigateway:GET", "apigateway:PUT", "apigateway:PATCH", "apigateway:DELETE", "apigateway:UpdateRestApiPolicy" ], "Resource": [ "arn:aws:apigateway:*::/vpclinks", "arn:aws:apigateway:*::/vpclinks/*", "arn:aws:apigateway:*::/restapis", "arn:aws:apigateway:*::/restapis/*" ] }, { "Sid": "ElasticLoadBalancingAccess", "Effect": "Allow", "Action": [ "elasticloadbalancing:CreateLoadBalancer", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeLoadBalancerAttributes", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeListenerCertificates", "elasticloadbalancing:DescribeSSLPolicies", "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTargetGroups", "elasticloadbalancing:DescribeTargetGroupAttributes", "elasticloadbalancing:DescribeTargetHealth", "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTrustStores", "elasticloadbalancing:DescribeListenerAttributes" ], "Resource": [ "*" ] }, { "Sid": "SageMakerAccess", "Effect": "Allow", "Action": [ "sagemaker:*" ], "Resource": [ "*" ] }, { "Sid": "AllowPassRoleToSageMaker", "Effect": "Allow", "Action": [ "iam:PassRole" ], "Resource": "arn:aws:iam::*:role/*", "Condition": { "StringEquals": { "iam:PassedToService": "sagemaker.amazonaws.com" } } }, { "Sid": "AcmAccess", "Effect": "Allow", "Action": [ "acm:ImportCertificate", "acm:DeleteCertificate" ], "Resource": [ "*" ] } ] } EOF
  4. 建立推論運算子的執行角色。

    aws iam create-policy --policy-name $HYPERPOD_INFERENCE_ACCESS_POLICY_NAME --policy-document file://permission-policy.json export policy_arn="arn:aws:iam::${ACCOUNT_ID}:policy/$HYPERPOD_INFERENCE_ACCESS_POLICY_NAME"
    aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json aws iam put-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-name InferenceOperatorInlinePolicy --policy-document file://permission-policy.json
  5. 下載並建立 AWS Load Balancer控制器所需的 IAM 政策,以在 EKS 叢集中管理 Application Load Balancer 和 Network Load Balancer。

    %%bash -x export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
  6. 建立將 Kubernetes 服務帳戶與 IAM 政策連結的 IAM 服務帳戶,讓 AWS Load Balancer控制器能夠透過 IRSA (服務帳戶的 IAM 角色) 取得必要的 AWS 許可。

    %%bash -x export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME" # Create IAM service account with gathered values eksctl create iamserviceaccount \ --approve \ --override-existing-serviceaccounts \ --name=aws-load-balancer-controller \ --namespace=kube-system \ --cluster=$EKS_CLUSTER_NAME \ --attach-policy-arn=$ALB_POLICY_ARN \ --region=$REGION # Print the values for verification echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Region: $REGION" echo "Policy ARN: $ALB_POLICY_ARN"
  7. 將標籤 (kubernetes.io.role/elb) 套用至 Amazon EKS 叢集中的所有子網路 (公有和私有)。

    export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text) # Add Tags aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | \ xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1 # Verify Tags are added aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
  8. 為 KEDA 和 Cert Manager 建立命名空間。

    kubectl create namespace keda kubectl create namespace cert-manager
  9. 建立 Amazon S3 VPC 端點。

    aws ec2 create-vpc-endpoint \ --vpc-id ${VPC_ID} \ --vpc-endpoint-type Gateway \ --service-name "com.amazonaws.${REGION}.s3" \ --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
  10. 設定 S3 儲存體存取:

    1. 建立 IAM 政策,授予使用適用於 Amazon S3 的掛載點所需的 S3 許可,讓檔案系統可以從叢集內存取 S3 儲存貯體。

      %%bash -x export S3_CSI_BUCKET_NAME=“<bucketname_for_mounting_through_filesystem>” cat <<EOF> s3accesspolicy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "MountpointAccess", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::${S3_CSI_BUCKET_NAME}", "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*" ] } ] } EOF aws iam create-policy \ --policy-name S3MountpointAccessPolicy \ --policy-document file://s3accesspolicy.json cat <<EOF> s3accesstrustpolicy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringEquals": { "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:${s3-csi-driver-sa}" } } } ] } EOF aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
    2. (選用) 為 Amazon S3 CSI 驅動程式建立 IAM 服務帳戶。Amazon S3 CSI 驅動程式需要具有適當許可的 IAM 服務帳戶,才能將 S3 儲存貯體掛載為 Amazon EKS 叢集中的持久性磁碟區。此步驟會使用必要的 S3 存取政策建立必要的 IAM 角色和 Kubernetes 服務帳戶。

      %%bash -x export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION" export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' | tr -d '"') eksctl create iamserviceaccount \ --name s3-csi-driver-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn $S3_CSI_POLICY_ARN \ --approve \ --role-name $S3_CSI_ROLE_NAME \ --region $REGION kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite
    3. (選用) 安裝 Amazon S3 CSI 驅動程式附加元件。此驅動程式可讓您的 Pod 將 S3 儲存貯體掛載為持久性磁碟區,讓您可從 Kubernetes 工作負載內直接存取 S3 儲存體。

      %%bash -x export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text) eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force
    4. (選用) 為 S3 儲存體建立持久性磁碟區宣告 (PVC)。此 PVC 可讓您的 Pod 請求和使用 S3 儲存體,好像其是傳統檔案系統一樣。

      %%bash -x cat <<EOF> pvc_s3.yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: s3-claim spec: accessModes: - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany storageClassName: "" # required for static provisioning resources: requests: storage: 1200Gi # ignored, required volumeName: s3-pv EOF kubectl apply -f pvc_s3.yaml
  11. (選用) 設定 FSx 儲存體存取。為 Amazon FSx CSI 驅動程式建立 IAM 服務帳戶。FSx CSI 驅動程式將使用此服務帳戶,代表您的叢集與 Amazon FSx 服務互動。

    %%bash -x eksctl create iamserviceaccount \ --name fsx-csi-controller-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \ --approve \ --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \ --region $REGION

建立 KEDA 運算子角色

  1. 建立信任政策和許可政策。

    # Create trust policy cat <<EOF > /tmp/keda-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy cat <<EOF > /tmp/keda-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "aps:QueryMetrics", "aps:GetLabels", "aps:GetSeries", "aps:GetMetricMetadata" ], "Resource": "*" } ] } EOF # Create the role aws iam create-role \ --role-name keda-operator-role \ --assume-role-policy-document file:///tmp/keda-trust-policy.json # Create the policy KEDA_POLICY_ARN=$(aws iam create-policy \ --policy-name KedaOperatorPolicy \ --policy-document file:///tmp/keda-policy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name keda-operator-role \ --policy-arn $KEDA_POLICY_ARN
  2. 如果您使用的是門控模型,請建立 IAM 角色來存取門控模型。

    1. 建立 IAM 政策。

      %%bash -s $REGION cat <<EOF> /tmp/presignedurl-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "CreatePresignedUrlAccess", "Effect": "Allow", "Action": [ "sagemaker:CreateHubContentPresignedUrls" ], "Resource": [ "arn:aws:sagemaker:$1:aws:hub/SageMakerPublicHub", "arn:aws:sagemaker:$1:aws:hub-content/SageMakerPublicHub/*/*" ] } ] } EOF aws iam create-policy --policy-name PresignedUrlAccessPolicy --policy-document file:///tmp/presignedurl-policy.json JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}" cat <<EOF > /tmp/trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:$HYPERPOD_INFERENCE_SA_NAME", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } }, { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EOF
    2. 建立 IAM 角色。

      # Create the role using existing trust policy aws iam create-role \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --assume-role-policy-document file:///tmp/trust-policy.json # Attach the existing PresignedUrlAccessPolicy to the role aws iam attach-role-policy \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/PresignedUrlAccessPolicy
      JUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0] !echo $JUMPSTART_GATED_ROLE_ARN
    3. SageMakerFullAccess 政策新增至執行角色。

      aws iam attach-role-policy --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --policy-arn=arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

安裝推論運算子

  1. 安裝 HyperPod 推論運算子。此步驟會收集必要的 AWS 資源識別碼,並使用適當的組態參數產生 Helm 安裝命令。

    https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart 存取 Helm Chart。

    git clone https://github.com/aws/sagemaker-hyperpod-cli cd sagemaker-hyperpod-cli cd helm_chart/HyperPodHelmChart helm dependencies update charts/inference-operator
    %%bash -x HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text) echo $HYPERPOD_INFERENCE_ROLE_ARN S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text) echo $S3_CSI_ROLE_ARN HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn") # Verify values echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN" echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN" # Run the the HyperPod inference operator installation. helm install hyperpod-inference-operator charts/inference-operator \ -n kube-system \ --set region=$REGION \ --set eksClusterName=$EKS_CLUSTER_NAME \ --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \ --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \ --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \ --set s3.node.serviceAccount.create=false \ --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \ --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \ --set alb.region=$REGION \ --set alb.clusterName=$EKS_CLUSTER_NAME \ --set alb.vpcId=$VPC_ID # For JumpStart Gated Model usage, Add # --set jumpstartGatedModelDownloadRoleArn=$UMPSTART_GATED_ROLE_ARN
  2. 設定服務帳戶註釋進行 IAM 整合。此註釋可讓運算子的服務帳戶取得必要的 IAM 許可,以管理推論端點並與 AWS 服務互動。

    %%bash -x EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///') # Annotate service account kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \ -n hyperpod-inference-system \ eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \ --overwrite

驗證推論運算子是否運作中

  1. 建立模型部署組態檔案。這會建立 Kubernetes 資訊清單檔案,為 HyperPod 推論運算子定義 JumpStart 模型部署。組態指定如何從 Amazon SageMaker JumpStart 部署預先訓練的模型,做為 Amazon EKS 叢集上的推論端點。

    cat <<EOF>> simple_model_install.yaml --- apiVersion: inference.sagemaker.aws.amazon.com/v1 kind: JumpStartModel metadata: name: testing-deployment-bert namespace: default spec: model: modelId: "huggingface-eqa-bert-base-cased" sageMakerEndpoint: name: "hp-inf-ep-for-testing" server: instanceType: "ml.c5.2xlarge" environmentVariables: - name: SAMPLE_ENV_VAR value: "sample_value" maxDeployTimeInSeconds: 1800 EOF
  2. 部署模型並清除組態檔案。此步驟會建立 JumpStart 模型資源,並移除暫時組態檔案以維護乾淨的工作區。

    %%bash -x kubectl create -f simple_model_install.yaml rm -rfv simple_model_install.yaml
  3. 檢查模型是否已安裝並執行中。此驗證可確保運算子可以成功取得管理推論端點的 AWS 許可。

    %%bash # Get the service account details kubectl get serviceaccount -n hyperpod-inference-system # Check if the service account has the AWS annotations kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
  4. 部署設定下,JumpStart 將建議一個執行個體進行部署。如有必要,您可以修改這些設定。

    1. 如果您修改執行個體類型,請確定它與所選的 HyperPod 叢集相容。如果沒有任何相容的執行個體,您將需要選取新的 HyperPod 叢集,或聯絡您的管理員,將相容的執行個體新增至叢集。

    2. 如果您選取的執行個體類型支援 GPU 分割,且您的叢集已設定 MIG,您可以從可用的 MIG 設定檔中選取特定的 GPU 分割區。這可讓您僅配置模型所需的 GPU 資源,以最佳化 GPU 使用率。如需詳細資訊,請參閱在 Amazon SageMaker HyperPod 中使用 GPU 分割區

    3. 若要排定模型部署的優先順序,請安裝任務治理附加元件、建立運算配置,以及設定叢集政策的任務排名。一旦完成此操作,您應該會看到一個選項,用來為模型部署選取優先順序,這可用於先佔叢集上其他部署和任務的優先權。

    4. 輸入管理員已為您提供其存取權的命名空間。您可能必須直接聯絡您的管理員,以取得確切的命名空間。一旦提供了有效的命名空間,就應啟用部署按鈕以部署模型。

(選用) 透過 SageMaker AI Studio Classic 中的 JumpStart UI 設定使用者存取

如需為 Studio Classic 使用者設定 SageMaker HyperPod 存取,以及為資料科學家使用者設定精細 Kubernetes RBAC 許可的更多背景,請參閱在 Studio 中設定 Amazon EKS 叢集設定 Kubernetes 角色型存取控制

  1. 識別資料科學家使用者將用來從 SageMaker AI Studio Classic 管理模型並將其部署至 SageMaker HyperPod 的 IAM 角色。這通常是 Studio Classic 使用者的使用者設定檔執行角色或網域執行角色。

    %%bash -x export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>" export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy" export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text) export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
  2. 連接啟用模型部署存取的身分政策。

    %%bash -x # Create access policy cat << EOF > hyperpod-deployment-ui-access-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "DescribeHyerpodClusterPermissions", "Effect": "Allow", "Action": [ "sagemaker:DescribeCluster" ], "Resource": "$HYPERPOD_CLUSTER_ARN" }, { "Sid": "UseEksClusterPermissions", "Effect": "Allow", "Action": [ "eks:DescribeCluster", "eks:AccessKubernetesApi", "eks:MutateViaKubernetesApi", "eks:DescribeAddon" ], "Resource": "$EKS_CLUSTER_ARN" }, { "Sid": "ListPermission", "Effect": "Allow", "Action": [ "sagemaker:ListClusters", "sagemaker:ListEndpoints" ], "Resource": "*" }, { "Sid": "SageMakerEndpointAccess", "Effect": "Allow", "Action": [ "sagemaker:DescribeEndpoint", "sagemaker:InvokeEndpoint" ], "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*" } ] } EOF aws iam put-role-policy --role-name DATASCIENTIST_ROLE_NAME --policy-name HyperPodDeploymentUIAccessInlinePolicy --policy-document file://hyperpod-deployment-ui-access-policy.json
  3. 建立一個 EKS 存取項目,讓使用者將其對應至 kubernetes 群組。

    %%bash -x aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \ --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \ --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
  4. 為使用者建立 Kubernetes RBAC 政策。

    %%bash -x cat << EOF > cluster_level_config.yaml kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: hyperpod-scientist-user-cluster-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["list"] - apiGroups: [""] resources: ["nodes"] verbs: ["list"] - apiGroups: [""] resources: ["namespaces"] verbs: ["list"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: hyperpod-scientist-user-cluster-role-binding subjects: - kind: Group name: hyperpod-scientist-user-cluster-level apiGroup: rbac.authorization.k8s.io roleRef: kind: ClusterRole name: hyperpod-scientist-user-cluster-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f cluster_level_config.yaml cat << EOF > namespace_level_role.yaml kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["create", "get"] - apiGroups: [""] resources: ["nodes"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/exec"] verbs: ["get", "create"] - apiGroups: ["kubeflow.org"] resources: ["pytorchjobs", "pytorchjobs/status"] verbs: ["get", "list", "create", "delete", "update", "describe"] - apiGroups: [""] resources: ["configmaps"] verbs: ["create", "update", "get", "list", "delete"] - apiGroups: [""] resources: ["secrets"] verbs: ["create", "get", "list", "delete"] - apiGroups: [ "inference.sagemaker.aws.amazon.com" ] resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ] verbs: [ "get", "list", "create", "delete", "update", "describe" ] - apiGroups: [ "autoscaling" ] resources: [ "horizontalpodautoscalers" ] verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role-binding subjects: - kind: Group name: hyperpod-scientist-user-namespace-level apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: hyperpod-scientist-user-namespace-level-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f namespace_level_role.yaml