


# Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability-model"></a>

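SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the [MLflow application on Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html). Cluster admins set up the MLflow server and connect it to the SageMaker HyperPod clusters. Data scientists can then use MLflow to gain insights into their models.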

**To set up an MLflow server using the AWS CLI**

A cluster admin must create an MLflow tracking server.

1. Create a SageMaker AI MLflow tracking server by following the instructions at [Create a tracking server using the AWS CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Make sure that the [`eks-auth:AssumeRoleForPodIdentity`](https://docs.aws.amazon.com/eks/latest/APIReference/API_auth_AssumeRoleForPodIdentity.html) permission exists in the IAM execution role for SageMaker HyperPod.

1. If the `eks-pod-identity-agent` add-on is not already installed on your EKS cluster, install it.

   ```
   aws eks create-addon \
       --cluster-name <eks_cluster_name> \
       --addon-name eks-pod-identity-agent \
       --addon-version vx.y.z-eksbuild.1
   ```

1. Create a `trust-relationship.json` file for a new role that Pods assume to call MLflow APIs.

   ```
   cat >trust-relationship.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ]
           }
       ]
   }
   EOF
   ```

   Run the following command to create the role and attach the trust relationship.

   ```
   aws iam create-role --role-name hyperpod-mlflow-role \
       --assume-role-policy-document file://trust-relationship.json \
       --description "allow pods to emit mlflow metrics and put data in s3"
   ```

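1. Create the following policy, which grants Pods permission to call all `sagemaker-mlflow` operations and to put model artifacts in S3. The tracking server already has S3 permissions, but when model artifacts are too large, the MLflow code uploads them with a direct call to S3.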

   ```
   cat >hyperpod-mlflow-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker-mlflow:AccessUI",
                   "sagemaker-mlflow:CreateExperiment",
                   "sagemaker-mlflow:SearchExperiments",
                   "sagemaker-mlflow:GetExperiment",
                   "sagemaker-mlflow:GetExperimentByName",
                   "sagemaker-mlflow:DeleteExperiment",
                   "sagemaker-mlflow:RestoreExperiment",
                   "sagemaker-mlflow:UpdateExperiment",
                   "sagemaker-mlflow:CreateRun",
                   "sagemaker-mlflow:DeleteRun",
                   "sagemaker-mlflow:RestoreRun",
                   "sagemaker-mlflow:GetRun",
                   "sagemaker-mlflow:LogMetric",
                   "sagemaker-mlflow:LogBatch",
                   "sagemaker-mlflow:LogModel",
                   "sagemaker-mlflow:LogInputs",
                   "sagemaker-mlflow:SetExperimentTag",
                   "sagemaker-mlflow:SetTag",
                   "sagemaker-mlflow:DeleteTag",
                   "sagemaker-mlflow:LogParam",
                   "sagemaker-mlflow:GetMetricHistory",
                   "sagemaker-mlflow:SearchRuns",
                   "sagemaker-mlflow:ListArtifacts",
                   "sagemaker-mlflow:UpdateRun",
                   "sagemaker-mlflow:CreateRegisteredModel",
                   "sagemaker-mlflow:GetRegisteredModel",
                   "sagemaker-mlflow:RenameRegisteredModel",
                   "sagemaker-mlflow:UpdateRegisteredModel",
                   "sagemaker-mlflow:DeleteRegisteredModel",
                   "sagemaker-mlflow:GetLatestModelVersions",
                   "sagemaker-mlflow:CreateModelVersion",
                   "sagemaker-mlflow:GetModelVersion",
                   "sagemaker-mlflow:UpdateModelVersion",
                   "sagemaker-mlflow:DeleteModelVersion",
                   "sagemaker-mlflow:SearchModelVersions",
                   "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                   "sagemaker-mlflow:TransitionModelVersionStage",
                   "sagemaker-mlflow:SearchRegisteredModels",
                   "sagemaker-mlflow:SetRegisteredModelTag",
                   "sagemaker-mlflow:DeleteRegisteredModelTag",
                   "sagemaker-mlflow:DeleteModelVersionTag",
                   "sagemaker-mlflow:DeleteRegisteredModelAlias",
                   "sagemaker-mlflow:SetRegisteredModelAlias",
                   "sagemaker-mlflow:GetModelVersionByAlias"
               ],
               "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
           }
       ]
   }
   EOF
   ```
**Note**  
The ARNs must be those of the MLflow server and of the S3 bucket that you set up with it when you created the server by following the instructions in [Set up MLflow infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).
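
The policy above hardcodes the Region, account ID, tracking server name, and bucket name. As a minimal sketch, assuming you would rather generate the document than edit the JSON by hand, the following Python builds an equivalent structure from those four values. The function name `build_mlflow_policy`, its parameters, the sample values, and the shortened action list are illustrative and not part of any AWS SDK. The S3 statement targets objects in the bucket (`/*`), on the assumption that `s3:PutObject` is evaluated against object ARNs.

```python
import json

# Illustrative subset of the sagemaker-mlflow actions listed in the policy above;
# extend it with the remaining actions that your Pods need.
MLFLOW_ACTIONS = [
    "sagemaker-mlflow:CreateExperiment",
    "sagemaker-mlflow:GetExperimentByName",
    "sagemaker-mlflow:CreateRun",
    "sagemaker-mlflow:UpdateRun",
    "sagemaker-mlflow:LogMetric",
    "sagemaker-mlflow:LogParam",
    "sagemaker-mlflow:LogBatch",
]


def build_mlflow_policy(region, account_id, tracking_server_name, bucket_name,
                        actions=MLFLOW_ACTIONS):
    """Return an IAM policy document as a dict (hypothetical helper)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    server_arn = (
        f"arn:aws:sagemaker:{region}:{account_id}:"
        f"mlflow-tracking-server/{tracking_server_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": server_arn},
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level ARN: PutObject acts on objects, not on the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# Sample values only; replace them with your own Region, account, and names.
policy = build_mlflow_policy("us-west-2", "111122223333",
                             "tracking-server-name", "mlflow-s3-bucket-name")
print(json.dumps(policy, indent=4))
```

Redirecting the printed output to `hyperpod-mlflow-policy.json` yields a file that can be passed to `aws iam put-role-policy` in the same way as the hand-written one.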

1. Attach the `mlflow-metrics-emit-policy` policy to `hyperpod-mlflow-role`, using the policy document saved in the previous step.

   ```
   aws iam put-role-policy \
     --role-name hyperpod-mlflow-role \
     --policy-name mlflow-metrics-emit-policy \
     --policy-document file://hyperpod-mlflow-policy.json
   ```

1. Create a Kubernetes service account that Pods use to access the MLflow server.

   ```
   cat >mlflow-service-account.yaml <<EOF
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: mlflow-service-account
     namespace: kubeflow
   EOF
   ```

   Run the following command to apply it to the EKS cluster.

   ```
   kubectl apply -f mlflow-service-account.yaml
   ```

1. Create a Pod identity association.

   ```
   aws eks create-pod-identity-association \
       --cluster-name EKS_CLUSTER_NAME \
       --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
       --namespace kubeflow \
       --service-account mlflow-service-account
   ```
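
The namespace and service account name in the Pod identity association should match the ones in the Kubernetes service account manifest, because the association is keyed on that pair. As a minimal sketch, assuming you use a boto3-style EKS client instead of the AWS CLI, the helper below builds the keyword arguments for the client's `create_pod_identity_association` call from a single set of values. The helper names `association_request` and `create_association` and the default values are illustrative; the client is passed in by the caller, so the block itself does not import boto3.

```python
def association_request(cluster_name, account_id,
                        role_name="hyperpod-mlflow-role",
                        namespace="kubeflow",
                        service_account="mlflow-service-account"):
    """Build keyword arguments for EKS create_pod_identity_association."""
    return {
        "clusterName": cluster_name,
        "namespace": namespace,
        "serviceAccount": service_account,
        "roleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
    }


def create_association(eks_client, **kwargs):
    """Send the request through a boto3-style EKS client supplied by the caller."""
    response = eks_client.create_pod_identity_association(
        **association_request(**kwargs)
    )
    # The response nests the new association's details under "association".
    return response["association"]["associationArn"]
```

With boto3 installed and credentials configured, `create_association(boto3.client("eks"), cluster_name="EKS_CLUSTER_NAME", account_id="111122223333")` would issue the same request as the CLI command in the last step.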

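**To collect metrics from training jobs to the MLflow server**

Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.

1. Add the following lines at the beginning of your training script.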

   ```
   import os

   import mlflow
   
   # Set the Tracking Server URI using the ARN of the Tracking Server you created
   mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
   # Enable autologging in MLflow
   mlflow.autolog()
   ```

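1. Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the *ECR User Guide*.
**Tip**  
Make sure that the Dockerfile installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible versions, see [Install MLflow and the SageMaker AI MLflow plugin](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments.html#mlflow-track-experiments-install-plugin).

1. Add the service account to the training job Pods to give them access to `hyperpod-mlflow-role`. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, and save it with the file name `mlflow-test.yaml`.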

   ```
   defaults:
    - override hydra/job_logging: stdout
   
   hydra:
    run:
     dir: .
    output_subdir: null
   
   training_cfg:
    entry_script: ./train.py
    script_args: []
    run:
     name: test-job-with-mlflow # Current run name
     nodes: 2 # Number of nodes to use for current training
     # ntasks_per_node: 1 # Number of devices to use per node
   cluster:
    cluster_type: k8s # currently k8s only
    instance_type: ml.c5.2xlarge
    cluster_config:
     # name of service account associated with the namespace
     service_account_name: mlflow-service-account
     # persistent volume, usually used to mount FSx
     persistent_volume_claims: null
     namespace: kubeflow
     # required node affinity to select nodes with SageMaker HyperPod
     # labels and passed health check if burn-in enabled
     label_selector:
         required:
             sagemaker.amazonaws.com/node-health-status:
                 - Schedulable
         preferred:
             sagemaker.amazonaws.com/deep-health-check-status:
                 - Passed
         weights:
             - 100
     pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
     restartPolicy: OnFailure # restart policy
   
   base_results_dir: ./result # Location to store the results, checkpoints and logs.
   container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
   
   env_vars:
    NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
    MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
   ```

1. Start the job using the YAML file as follows.

   ```
   hyperpod start-job --config-file /path/to/mlflow-test.yaml
   ```

1. Generate a pre-signed URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

   ```
   aws sagemaker create-presigned-mlflow-tracking-server-url \
       --tracking-server-name "tracking-server-name" \
       --session-expiration-duration-in-seconds 1800 \
       --expires-in-seconds 300 \
       --region region
   ```
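
A missing or mistyped `MLFLOW_TRACKING_ARN` is easy to introduce when the job configuration is edited by hand. As a minimal sketch, the helper below reads the variable and checks it against the ARN shape used in this guide before the training script hands it to MLflow. The function name `tracking_arn_from_env` and the regular expression are assumptions derived from the example ARNs shown here, not an official validation rule.

```python
import os
import re

# Shape of a SageMaker MLflow tracking server ARN as shown in this guide:
# arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
_ARN_RE = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/\S+$"
)


def tracking_arn_from_env(env=None, var="MLFLOW_TRACKING_ARN"):
    """Return the tracking server ARN from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    arn = env.get(var, "").strip()
    if not arn:
        raise RuntimeError(f"{var} is not set; add it under env_vars in the job config")
    if not _ARN_RE.match(arn):
        raise ValueError(f"{var} does not look like a tracking server ARN: {arn!r}")
    return arn
```

In the training script, `mlflow.set_tracking_uri(tracking_arn_from_env())` could then replace the direct `os.environ` lookup, so that a malformed value such as an account ID with a missing digit fails at startup with a readable message.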