Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS - Amazon SageMaker AI

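SageMaker HyperPod clusters orchestrated by Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster administrators set up the MLflow server and connect it to the SageMaker HyperPod clusters. Data scientists can then use it to gain insights into their models.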

Set up an MLflow server using the AWS CLI

The cluster administrator must create the MLflow tracking server.

  1. Create a SageMaker AI MLflow tracking server by following the instructions in Create a tracking server using the AWS CLI.

  2. Make sure that the eks-auth:AssumeRoleForPodIdentity permission exists in the IAM execution role for SageMaker HyperPod.

  3. If the eks-pod-identity-agent add-on is not already installed on your EKS cluster, install it.

    aws eks create-addon \
        --cluster-name <eks_cluster_name> \
        --addon-name eks-pod-identity-agent \
        --addon-version vx.y.z-eksbuild.1
  4. Create a trust-relationship.json file for a new role that Pods will use to call MLflow APIs.

    cat >trust-relationship.json <<EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                "Effect": "Allow",
                "Principal": {
                    "Service": "pods.eks.amazonaws.com"
                },
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession"
                ]
            }
        ]
    }
    EOF

    Run the following command to create the role and attach the trust relationship.

    aws iam create-role --role-name hyperpod-mlflow-role \
        --assume-role-policy-document file://trust-relationship.json \
        --description "allow pods to emit mlflow metrics and put data in s3"
  5. Create the following policy, which allows Pods to call all sagemaker-mlflow operations and to put model artifacts in S3. The tracking server already has S3 permissions, but when a model artifact is too large, the MLflow code uploads it with a direct call to S3.

    cat >hyperpod-mlflow-policy.json <<EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker-mlflow:AccessUI",
                    "sagemaker-mlflow:CreateExperiment",
                    "sagemaker-mlflow:SearchExperiments",
                    "sagemaker-mlflow:GetExperiment",
                    "sagemaker-mlflow:GetExperimentByName",
                    "sagemaker-mlflow:DeleteExperiment",
                    "sagemaker-mlflow:RestoreExperiment",
                    "sagemaker-mlflow:UpdateExperiment",
                    "sagemaker-mlflow:CreateRun",
                    "sagemaker-mlflow:DeleteRun",
                    "sagemaker-mlflow:RestoreRun",
                    "sagemaker-mlflow:GetRun",
                    "sagemaker-mlflow:LogMetric",
                    "sagemaker-mlflow:LogBatch",
                    "sagemaker-mlflow:LogModel",
                    "sagemaker-mlflow:LogInputs",
                    "sagemaker-mlflow:SetExperimentTag",
                    "sagemaker-mlflow:SetTag",
                    "sagemaker-mlflow:DeleteTag",
                    "sagemaker-mlflow:LogParam",
                    "sagemaker-mlflow:GetMetricHistory",
                    "sagemaker-mlflow:SearchRuns",
                    "sagemaker-mlflow:ListArtifacts",
                    "sagemaker-mlflow:UpdateRun",
                    "sagemaker-mlflow:CreateRegisteredModel",
                    "sagemaker-mlflow:GetRegisteredModel",
                    "sagemaker-mlflow:RenameRegisteredModel",
                    "sagemaker-mlflow:UpdateRegisteredModel",
                    "sagemaker-mlflow:DeleteRegisteredModel",
                    "sagemaker-mlflow:GetLatestModelVersions",
                    "sagemaker-mlflow:CreateModelVersion",
                    "sagemaker-mlflow:GetModelVersion",
                    "sagemaker-mlflow:UpdateModelVersion",
                    "sagemaker-mlflow:DeleteModelVersion",
                    "sagemaker-mlflow:SearchModelVersions",
                    "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                    "sagemaker-mlflow:TransitionModelVersionStage",
                    "sagemaker-mlflow:SearchRegisteredModels",
                    "sagemaker-mlflow:SetRegisteredModelTag",
                    "sagemaker-mlflow:DeleteRegisteredModelTag",
                    "sagemaker-mlflow:DeleteModelVersionTag",
                    "sagemaker-mlflow:DeleteRegisteredModelAlias",
                    "sagemaker-mlflow:SetRegisteredModelAlias",
                    "sagemaker-mlflow:GetModelVersionByAlias"
                ],
                "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
            }
        ]
    }
    EOF
    Note

    The ARNs are those of the MLflow server and of the S3 bucket that you configured for it when you created the server by following the instructions in Set up MLflow infrastructure. Because s3:PutObject applies to objects rather than to the bucket itself, the S3 resource ARN ends with /* so that it matches objects in the bucket.

  6. Attach the mlflow-metrics-emit-policy policy to hyperpod-mlflow-role, using the policy document saved in the previous step.

    aws iam put-role-policy \
        --role-name hyperpod-mlflow-role \
        --policy-name mlflow-metrics-emit-policy \
        --policy-document file://hyperpod-mlflow-policy.json
  7. Create a Kubernetes service account that Pods use to access the MLflow server.

    cat >mlflow-service-account.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mlflow-service-account
      namespace: kubeflow
    EOF

    Run the following command to apply it to the EKS cluster.

    kubectl apply -f mlflow-service-account.yaml
  8. Create a Pod identity association.

    aws eks create-pod-identity-association \
        --cluster-name EKS_CLUSTER_NAME \
        --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
        --namespace kubeflow \
        --service-account mlflow-service-account
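
The resource ARNs in the permissions policy from step 5 are easy to mistype when edited by hand. As a minimal sketch, assuming you would rather generate hyperpod-mlflow-policy.json than edit it, the following Python script fills in both ARNs from its inputs. The helper name build_mlflow_policy, its parameters, and the example server and bucket names are hypothetical and not part of any AWS tooling. The action list in the example is shortened for readability and would need to be extended to the full list from step 5. The sketch scopes s3:PutObject to objects in the bucket (the bucket ARN followed by /*), because that action applies to objects.

```python
import json


def build_mlflow_policy(region, account_id, tracking_server_name,
                        bucket_name, mlflow_actions):
    """Return an IAM policy document (as a dict) for the Pod role.

    Hypothetical helper for illustration; not part of any AWS SDK.
    mlflow_actions holds action names without the sagemaker-mlflow: prefix.
    """
    # AWS account IDs are exactly 12 digits; catch typos early.
    if not (len(account_id) == 12 and account_id.isascii()
            and account_id.isdigit()):
        raise ValueError("account_id must be a 12-digit string")

    tracking_server_arn = (
        f"arn:aws:sagemaker:{region}:{account_id}:"
        f"mlflow-tracking-server/{tracking_server_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"sagemaker-mlflow:{name}"
                           for name in mlflow_actions],
                "Resource": tracking_server_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level action, so match objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Example values only; replace them with your own.
    policy = build_mlflow_policy(
        region="us-west-2",
        account_id="111122223333",
        tracking_server_name="tracking-server-name",
        bucket_name="example-mlflow-bucket",
        mlflow_actions=["CreateRun", "LogMetric", "LogParam"],  # shortened
    )
    print(json.dumps(policy, indent=4))
```

Redirecting the printed output to hyperpod-mlflow-policy.json would produce a file that the put-role-policy command in step 6 can read.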

Collect metrics from training jobs to the MLflow server

Data scientists need to set up the training script and the Docker image to emit metrics to the MLflow server.

  1. Add the following lines at the beginning of your training script.

    import os
    import mlflow
    # Set the tracking server URI using the ARN of the tracking server you created
    mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
    # Enable autologging in MLflow
    mlflow.autolog()
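  2. Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.

    Tip

    Make sure that your Dockerfile installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible versions, see Install MLflow and the SageMaker AI MLflow plugin.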

  3. Add the service account to the training job Pods to give them access to hyperpod-mlflow-role. This allows the Pods to call MLflow APIs. Create the following SageMaker HyperPod CLI job submission template and save it as mlflow-test.yaml.

    defaults:
      - override hydra/job_logging: stdout
    hydra:
      run:
        dir: .
      output_subdir: null
    training_cfg:
      entry_script: ./train.py
      script_args: []
      run:
        name: test-job-with-mlflow # Current run name
        nodes: 2 # Number of nodes to use for current training
        # ntasks_per_node: 1 # Number of devices to use per node
    cluster:
      cluster_type: k8s # currently k8s only
      instance_type: ml.c5.2xlarge
      cluster_config:
        # name of service account associated with the namespace
        service_account_name: mlflow-service-account
        # persistent volume, usually used to mount FSx
        persistent_volume_claims: null
        namespace: kubeflow
        # required node affinity to select nodes with SageMaker HyperPod
        # labels and passed health check if burn-in enabled
        label_selector:
          required:
            sagemaker.amazonaws.com/node-health-status:
              - Schedulable
          preferred:
            sagemaker.amazonaws.com/deep-health-check-status:
              - Passed
          weights:
            - 100
        pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
        restartPolicy: OnFailure # restart policy
    base_results_dir: ./result # Location to store the results, checkpoints and logs.
    container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
    env_vars:
      NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
      MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
  4. Start the job using the YAML file as follows.

    hyperpod start-job --config-file /path/to/mlflow-test.yaml
  5. Generate a presigned URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

    aws sagemaker create-presigned-mlflow-tracking-server-url \
        --tracking-server-name "tracking-server-name" \
        --session-expiration-duration-in-seconds 1800 \
        --expires-in-seconds 300 \
        --region region
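
A malformed MLFLOW_TRACKING_ARN value in the job template only surfaces once the training script tries to reach the tracking server. As a minimal sketch, assuming you want the script to fail fast on an obviously malformed value, a check like the following could run before mlflow.set_tracking_uri. The helper name parse_tracking_server_arn and the regular expression are assumptions based on the ARN shape used in this topic (region, 12-digit account ID, then mlflow-tracking-server/ and the server name). They are not an official validation rule, and the pattern covers only the aws partition.

```python
import re

# Assumed ARN shape, derived from the examples in this topic.
# This is an illustrative pattern, not an official AWS validation rule.
_TRACKING_SERVER_ARN = re.compile(
    r"arn:aws:sagemaker:"
    r"(?P<region>[a-z0-9-]+):"
    r"(?P<account_id>[0-9]{12}):"
    r"mlflow-tracking-server/(?P<name>[A-Za-z0-9][A-Za-z0-9-]*)"
)


def parse_tracking_server_arn(arn):
    """Split a tracking server ARN into region, account ID, and name.

    Hypothetical helper for illustration. Raises ValueError when the
    value does not match the assumed shape, for example when the
    account ID does not have exactly 12 digits.
    """
    match = _TRACKING_SERVER_ARN.fullmatch(arn)
    if match is None:
        raise ValueError(f"Not a valid MLflow tracking server ARN: {arn!r}")
    return match.groupdict()
```

In the training script, calling parse_tracking_server_arn(os.environ['MLFLOW_TRACKING_ARN']) before mlflow.set_tracking_uri would stop the job with a clear error instead of a later connection failure.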