Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS - Amazon SageMaker AI

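SageMaker HyperPod clusters orchestrated by Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster administrators set up the MLflow server and connect it to the SageMaker HyperPod cluster. Data scientists can then use the logged metrics to gain insight into their models.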

Set up an MLflow server using the AWS CLI

A cluster administrator must create the MLflow tracking server.

  1. Create a SageMaker AI MLflow tracking server by following the instructions in Create a tracking server using the AWS CLI.

  2. Make sure that the eks-auth:AssumeRoleForPodIdentity permission exists in the IAM execution role for SageMaker HyperPod.

  3. If the eks-pod-identity-agent add-on is not already installed on your EKS cluster, install it.

    aws eks create-addon \
      --cluster-name <eks_cluster_name> \
      --addon-name eks-pod-identity-agent \
      --addon-version vx.y.z-eksbuild.1
  4. Create a trust-relationship.json file for a new role that Pods will assume to call MLflow APIs.

    cat >trust-relationship.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
          "Effect": "Allow",
          "Principal": {
            "Service": "pods.eks.amazonaws.com"
          },
          "Action": [
            "sts:AssumeRole",
            "sts:TagSession"
          ]
        }
      ]
    }
    EOF

    Run the following command to create the role and attach the trust relationship.

    aws iam create-role --role-name hyperpod-mlflow-role \
      --assume-role-policy-document file://trust-relationship.json \
      --description "allow pods to emit mlflow metrics and put data in s3"
  5. Create the following policy, which grants Pods access to call all sagemaker-mlflow operations and to put model artifacts in S3. S3 permissions already exist on the tracking server, but when a model artifact is too large, the MLflow code uploads it with a direct call to S3, so the Pod role needs S3 write access as well.

    cat >hyperpod-mlflow-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "sagemaker-mlflow:AccessUI",
            "sagemaker-mlflow:CreateExperiment",
            "sagemaker-mlflow:SearchExperiments",
            "sagemaker-mlflow:GetExperiment",
            "sagemaker-mlflow:GetExperimentByName",
            "sagemaker-mlflow:DeleteExperiment",
            "sagemaker-mlflow:RestoreExperiment",
            "sagemaker-mlflow:UpdateExperiment",
            "sagemaker-mlflow:CreateRun",
            "sagemaker-mlflow:DeleteRun",
            "sagemaker-mlflow:RestoreRun",
            "sagemaker-mlflow:GetRun",
            "sagemaker-mlflow:LogMetric",
            "sagemaker-mlflow:LogBatch",
            "sagemaker-mlflow:LogModel",
            "sagemaker-mlflow:LogInputs",
            "sagemaker-mlflow:SetExperimentTag",
            "sagemaker-mlflow:SetTag",
            "sagemaker-mlflow:DeleteTag",
            "sagemaker-mlflow:LogParam",
            "sagemaker-mlflow:GetMetricHistory",
            "sagemaker-mlflow:SearchRuns",
            "sagemaker-mlflow:ListArtifacts",
            "sagemaker-mlflow:UpdateRun",
            "sagemaker-mlflow:CreateRegisteredModel",
            "sagemaker-mlflow:GetRegisteredModel",
            "sagemaker-mlflow:RenameRegisteredModel",
            "sagemaker-mlflow:UpdateRegisteredModel",
            "sagemaker-mlflow:DeleteRegisteredModel",
            "sagemaker-mlflow:GetLatestModelVersions",
            "sagemaker-mlflow:CreateModelVersion",
            "sagemaker-mlflow:GetModelVersion",
            "sagemaker-mlflow:UpdateModelVersion",
            "sagemaker-mlflow:DeleteModelVersion",
            "sagemaker-mlflow:SearchModelVersions",
            "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
            "sagemaker-mlflow:TransitionModelVersionStage",
            "sagemaker-mlflow:SearchRegisteredModels",
            "sagemaker-mlflow:SetRegisteredModelTag",
            "sagemaker-mlflow:DeleteRegisteredModelTag",
            "sagemaker-mlflow:DeleteModelVersionTag",
            "sagemaker-mlflow:DeleteRegisteredModelAlias",
            "sagemaker-mlflow:SetRegisteredModelAlias",
            "sagemaker-mlflow:GetModelVersionByAlias"
          ],
          "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject"
          ],
          "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
        }
      ]
    }
    EOF
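    Note

    The ARNs must be those of your MLflow tracking server and of the S3 bucket that was configured for that server when you created it by following the instructions in Set up MLflow infrastructure.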

  6. Attach the mlflow-metrics-emit-policy policy to hyperpod-mlflow-role, using the policy document saved in the previous step.

    aws iam put-role-policy \
      --role-name hyperpod-mlflow-role \
      --policy-name mlflow-metrics-emit-policy \
      --policy-document file://hyperpod-mlflow-policy.json
  7. Create a Kubernetes service account that Pods use to access the MLflow server.

    cat >mlflow-service-account.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mlflow-service-account
      namespace: kubeflow
    EOF

    Run the following command to apply it to the EKS cluster.

    kubectl apply -f mlflow-service-account.yaml
  8. Create a Pod identity association.

    aws eks create-pod-identity-association \
      --cluster-name EKS_CLUSTER_NAME \
      --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
      --namespace kubeflow \
      --service-account mlflow-service-account
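
The two policy documents in the steps above embed a Region, an account ID, a tracking server name, and a bucket name, which are easy to mistype when edited by hand. The following is a minimal sketch, not part of the official procedure, that generates both documents from parameters. The function names and the sample values are hypothetical; only the policy structure comes from the steps above. The S3 resource is written as the object pattern `bucket/*`, because s3:PutObject is an object-level action.

```python
import json
import re


def build_trust_policy():
    """Trust policy that lets EKS Pod Identity assume the role (step 4)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def build_mlflow_policy(region, account_id, server_name, bucket, actions):
    """Permissions policy for the Pod role (step 5).

    `actions` holds bare sagemaker-mlflow action names such as "LogMetric";
    the caller supplies the full list from step 5.
    """
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs have exactly 12 digits")
    server_arn = (
        f"arn:aws:sagemaker:{region}:{account_id}"
        f":mlflow-tracking-server/{server_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"sagemaker-mlflow:{name}" for name in actions],
                "Resource": server_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level action, so the resource is the object pattern.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Sample values only; replace them with your own.
    policy = build_mlflow_policy(
        "us-west-2",
        "111122223333",
        "tracking-server-name",
        "mlflow-s3-bucket-name",
        ["CreateRun", "LogMetric", "LogParam"],  # subset, for brevity
    )
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(policy, indent=2))
```

The printed JSON could be saved as trust-relationship.json and hyperpod-mlflow-policy.json and passed to the same aws iam commands shown in the steps.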

Collect metrics from training jobs to the MLflow server

Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.

  1. Add the following lines at the beginning of your training script.

    import os
    import mlflow
    # Set the tracking server URI using the ARN of the tracking server you created
    mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
    # Enable autologging in MLflow
    mlflow.autolog()
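  2. Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.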

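    Tip

    Make sure that your Dockerfile installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible versions, see Install MLflow and the SageMaker AI MLflow plugin.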

  3. Add the service account to the training job Pods to give them access to hyperpod-mlflow-role. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, and save it with the file name mlflow-test.yaml.

    defaults:
      - override hydra/job_logging: stdout

    hydra:
      run:
        dir: .
      output_subdir: null

    training_cfg:
      entry_script: ./train.py
      script_args: []
      run:
        name: test-job-with-mlflow # Current run name
        nodes: 2 # Number of nodes to use for current training
        # ntasks_per_node: 1 # Number of devices to use per node
    cluster:
      cluster_type: k8s # currently k8s only
      instance_type: ml.c5.2xlarge
      cluster_config:
        # name of service account associated with the namespace
        service_account_name: mlflow-service-account
        # persistent volume, usually used to mount FSx
        persistent_volume_claims: null
        namespace: kubeflow
        # required node affinity to select nodes with SageMaker HyperPod
        # labels and passed health check if burn-in enabled
        label_selector:
          required:
            sagemaker.amazonaws.com/node-health-status:
              - Schedulable
          preferred:
            sagemaker.amazonaws.com/deep-health-check-status:
              - Passed
          weights:
            - 100
        pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
        restartPolicy: OnFailure # restart policy

    base_results_dir: ./result # Location to store the results, checkpoints and logs.
    container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use

    env_vars:
      NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
      MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
  4. Start the job using the YAML file as follows.

    hyperpod start-job --config-file /path/to/mlflow-test.yaml
  5. Generate a pre-signed URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

    aws sagemaker create-presigned-mlflow-tracking-server-url \
      --tracking-server-name "tracking-server-name" \
      --session-expiration-duration-in-seconds 1800 \
      --expires-in-seconds 300 \
      --region region
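
The training script in step 1 reads the tracking server ARN from the MLFLOW_TRACKING_ARN environment variable, which the job template sets under env_vars. A malformed value (for example, an account ID with a missing digit) only surfaces once the job is running. The following is a minimal sketch of a check that could run first. The helper names and the regular expression are assumptions for illustration, not part of MLflow or the SageMaker SDK; in particular, the allowed characters in the server name are a guess.

```python
import os
import re

# Assumed ARN shape:
# arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
_TRACKING_ARN_RE = re.compile(
    r"arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)"
)


def parse_tracking_server_arn(arn):
    """Split a tracking server ARN into its parts, or raise ValueError."""
    match = _TRACKING_ARN_RE.fullmatch(arn)
    if match is None:
        raise ValueError(f"not a valid MLflow tracking server ARN: {arn!r}")
    return match.groupdict()


def tracking_arn_from_env(env=None):
    """Return MLFLOW_TRACKING_ARN from the environment after validating it."""
    env = os.environ if env is None else env
    arn = env.get("MLFLOW_TRACKING_ARN")
    if not arn:
        raise RuntimeError("MLFLOW_TRACKING_ARN is not set")
    parse_tracking_server_arn(arn)  # raises ValueError if malformed
    return arn
```

In the training script, the value returned by `tracking_arn_from_env()` could be passed to `mlflow.set_tracking_uri(...)` before `mlflow.autolog()` is called, so that a bad ARN fails fast with a clear message.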