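Using the training operator to run jobs

To run a job with kubectl, create a job.yaml file that specifies the job and submit it with kubectl apply -f job.yaml. In this YAML file, you can specify custom configurations in the logMonitoringConfiguration argument to define automated monitoring rules that analyze the log output of the distributed training job, detect problems, and recover from them.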


apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, treat the job as hung. A non-decimal loss (e.g. nan) does not match the pattern, so it is treated as a hang as well.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If next checkpoint s3 upload doesn't happen within 10 mins, mark it hang.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job

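If you want to use the log monitoring options, make sure that you emit the training logs to sys.stdout. The HyperPod elastic agent monitors the training logs in sys.stdout, which are saved under /tmp/hyperpod/. You can use the following Python logging configuration to emit training logs.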


logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
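
As a minimal sketch of how the line above fits into a training script, the following wraps it in a helper and emits one line that the "JobStart" rule from the earlier job.yaml would match. The helper name `configure_stdout_logging`, the logger name `train`, and the sample message are illustrative assumptions, not part of HyperPod.

```python
import logging
import sys


def configure_stdout_logging(stream=None):
    """Send training logs to stdout (or to `stream`, which is handy for testing)."""
    logging.basicConfig(
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        level=logging.INFO,
        stream=stream if stream is not None else sys.stdout,
        force=True,  # replace handlers that an earlier basicConfig call installed
    )
    return logging.getLogger("train")  # hypothetical logger name


if __name__ == "__main__":
    log = configure_stdout_logging()
    # Hypothetical first log line; it matches ".*Experiment configuration.*".
    log.info("Experiment configuration: lr=0.001")
```

The `force=True` argument requires Python 3.8 or later; without it, a second call to `logging.basicConfig` is silently ignored if a library has already configured the root logger.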

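The following table describes all the possible log monitoring configurations:

Parameter	Usage
jobMaxRetryCount	Maximum number of restarts at the process level.
restartPolicy: numRestartBeforeFullJobRestart	Maximum number of restarts at the process level before the operator restarts at the job level.
restartPolicy: evalPeriodSeconds	The period over which the restart limit is evaluated, in seconds.
restartPolicy: maxFullJobRestarts	Maximum number of full job restarts before the job fails.
cleanPodPolicy	Specifies the pods that the operator should clean up. Accepted values are `All`, `OnlyComplete`, and `None`.
logMonitoringConfiguration	The log monitoring rules for slow and hanging job detection.
expectedRecurringFrequencyInSeconds	Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, there is no time limit between consecutive LogPattern matches.
expectedStartCutOffInSeconds	Time to the first LogPattern match after which the rule evaluates to HANGING. If not specified, there is no time limit for the first LogPattern match.
logPattern	Regular expression that identifies the log lines that the rule applies to while the rule is active.
metricEvaluationDataPoints	Number of consecutive times that a rule must evaluate to SLOW before the job is marked as SLOW. If not specified, the default is 1.
metricThreshold	Threshold for the value that LogPattern extracts with a capturing group. If not specified, no metric evaluation is performed.
operator	The inequality to apply to the monitoring configuration. Accepted values are `gt`, `gteq`, `lt`, `lteq`, and `eq`.
stopPattern	Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active.
faultOnMatch	Indicates whether a LogPattern match should immediately trigger a job fault. When true, the job is marked as faulted as soon as LogPattern matches, regardless of the other rule parameters. When false or not specified, the rule evaluates to SLOW or HANGING based on the other parameters.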

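For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to continue running the job on nodes that were reserved in advance. Spare node configurations require Kueue, so a job that is submitted with spare nodes fails if Kueue is not installed. The following example is a sample job.yaml file that contains a spare node configuration.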



apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"

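Monitoring

Amazon SageMaker HyperPod integrates with observability through Amazon Managed Grafana and Amazon Managed Service for Prometheus, so you can set up monitoring to collect metrics and feed them into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor in your job.yaml file when you run jobs with kubectl.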


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s

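The following are the events that the training operator emits. You can feed them into Amazon Managed Service for Prometheus to monitor your training jobs.

Event	Description
hyperpod_training_operator_jobs_created_total	Total number of jobs that the training operator has run
hyperpod_training_operator_jobs_restart_latency	Current job restart latency
hyperpod_training_operator_jobs_fault_detection_latency	Fault detection latency
hyperpod_training_operator_jobs_deleted_total	Total number of deleted jobs
hyperpod_training_operator_jobs_successful_total	Total number of completed jobs
hyperpod_training_operator_jobs_failed_total	Total number of failed jobs
hyperpod_training_operator_jobs_restarted_total	Total number of automatically restarted jobs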
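
As a rough illustration of how these metrics could be consumed once scraped, the sketch below parses a payload in the Prometheus text exposition format and derives a failure ratio from two of the counters listed above. It assumes the metrics are exposed as unlabeled samples without timestamps, which the source does not confirm. The payload in `SAMPLE` is hypothetical, and `parse_counters` and `failure_ratio` are illustrative helper names.

```python
def parse_counters(exposition_text):
    """Parse unlabeled `name value` samples from Prometheus text format."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def failure_ratio(samples):
    """Failed jobs as a fraction of jobs that reached a terminal state."""
    failed = samples.get("hyperpod_training_operator_jobs_failed_total", 0.0)
    succeeded = samples.get("hyperpod_training_operator_jobs_successful_total", 0.0)
    finished = failed + succeeded
    return failed / finished if finished else 0.0


# Hypothetical scrape payload, for illustration only.
SAMPLE = """\
# TYPE hyperpod_training_operator_jobs_successful_total counter
hyperpod_training_operator_jobs_successful_total 18
hyperpod_training_operator_jobs_failed_total 2
"""

if __name__ == "__main__":
    print(failure_ratio(parse_counters(SAMPLE)))
```

In practice an expression in the Prometheus query language would usually compute this ratio on the server side; the Python version only makes the arithmetic explicit.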

Sample docker configuration

The following is a sample docker file that you can run with the hyperpodrun command.


export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}

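Sample log monitoring configurations

Job hang detection

To detect job hangs, use the following configuration. It uses the following parameters:

expectedStartCutOffInSeconds – how long the monitor should wait before it expects the first logs
expectedRecurringFrequencyInSeconds – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line that matches the regex pattern .*Train Epoch.* within 60 seconds after the training job starts. After the first appearance, the monitor expects to see a matching log line every 10 seconds. If the first log line doesn't appear within 60 seconds, or if subsequent log lines don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.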


runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
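
The timing logic described above can be sketched in a few lines of Python. This is an illustration of the semantics in the parameter table, not the elastic agent's implementation. The function name `evaluate_hang`, the `(seconds_since_start, line)` event format, and the use of Python's `re.match` are all assumptions.

```python
import re


def evaluate_hang(pattern, events, now, start_cutoff=None, recurring=None):
    """Return "HANGING" or "OK" for one rule.

    events: list of (seconds_since_job_start, log_line), in time order.
    now: current time, in seconds since job start.
    """
    regex = re.compile(pattern)
    matches = [t for t, line in events if t <= now and regex.match(line)]
    if not matches:
        # Nothing matched yet: hanging only once the start cut-off has passed.
        if start_cutoff is not None and now > start_cutoff:
            return "HANGING"
        return "OK"
    if start_cutoff is not None and matches[0] > start_cutoff:
        return "HANGING"
    if recurring is not None:
        # Check each gap between matches, and the gap from the last match to now.
        checkpoints = matches + [now]
        for earlier, later in zip(checkpoints, checkpoints[1:]):
            if later - earlier > recurring:
                return "HANGING"
    return "OK"


if __name__ == "__main__":
    line = "[default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5"
    events = [(5.0, line), (12.0, line), (20.0, line)]
    print(evaluate_hang(".*Train Epoch.*", events, now=25.0, recurring=10))
    print(evaluate_hang(".*Train Epoch.*", events, now=31.0, recurring=10))
```

In the example, the second call reports a hang because 11 seconds have passed since the last matching line at 20 seconds.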

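Training loss spike

The following monitoring configuration matches training logs with the pattern xxx training_loss_step xx. It uses the parameter metricEvaluationDataPoints, which lets you specify how many data points must breach the threshold before the operator restarts the job. If the training loss value exceeds 2.0 for 5 consecutive data points, the operator restarts the job.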


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
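
To make the interaction of metricThreshold, operator, and metricEvaluationDataPoints concrete, here is a minimal sketch of the evaluation. It assumes that a rule counts as breached when `value <operator> threshold` holds, and that lines which do not match the pattern leave the streak unchanged. Neither detail is spelled out in the source. The names `OPERATORS` and `is_slow` are illustrative.

```python
import operator
import re

# Accepted operator names from the parameter table, mapped to Python comparisons.
OPERATORS = {
    "gt": operator.gt,
    "gteq": operator.ge,
    "lt": operator.lt,
    "lteq": operator.le,
    "eq": operator.eq,
}


def is_slow(pattern, lines, threshold, op, data_points=1):
    """True once `data_points` consecutive matched values satisfy the comparison."""
    regex = re.compile(pattern)
    compare = OPERATORS[op]
    streak = 0
    for line in lines:
        found = regex.match(line)
        if not found:
            continue  # assumption: non-matching lines do not reset the streak
        value = float(found.group(1))
        streak = streak + 1 if compare(value, threshold) else 0
        if streak >= data_points:
            return True
    return False


if __name__ == "__main__":
    pattern = r".*training_loss_step (\d+(?:\.\d+)?).*"
    losses = [2.5, 3.0, 1.0, 2.1, 2.2, 2.3, 2.4, 2.6]
    lines = [f"step {i} training_loss_step {loss}" for i, loss in enumerate(losses)]
    print(is_slow(pattern, lines, threshold=2.0, op="gt", data_points=5))
```

In this run the loss of 1.0 resets the streak, so the rule only fires on the eighth value, after five consecutive losses above 2.0.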

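Low TFLOPs detection

The following monitoring configuration expects a training log with the pattern xx TFLOPs xx every 5 seconds. If the TFLOPs value is less than 100 for 5 data points, the operator restarts the training job.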


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
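
Before relying on a capturing group for metric evaluation, it can help to check what the pattern actually captures. The sketch below does this with Python's `re` module, which may differ from the regex engine the operator uses. The sample log line is hypothetical, and `TIGHTER_PATTERN` is a suggested alternative that is not part of the original configuration.

```python
import re

TFLOPS_PATTERN = r".* (.+)TFLOPs.*"  # pattern from the configuration above
# Assumption: a stricter variant that captures only the number.
TIGHTER_PATTERN = r".* (\d+(?:\.\d+)?) ?TFLOPs.*"


def captured_value(pattern, line):
    """Return the first capture group as text, or None if the line does not match."""
    found = re.match(pattern, line)
    return found.group(1) if found else None


if __name__ == "__main__":
    sample = "Training model, speed: 120 TFLOPs..."  # hypothetical log line
    print(repr(captured_value(TFLOPS_PATTERN, sample)))   # '120 '
    print(repr(captured_value(TIGHTER_PATTERN, sample)))  # '120'
```

With the sample line, the original pattern captures the number together with the trailing space. The source does not say whether the operator trims whitespace before comparing against metricThreshold, so a pattern that captures only the digits is the safer choice if the log format includes that space.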

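Training script error log detection

The following monitoring configuration detects whether the pattern specified in logPattern is present in the training logs. As soon as the training operator encounters the error pattern, it treats it as a fault and restarts the job.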


runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
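
A quick way to check that a fault pattern catches the messages you care about is to run it against sample log lines. The following sketch uses Python's `re.match` as a stand-in for the operator's matcher, which is an assumption. The log lines are hypothetical examples written in the style of PyTorch error messages, and `first_fault` is an illustrative helper name.

```python
import re

FAULT_PATTERN = r".*RuntimeError.*out of memory.*"  # pattern from the configuration above


def first_fault(lines, pattern=FAULT_PATTERN):
    """Return the index of the first line that matches the fault pattern, or None."""
    regex = re.compile(pattern)
    for index, line in enumerate(lines):
        if regex.match(line):
            return index
    return None


if __name__ == "__main__":
    logs = [
        "Train Epoch: 1 loss=0.5",
        "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
    ]
    print(first_fault(logs))
    # A message that names a different exception class is not caught by this pattern.
    print(first_fault(["torch.OutOfMemoryError: CUDA out of memory."]))
```

The second call returns None, which shows that the pattern depends on the exact exception name in the log. If your framework version reports out-of-memory errors under another class name, the pattern needs to be widened accordingly.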
