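Autoscaling policies for your HyperPod inference model deployment

The following information provides practical examples and configurations for implementing autoscaling policies on Amazon SageMaker HyperPod inference model deployments.

You'll learn how to configure autoscaling with the built-in autoScalingSpec in your deployment YAML files, and how to create standalone KEDA ScaledObject configurations for advanced scaling scenarios. The examples cover scaling triggers based on CloudWatch metrics, Amazon SQS queue length, Prometheus queries, and resource utilization metrics such as CPU and memory.

Using autoScalingSpec in deployment YAML

The Amazon SageMaker HyperPod inference operator provides built-in autoscaling for model deployments using metrics from CloudWatch and Amazon Managed Prometheus (AMP). The following deployment YAML example includes an autoScalingSpec section that defines the configuration values for scaling your model deployment.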


apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: deepseek-sample624
  namespace: ns-team-a
spec:
  sageMakerEndpoint:
    name: deepsek7bsme624
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.4
  server:
    instanceType: ml.g5.8xlarge
  metrics:
    enabled: true
  environmentVariables:
    - name: SAMPLE_ENV_VAR
      value: "sample_value"
  maxDeployTimeInSeconds: 1800
  tlsConfig:
    tlsCertificateOutputS3Uri: "s3://{USER}-tls-bucket-{REGION}/certificates"
  autoScalingSpec:
    minReplicaCount: 0
    maxReplicaCount: 5
    pollingInterval: 15
    initialCooldownPeriod: 60
    cooldownPeriod: 120
    scaleDownStabilizationTime: 60
    scaleUpStabilizationTime: 0
    cloudWatchTrigger:
        name: "SageMaker-Invocations"
        namespace: "AWS/SageMaker"
        useCachedMetrics: false
        metricName: "Invocations"
        targetValue: 10.5
        activationTargetValue: 5.0
        minValue: 0.0
        metricCollectionStartTime: 300
        metricCollectionPeriod: 30
        metricStat: "Sum"
        metricType: "Average"
        dimensions:
          - name: "EndpointName"
            value: "deepsek7bsme624"
          - name: "VariantName"
            value: "AllTraffic"
    prometheusTrigger: 
        name: "Prometheus-Trigger"
        useCachedMetrics: false
        serverAddress: http://<prometheus-host>:9090
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        targetValue: 10.0
        activationTargetValue: 5.0
        namespace: "namespace"
        customHeaders: "X-Client-Id=cid"
        metricType: "Value"

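Descriptions of fields used in the deployment YAML

minReplicaCount (Optional, Integer)

Specifies the minimum number of model deployment replicas to maintain in the cluster. During scale-down events, the deployment scales down to this minimum number of pods. Must be greater than or equal to 0. Default: 1.

maxReplicaCount (Optional, Integer)

Specifies the maximum number of model deployment replicas to maintain in the cluster. Must be greater than or equal to minReplicaCount. During scale-up events, the deployment scales up to this maximum number of pods. Default: 5.

pollingInterval (Optional, Integer)

The time interval, in seconds, for querying metrics. Minimum: 0. Default: 30 seconds.

cooldownPeriod (Optional, Integer)

The time interval, in seconds, to wait before scaling down from 1 to 0 pods during a scale-down event. Only applies when minReplicaCount is set to 0. Minimum: 0. Default: 300 seconds.

initialCooldownPeriod (Optional, Integer)

The time interval, in seconds, to wait before scaling down from 1 to 0 pods during initial deployment. Only applies when minReplicaCount is set to 0. Minimum: 0. Default: 300 seconds.

scaleDownStabilizationTime (Optional, Integer)

The stabilization window, in seconds, between the activation of a scale-down trigger and the scale-down itself. Minimum: 0. Default: 300 seconds.

scaleUpStabilizationTime (Optional, Integer)

The stabilization window, in seconds, between the activation of a scale-up trigger and the scale-up itself. Minimum: 0. Default: 0 seconds.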
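
The defaults and constraints listed above can be checked before a manifest is applied. The following Python sketch is illustrative only: the AutoScalingSpec class and validate_spec function are hypothetical helpers, not part of the HyperPod inference operator, and they cover only the top-level fields described in this list.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingSpec:
    """Top-level autoScalingSpec fields, using the documented defaults."""
    minReplicaCount: int = 1
    maxReplicaCount: int = 5
    pollingInterval: int = 30
    cooldownPeriod: int = 300
    initialCooldownPeriod: int = 300
    scaleDownStabilizationTime: int = 300
    scaleUpStabilizationTime: int = 0


def validate_spec(spec: AutoScalingSpec) -> list:
    """Return a list of problems; an empty list means the spec looks valid."""
    problems = []
    if spec.minReplicaCount < 0:
        problems.append("minReplicaCount must be >= 0")
    if spec.maxReplicaCount < spec.minReplicaCount:
        problems.append("maxReplicaCount must be >= minReplicaCount")
    # Every interval field has a documented minimum of 0.
    for name in ("pollingInterval", "cooldownPeriod", "initialCooldownPeriod",
                 "scaleDownStabilizationTime", "scaleUpStabilizationTime"):
        if getattr(spec, name) < 0:
            problems.append(f"{name} must be >= 0")
    return problems


# Values taken from the sample deployment YAML in this section.
sample = AutoScalingSpec(minReplicaCount=0, maxReplicaCount=5, pollingInterval=15,
                         initialCooldownPeriod=60, cooldownPeriod=120,
                         scaleDownStabilizationTime=60, scaleUpStabilizationTime=0)
print(validate_spec(sample))
```

A check like this catches only the constraints stated in the field descriptions; the operator may enforce additional rules.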

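cloudWatchTrigger

The trigger configuration for CloudWatch metrics used in autoscaling decisions. The following fields are available in cloudWatchTrigger:

name (Optional, String) - Name for the CloudWatch trigger. If not provided, the default format is used: <model-deployment-name>-scaled-object-cloudwatch-trigger.
useCachedMetrics (Optional, Boolean) - Determines whether to cache the metrics queried by KEDA. KEDA queries metrics at the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, the queried metrics are cached and used to serve HPA requests. Default: true.
namespace (Required, String) - The CloudWatch namespace of the metric to query.
metricName (Required, String) - The name of the CloudWatch metric.
dimensions (Optional, List) - The list of dimensions for the metric. Each dimension includes a name (dimension name - String) and a value (dimension value - String).
targetValue (Required, Float) - The target value of the CloudWatch metric used in autoscaling decisions.
activationTargetValue (Optional, Float) - The target value of the CloudWatch metric used when scaling from 0 to 1 pod. Only applies when minReplicaCount is set to 0. Default: 0.
minValue (Optional, Float) - The value to use when the CloudWatch query returns no data. Default: 0.
metricCollectionStartTime (Optional, Integer) - The start time for the metric query, calculated as T-metricCollectionStartTime. Must be greater than or equal to metricCollectionPeriod. Default: 300 seconds.
metricCollectionPeriod (Optional, Integer) - The duration of the metric query, in seconds. Must be a value that CloudWatch supports (1, 5, 10, 30, or a multiple of 60). Default: 300 seconds.
metricStat (Optional, String) - The statistic type for the CloudWatch query. Default: Average.
metricType (Optional, String) - Defines how the metric is used in scaling calculations. Default: Average. Allowed values: Average, Value.
- Average: Desired replicas = ceil(metric value / targetValue)
- Value: Desired replicas = (current replicas) × ceil(metric value / targetValue)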
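
As a worked illustration of the two metricType formulas, the following Python sketch evaluates them literally as written above, reading the formula as ceil(metric value / targetValue). The function names desired_replicas and clamp_replicas are hypothetical, and the clamping to minReplicaCount and maxReplicaCount is an assumption based on the field descriptions. Kubernetes documents the underlying HPA calculation as ceil(currentReplicas × currentMetricValue / desiredMetricValue), so for the Value type the replica count that the cluster actually computes may differ from this literal reading.

```python
import math


def desired_replicas(metric_value, target_value, current_replicas,
                     metric_type="Average"):
    """Evaluate the documented formula for the given metricType."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = math.ceil(metric_value / target_value)
    if metric_type == "Average":
        return ratio
    if metric_type == "Value":
        return current_replicas * ratio
    raise ValueError(f"unsupported metricType: {metric_type}")


def clamp_replicas(desired, min_replicas, max_replicas):
    """Keep the result inside [minReplicaCount, maxReplicaCount] (assumed)."""
    return max(min_replicas, min(desired, max_replicas))


# 45 invocations against targetValue 10.5, as in the sample cloudWatchTrigger.
average = desired_replicas(45.0, 10.5, current_replicas=2, metric_type="Average")
value = desired_replicas(21.0, 10.5, current_replicas=2, metric_type="Value")
print(average, value, clamp_replicas(7, 0, 5))
```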

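prometheusTrigger

The trigger configuration for Amazon Managed Prometheus (AMP) metrics used in autoscaling decisions. The following fields are available in prometheusTrigger:

name (Optional, String) - Name for the Prometheus trigger. If not provided, the default format is used: <model-deployment-name>-scaled-object-cloudwatch-trigger.
useCachedMetrics (Optional, Boolean) - Determines whether to cache the metrics queried by KEDA. KEDA queries metrics at the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, the queried metrics are cached and used to serve HPA requests. Default: true.
serverAddress (Required, String) - The address of the AMP server. Must use the format: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace_id>
query (Required, String) - The PromQL query used for the metric. Must return a scalar value.
targetValue (Required, Float) - The target value of the metric used in autoscaling decisions.
activationTargetValue (Optional, Float) - The target value of the metric used when scaling from 0 to 1 pod. Only applies when minReplicaCount is set to 0. Default: 0.
namespace (Optional, String) - The namespace to use for namespaced queries. Default: empty string ("").
customHeaders (Optional, String) - Custom headers to include when querying the Prometheus endpoint. Default: empty string ("").
metricType (Optional, String) - Defines how the metric is used in scaling calculations. Default: Average. Allowed values: Average, Value.
- Average: Desired replicas = ceil(metric value / targetValue)
- Value: Desired replicas = (current replicas) × ceil(metric value / targetValue)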
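
The serverAddress format above can be built and checked programmatically. This Python sketch is a minimal illustration: the helper names and the workspace ID ws-1234 are hypothetical, and the regular expression encodes only the pattern stated in the field description, not an official validation rule.

```python
import re

# Pattern derived from the documented format:
# https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace_id>
AMP_ADDRESS = re.compile(
    r"^https://aps-workspaces\.[a-z0-9-]+\.amazonaws\.com/workspaces/[^/\s]+/?$"
)


def amp_server_address(region, workspace_id):
    """Build a serverAddress string in the documented AMP format."""
    return f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"


def is_amp_server_address(address):
    """Return True when the address matches the documented AMP format."""
    return AMP_ADDRESS.match(address) is not None


address = amp_server_address("us-west-2", "ws-1234")  # hypothetical workspace ID
print(address, is_amp_server_address(address))
```

A generic placeholder such as http://<prometheus-host>:9090 does not match this pattern, so replace it with your AMP workspace address when you use prometheusTrigger.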

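Using KEDA ScaledObject yaml definitions through kubectl

In addition to configuring autoscaling through the autoScalingSpec section in your deployment YAML, you can create and apply standalone KEDA ScaledObject YAML definitions using kubectl.

This approach provides greater flexibility for complex scaling scenarios and lets you manage autoscaling policies independently from your model deployments. KEDA ScaledObject configurations support a wide range of scaling triggers, including CloudWatch metrics, Amazon SQS queue length, Prometheus queries, and resource-based metrics such as CPU and memory utilization. You can apply these configurations to an existing model deployment by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification.

Note

Make sure that the KEDA operator role provided during the HyperPod inference operator installation has sufficient permissions to query the metrics defined in the scaled object triggers.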
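
The sample manifests in this section use shell-style placeholders such as $DEPLOYMENT_NAME, $ENDPOINT_NAME, and $VARIANT_NAME. One way to fill them in before applying a manifest is Python's string.Template, sketched below. The fragment shows only the lines that contain placeholders, and the names my-deployment and my-endpoint are hypothetical.

```python
from string import Template

# Fragment of a ScaledObject manifest using the same placeholder names
# as the samples in this section. Only placeholder-bearing lines are shown.
MANIFEST_FRAGMENT = Template("""\
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME
  triggers:
  - type: aws-cloudwatch
    metadata:
      dimensionName: EndpointName;VariantName
      dimensionValue: $ENDPOINT_NAME;$VARIANT_NAME
""")


def render_manifest(deployment_name, endpoint_name, variant_name):
    """Substitute every placeholder; raises KeyError if one is missing."""
    return MANIFEST_FRAGMENT.substitute(
        DEPLOYMENT_NAME=deployment_name,
        ENDPOINT_NAME=endpoint_name,
        VARIANT_NAME=variant_name,
    )


rendered = render_manifest("my-deployment", "my-endpoint", "AllTraffic")
print(rendered)
```

After rendering the full manifest to a file, you can apply it with kubectl apply -f followed by the file name.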

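CloudWatch metrics

The following KEDA yaml policy uses CloudWatch metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the number of invocations for a SageMaker endpoint and scales the number of deployment pods. The complete list of parameters that KEDA supports for the aws-cloudwatch trigger is available at https://keda.sh/docs/2.17/scalers/aws-cloudwatch/.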


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-cloudwatch
    metadata:
      namespace: AWS/SageMaker
      metricName: Invocations
      targetMetricValue: "1"
      minMetricValue: "1"
      awsRegion: "us-west-2"
      dimensionName: EndpointName;VariantName
      dimensionValue: $ENDPOINT_NAME;$VARIANT_NAME
      metricStatPeriod: "30" # seconds
      metricStat: "Sum"
      identityOwner: operator
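
In the aws-cloudwatch trigger above, multiple dimensions are passed as two semicolon-separated strings whose entries pair up by position. The following Python sketch shows one way to derive both strings from a single mapping so that they stay aligned; the function name and the endpoint name my-endpoint are hypothetical.

```python
def cloudwatch_dimension_metadata(dimensions):
    """Convert {name: value} into KEDA's paired semicolon-separated strings."""
    for name, value in dimensions.items():
        # A semicolon inside a name or value would break the positional pairing.
        if ";" in name or ";" in value:
            raise ValueError(f"dimension {name!r} must not contain ';'")
    return {
        "dimensionName": ";".join(dimensions.keys()),
        "dimensionValue": ";".join(dimensions.values()),
    }


metadata = cloudwatch_dimension_metadata({
    "EndpointName": "my-endpoint",  # hypothetical endpoint name
    "VariantName": "AllTraffic",
})
print(metadata)
```

Building both strings from one mapping avoids the case where a name is added to dimensionName without a matching entry in dimensionValue.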

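Amazon SQS metrics

The following KEDA yaml policy uses Amazon SQS metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the length of an Amazon SQS queue and scales the number of deployment pods. The complete list of parameters that KEDA supports for the aws-sqs-queue trigger is available at https://keda.sh/docs/2.17/scalers/aws-sqs/.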


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/QueueName
      queueLength: "5"  # Default: "5"
      awsRegion: "eu-west-1" # must match the Region in queueURL
      scaleOnInFlight: true
      identityOwner: operator

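Prometheus metrics

The following KEDA yaml policy uses Prometheus metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy evaluates a PromQL query for the deployment and scales the number of deployment pods. The complete list of parameters that KEDA supports for the prometheus trigger is available at https://keda.sh/docs/2.17/scalers/prometheus/.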


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: avg(rate(http_requests_total{deployment="$DEPLOYMENT_NAME"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      namespace: example-namespace  # for namespaced queries, eg. Thanos
      customHeaders: X-Client-Id=cid,X-Tenant-Id=tid,X-Organization-Id=oid # Optional. Custom headers to include in query. In case of auth header, use the custom authentication or relevant authModes.
      unsafeSsl: "false" #  Default is `false`, Used for skipping certificate check when having self-signed certs for Prometheus endpoint    
      timeout: 1000 # Custom timeout for the HTTP client used in this scaler
      identityOwner: operator

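CPU metrics

The following KEDA yaml policy uses the CPU metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on CPU utilization. The complete list of parameters that KEDA supports for the cpu trigger is available at https://keda.sh/docs/2.17/scalers/cpu/.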


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: cpu
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container

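Memory metrics

The following KEDA yaml policy uses the memory metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy scales the number of deployment pods based on memory utilization. The complete list of parameters that KEDA supports for the memory trigger is available at https://keda.sh/docs/2.17/scalers/memory/.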


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: memory
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container in a pod

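Sample Prometheus policy for scaling down to 0 pods

The following KEDA yaml policy uses a Prometheus metric query as a trigger to perform autoscaling on a Kubernetes deployment. The policy sets minReplicaCount to 0, which lets KEDA scale the deployment down to 0 pods. When minReplicaCount is set to 0, you must provide an activation criterion so that the first pod can start after the pods scale down to 0. For the Prometheus trigger, this value is provided by activationThreshold. For an SQS queue, it comes from activationQueueLength.

Note

When you use a minReplicaCount of 0, make sure that activation doesn't depend on a metric that the pods themselves generate. After the pods scale down to 0, that metric is never generated and the pods won't scale up again.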


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 0 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  cooldownPeriod:  30
  initialCooldownPeriod:  180 # time before scaling down the pods after initial deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      activationThreshold: '5.5' # Required if minReplicaCount is 0 for initial scaling
      namespace: example-namespace
      timeout: 1000
      identityOwner: operator
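
The interaction between activationThreshold and threshold in the policy above can be summarized with a small Python sketch. It is a simplified model rather than KEDA's implementation: the function name and the returned strings are hypothetical, and it assumes a single trigger. KEDA treats a trigger as active only when the metric is strictly greater than the activation value.

```python
def scale_to_zero_decision(current_replicas, metric_value, activation_threshold):
    """Describe what happens next for a ScaledObject with minReplicaCount 0."""
    # Activation is a strict comparison: equal to the threshold is not active.
    active = metric_value > activation_threshold
    if current_replicas == 0:
        return "scale from 0 to 1" if active else "stay at 0"
    if not active:
        return "scale to 0 after cooldownPeriod"
    # Between 1 and maxReplicaCount, the HPA scales using `threshold`.
    return "HPA scales using threshold"


# activationThreshold is 5.5 in the sample policy.
for replicas, metric in [(0, 5.5), (0, 6.0), (2, 3.0), (2, 250.0)]:
    print(replicas, metric, scale_to_zero_decision(replicas, metric, 5.5))
```

This model also shows why the activation metric must not come from the pods themselves: with 0 replicas and no metric above the activation value, the first branch never returns a scale-up.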

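Note

CPU and memory triggers can scale to 0 only when you define at least one additional scaler that is not CPU or memory (for example, SQS + CPU or Prometheus + CPU).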
