
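Benchmarking generative AI inference endpoints

The SageMaker AI benchmarking service measures the performance of large language models (LLMs) hosted on SageMaker AI endpoints. It runs benchmarks with NVIDIA AIPerf and produces metrics such as request latency, throughput, time to first token, and inter-token latency.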

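Prerequisites

Before you create a benchmark job, you need the following:

A SageMaker AI endpoint in the InService state that hosts an LLM supporting the OpenAI-compatible chat completions API
An Amazon S3 bucket for benchmark output
An IAM execution role that grants SageMaker AI permission to access the endpoint and the output bucket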
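
The first prerequisite can be checked programmatically before you create a job. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in. The helper name `endpoint_is_in_service` is hypothetical; `describe_endpoint` and its `EndpointStatus` field belong to the standard SageMaker API.

```python
def endpoint_is_in_service(client, endpoint_name):
    """Return True when the endpoint reports the InService status.

    `client` is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"). The helper name is hypothetical.
    """
    response = client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"] == "InService"


# Example usage (requires AWS credentials):
# import boto3
# client = boto3.client("sagemaker", region_name="us-west-2")
# print(endpoint_is_in_service(client, "my-sagemaker-endpoint"))
```

Taking the client as an argument keeps the check easy to reuse in scripts that already hold a client.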

Step 1: Create a benchmark job

A benchmark job targets a specific SageMaker AI endpoint and references a workload configuration.
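
The job examples in this step reference a workload configuration named my-benchmark-config, so that configuration has to exist before the job is created. The following is a minimal sketch of creating one, assuming the same parameter names that the workload configuration examples later on this page use. The values and the two helper names are illustrative placeholders, not recommendations.

```python
import json


def build_text_workload_spec(concurrency=10):
    """Build a text-only workload spec (all values are placeholders).

    The later examples on this page also pass an hf_token secret; it is
    omitted here for brevity.
    """
    return {
        "benchmark": {"type": "aiperf"},
        "parameters": {
            "prompt_input_tokens_mean": 550,
            "output_tokens_mean": 150,
            "concurrency": concurrency,
            "streaming": True,
            "tokenizer": "meta-llama/Llama-3.2-1B",
        },
        "tooling": {"api_standard": "openai"},
    }


def create_workload_config(client, name, spec):
    """Register the spec as an inline workload configuration."""
    return client.create_ai_workload_config(
        AIWorkloadConfigName=name,
        WorkloadSpec={"Inline": json.dumps(spec)},
    )


# Example usage (requires AWS credentials):
# import boto3
# client = boto3.client("sagemaker", region_name="us-west-2")
# create_workload_config(client, "my-benchmark-config", build_text_workload_spec())
```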

Python (boto3)



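import boto3

# Create a SageMaker client; the Region matches the AWS CLI example below.
client = boto3.client("sagemaker", region_name="us-west-2")

response = client.create_ai_benchmark_job(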
    AIBenchmarkJobName="my-benchmark-job",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-sagemaker-endpoint"
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"
    },
    AIWorkloadConfigIdentifier="my-benchmark-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)
print(response["AIBenchmarkJobArn"])

AWS CLI



aws sagemaker create-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --benchmark-target '{"Endpoint": {"Identifier": "my-sagemaker-endpoint"}}' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"}' \
  --ai-workload-config-identifier "my-benchmark-config" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2

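If your endpoint hosts multiple models through inference components, you can specify them in the InferenceComponents parameter of BenchmarkTarget.

If your endpoint is in a VPC, pass the NetworkConfig parameter with your VpcConfig settings, including security group IDs and subnets.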
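
This page does not show the exact shape of NetworkConfig, so the following is only a sketch under stated assumptions. It assumes the nested keys are named `SecurityGroupIds` and `Subnets`, as in the VpcConfig structure used elsewhere in the SageMaker API. The helper name and the resource IDs are hypothetical. Check the API reference before relying on these key names.

```python
def build_network_config(security_group_ids, subnets):
    """Assemble a NetworkConfig value for a VPC-hosted endpoint.

    ASSUMPTION: the key names "SecurityGroupIds" and "Subnets" follow the
    VpcConfig shape used elsewhere in the SageMaker API. They are not
    confirmed by this page.
    """
    if not security_group_ids or not subnets:
        raise ValueError("Provide at least one security group ID and one subnet")
    return {
        "VpcConfig": {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnets),
        }
    }


# Placeholder IDs for illustration only.
network_config = build_network_config(
    ["sg-0123456789abcdef0"],
    ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
)
print(network_config)

# The result would be passed as NetworkConfig=network_config to
# client.create_ai_benchmark_job(...).
```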

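Benchmark inference components

If your endpoint uses inference components instead of a directly deployed model, you must specify the inference components to benchmark in BenchmarkTarget. When you specify inference components, the benchmarking service routes requests to those specific components rather than to the endpoint's default model.

Pass one or more inference component names or ARNs in the InferenceComponents list: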

Python (boto3)



response = client.create_ai_benchmark_job(
    AIBenchmarkJobName="my-ic-benchmark",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-multi-model-endpoint",
            "InferenceComponents": [
                {"Identifier": "my-inference-component-llama"}
            ]
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"
    },
    AIWorkloadConfigIdentifier="my-benchmark-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

AWS CLI



aws sagemaker create-ai-benchmark-job \
  --ai-benchmark-job-name "my-ic-benchmark" \
  --benchmark-target '{
    "Endpoint": {
      "Identifier": "my-multi-model-endpoint",
      "InferenceComponents": [
        {"Identifier": "my-inference-component-llama"}
      ]
    }
  }' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"}' \
  --ai-workload-config-identifier "my-benchmark-config" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2

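Note

If your endpoint is configured for inference components but you don't specify InferenceComponents in the benchmark target, the job fails with an error indicating that no model is deployed directly on the endpoint. Always include the InferenceComponents parameter when you benchmark inference component-based endpoints.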
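
The two target shapes used so far differ only in whether InferenceComponents is present. A small helper can make that choice explicit. This is a sketch only, and `build_benchmark_target` is a hypothetical name; the dictionary keys are the ones shown in the examples on this page.

```python
def build_benchmark_target(endpoint, inference_components=None):
    """Build a BenchmarkTarget value.

    Pass inference component names or ARNs for endpoints that use
    inference components; omit them for directly deployed models.
    """
    target = {"Endpoint": {"Identifier": endpoint}}
    if inference_components:
        target["Endpoint"]["InferenceComponents"] = [
            {"Identifier": name} for name in inference_components
        ]
    return target


print(build_benchmark_target("my-sagemaker-endpoint"))
print(
    build_benchmark_target(
        "my-multi-model-endpoint", ["my-inference-component-llama"]
    )
)
```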

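Benchmark multi-LoRA endpoints

To benchmark an endpoint that serves multiple LoRA adapters, specify each adapter as an inference component in BenchmarkTarget. You can optionally use the model_selection_strategy workload parameter to control how the benchmark distributes requests across adapters. If you don't specify a strategy, it defaults to round_robin.

First, create a workload configuration. The following example includes the optional model_selection_strategy parameter: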



# Create a workload config for multi-LoRA benchmarking
workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 550,
        "output_tokens_mean": 150,
        "concurrency": 10,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B",
        "model_selection_strategy": "round_robin"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    },
    "tooling": {"api_standard": "openai"}
}

import json
client.create_ai_workload_config(
    AIWorkloadConfigName="multi-lora-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

Then, create a benchmark job that targets all of the LoRA adapter inference components:



response = client.create_ai_benchmark_job(
    AIBenchmarkJobName="multi-lora-benchmark",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-lora-endpoint",
            "InferenceComponents": [
                {"Identifier": "lora-adapter-customer-support"},
                {"Identifier": "lora-adapter-code-generation"},
                {"Identifier": "lora-adapter-summarization"}
            ]
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/multi-lora-results/"
    },
    AIWorkloadConfigIdentifier="multi-lora-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

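The model_selection_strategy parameter is optional and determines how the benchmarking tool distributes requests across the specified inference components. Valid values are:

round_robin (default): Each adapter receives requests in turn. The nth request is sent to adapter number (n mod number-of-models).
random: Each request is assigned to an adapter uniformly at random.

If you don't specify model_selection_strategy, the benchmark uses round_robin by default.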
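
To make the difference between the two strategies concrete, the following sketch simulates both over nine requests. It illustrates the selection rules described above and is not AIPerf's implementation. Zero-based request numbering and the helper names are assumptions.

```python
import random
from collections import Counter


def pick_round_robin(request_index, adapters):
    """Select the adapter at position (request_index mod number of adapters)."""
    return adapters[request_index % len(adapters)]


def pick_random(adapters, rng):
    """Select an adapter uniformly at random."""
    return rng.choice(adapters)


adapters = [
    "lora-adapter-customer-support",
    "lora-adapter-code-generation",
    "lora-adapter-summarization",
]

# Round robin: nine requests over three adapters gives each adapter three.
round_robin_counts = Counter(pick_round_robin(n, adapters) for n in range(9))

# Random: counts vary from run to run unless the generator is seeded.
rng = random.Random(0)
random_counts = Counter(pick_random(adapters, rng) for _ in range(9))

print("round_robin:", dict(round_robin_counts))
print("random:     ", dict(random_counts))
```

Round robin gives an even per-adapter load, which makes per-adapter comparisons simpler. Random selection approaches an even split only over many requests.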

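Benchmark multimodal endpoints with synthetic images

You can benchmark vision-language models by generating synthetic images in your workload configuration. The benchmarking service uses AIPerf to create images with configurable dimensions and format, and then sends them to your endpoint as base64-encoded payloads.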
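
Base64 encoding enlarges a payload by about one third, which matters when you estimate request sizes. The following sketch shows that overhead with Python's standard library. The random bytes stand in for an encoded image file, and `base64_length` is a hypothetical helper name.

```python
import base64
import math
import os


def base64_length(raw_length):
    """Length of the padded base64 encoding of raw_length bytes."""
    return 4 * math.ceil(raw_length / 3)


raw = os.urandom(30_000)          # stand-in for an encoded PNG or JPEG file
encoded = base64.b64encode(raw)   # what a base64 payload would carry

print(len(raw), len(encoded), base64_length(len(raw)))
```

Every 3 input bytes become 4 output characters, so 30,000 bytes encode to 40,000 characters.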

The following example creates a workload configuration that benchmarks a vision-language model with synthetic images:



import json

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "image_width_mean": 640,
        "image_height_mean": 480,
        "prompt_input_tokens_mean": 100,
        "output_tokens_mean": 150,
        "concurrency": 8,
        "request_count": 100,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    }
}

client.create_ai_workload_config(
    AIWorkloadConfigName="image-benchmark-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

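The following parameters control synthetic image generation:

Parameter	Type	Default	Description
`image_width_mean`	float	None	Mean image width, in pixels.
`image_width_stddev`	float	None	Standard deviation of the image width. Set this to vary image dimensions across requests.
`image_height_mean`	float	None	Mean image height, in pixels.
`image_height_stddev`	float	None	Standard deviation of the image height.
`image_batch_size`	int	1	Number of images per request.
`image_format`	string	png	Image format. Valid values: `png` (lossless), `jpeg` (lossy, smaller files), `random` (randomly selects PNG or JPEG for each image).

Variable-size images

Use the standard deviation parameters to generate images of varying dimensions, simulating real-world workloads in which image sizes differ: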



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "image_width_mean": 800,
        "image_width_stddev": 200,
        "image_height_mean": 600,
        "image_height_stddev": 150,
        "image_batch_size": 2,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}
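
To get a feel for the spread that a given mean and standard deviation produce, you can sample dimensions locally. The following sketch assumes a normal distribution clamped to at least 1 pixel. This page does not describe how AIPerf actually samples dimensions, so treat the distribution, the clamping, and the helper name as assumptions.

```python
import random


def sample_image_sizes(width_mean, width_stddev, height_mean, height_stddev,
                       count, seed=0):
    """Sample (width, height) pairs from independent normal distributions.

    ASSUMPTION: normal sampling with a 1-pixel floor; the real generator
    may behave differently.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(count):
        width = max(1, round(rng.gauss(width_mean, width_stddev)))
        height = max(1, round(rng.gauss(height_mean, height_stddev)))
        sizes.append((width, height))
    return sizes


# Same means and standard deviations as the workload spec above.
for width, height in sample_image_sizes(800, 200, 600, 150, count=5):
    print(f"{width}x{height}")
```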

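Benchmark multimodal endpoints with synthetic videos

You can benchmark multimodal models that process video input by generating synthetic videos in your workload configuration. The benchmarking service uses AIPerf's synthetic video generation to create videos with configurable resolution, frame rate, duration, and encoding, and then sends them to your endpoint as base64-encoded payloads.

Note

Video generation is disabled by default. You must specify both video_width and video_height in your workload configuration to enable it.

The following example creates a workload configuration that benchmarks a multimodal model with synthetic video at 640×480 resolution: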



import json

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "output_tokens_mean": 150,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    }
}

client.create_ai_workload_config(
    AIWorkloadConfigName="video-benchmark-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

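Video parameters

The following parameters control synthetic video generation:

Parameter	Type	Default	Description
`video_width`	int	None	Frame width, in pixels. Must be set together with `video_height` to enable video generation.
`video_height`	int	None	Frame height, in pixels. Must be set together with `video_width` to enable video generation.
`video_fps`	int	4	Frames per second.
`video_duration`	float	5.0	Clip duration, in seconds.
`video_batch_size`	int	1	Number of videos per request.
`video_synth_type`	string	moving_shapes	Synthesis mode. Valid values: `moving_shapes` (animated geometric shapes), `grid_clock` (grid with a clock animation), `noise` (random pixel noise).
`video_format`	string	webm	Container format. Valid values: `webm`.
`video_codec`	string	libvpx-vp9	Video codec. Supported values: `libvpx-vp9` (VP9, WebM).

Note

The benchmarking service supports only VP9 encoding in the WebM format.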
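
Because a misconfigured spec is only rejected once it reaches the service, a local check can catch common mistakes earlier. The following sketch encodes the constraints from the table and notes above. The helper name and the messages are hypothetical, and the service may apply further validation that is not listed on this page.

```python
VALID_SYNTH_TYPES = {"moving_shapes", "grid_clock", "noise"}


def check_video_parameters(params):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # Video generation needs both dimensions, so one without the other
    # is almost certainly a mistake.
    if ("video_width" in params) != ("video_height" in params):
        problems.append("video_width and video_height must be set together")

    if params.get("video_format", "webm") != "webm":
        problems.append("video_format must be webm")

    if params.get("video_codec", "libvpx-vp9") != "libvpx-vp9":
        problems.append("video_codec must be libvpx-vp9")

    if params.get("video_synth_type", "moving_shapes") not in VALID_SYNTH_TYPES:
        problems.append("video_synth_type is not a listed value")

    return problems


print(check_video_parameters({"video_width": 640, "video_height": 480}))
print(check_video_parameters({"video_width": 640, "video_codec": "h264"}))
```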

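Embedded audio track

For models that process both video and audio, you can embed a synthetic audio track in the generated video. Audio is disabled by default. Set video_audio_num_channels to 1 (mono) or 2 (stereo) to enable it.

Parameter	Type	Default	Description
`video_audio_num_channels`	int	0	0 = disabled, 1 = mono, 2 = stereo.
`video_audio_sample_rate`	int	44100	Sample rate, in Hz (8000–96000).
`video_audio_codec`	string	auto	Audio codec. Automatically selects `libvorbis` for WebM and `aac` for MP4. You can override it with `aac`, `libvorbis`, or `libopus`.
`video_audio_depth`	int	16	Bit depth per sample (8, 16, 24, or 32).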
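
The ranges in the audio table can be checked locally in the same way. This is a sketch that mirrors only the documented ranges and defaults; the helper name and the messages are hypothetical.

```python
VALID_AUDIO_CODECS = {"auto", "aac", "libvorbis", "libopus"}
VALID_AUDIO_DEPTHS = {8, 16, 24, 32}


def check_audio_parameters(params):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    if params.get("video_audio_num_channels", 0) not in (0, 1, 2):
        problems.append("video_audio_num_channels must be 0, 1, or 2")

    sample_rate = params.get("video_audio_sample_rate", 44100)
    if not 8000 <= sample_rate <= 96000:
        problems.append("video_audio_sample_rate must be 8000-96000 Hz")

    if params.get("video_audio_depth", 16) not in VALID_AUDIO_DEPTHS:
        problems.append("video_audio_depth must be 8, 16, 24, or 32")

    if params.get("video_audio_codec", "auto") not in VALID_AUDIO_CODECS:
        problems.append("video_audio_codec is not a listed value")

    return problems


print(check_audio_parameters(
    {"video_audio_num_channels": 1, "video_audio_sample_rate": 16000}
))
print(check_audio_parameters(
    {"video_audio_num_channels": 6, "video_audio_sample_rate": 4000}
))
```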

Video benchmarking examples

Low-resolution video understanding



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 320,
        "video_height": 240,
        "video_fps": 2,
        "video_duration": 3.0,
        "video_synth_type": "moving_shapes",
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

HD video benchmarking



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 1920,
        "video_height": 1080,
        "video_fps": 8,
        "video_duration": 10.0,
        "concurrency": 2,
        "request_count": 20,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

Audio and video for multimodal models



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "video_audio_num_channels": 1,
        "video_audio_sample_rate": 16000,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

Mixed text and video

Combine video with text prompts for video question-answering or captioning workloads:



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "prompt_input_tokens_mean": 100,
        "output_tokens_mean": 50,
        "concurrency": 8,
        "request_count": 100,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

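Performance considerations

Higher resolutions and frame rates increase video encoding time and payload size. For high-throughput tests, use lower resolutions (320×240 or 640×480).
VP9 (libvpx-vp9) in the WebM format is the only supported codec and provides good compression for benchmark payloads.
Audio adds minimal overhead compared to the video stream. For speech-focused workloads, use mono (1 channel) at 16 kHz.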
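
A rough way to compare configurations is to look at the uncompressed data each one represents. The following sketch does that arithmetic for the low-resolution and HD examples above and for a 16 kHz mono audio track. It assumes 3 bytes per pixel (24-bit RGB) and ignores compression, so encoded VP9 payloads will be far smaller. Only the relative scale is meaningful.

```python
def raw_video_bytes(width, height, fps, duration, bytes_per_pixel=3):
    """Uncompressed clip size, assuming 24-bit RGB frames."""
    frames = int(fps * duration)
    return width * height * bytes_per_pixel * frames


def raw_audio_bytes(sample_rate, channels, depth_bits, duration):
    """Uncompressed PCM audio size."""
    return int(sample_rate * channels * (depth_bits // 8) * duration)


low_res = raw_video_bytes(320, 240, fps=2, duration=3.0)
hd = raw_video_bytes(1920, 1080, fps=8, duration=10.0)
audio = raw_audio_bytes(16000, channels=1, depth_bits=16, duration=5.0)

print(f"low-res clip: {low_res:,} bytes")
print(f"HD clip:      {hd:,} bytes ({hd // low_res}x the low-res clip)")
print(f"audio track:  {audio:,} bytes")
```

Under these assumptions the HD clip carries 360 times the raw data of the low-resolution clip, and the 5-second audio track carries 160,000 bytes.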

Step 2: Monitor job status

Poll the job status until it reaches a terminal state.

Python (boto3)



import time

while True:
    response = client.describe_ai_benchmark_job(
        AIBenchmarkJobName="my-benchmark-job"
    )
    status = response["AIBenchmarkJobStatus"]
    print(f"Status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)

if status == "Completed":
    print(f"Results at: {response['OutputConfig']['S3OutputLocation']}")
elif status == "Failed":
    print(f"Job failed: {response.get('FailureReason', 'unknown')}")

AWS CLI



aws sagemaker describe-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --region us-west-2
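
For unattended scripts, an unbounded polling loop can hang if a job never reaches a terminal state. The following sketch wraps the same describe call in a helper with a timeout. The helper name, the default intervals, and the injectable `sleep` argument are assumptions made for illustration; the terminal states are the ones used in the loop above.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_benchmark_job(client, job_name, poll_seconds=30,
                           timeout_seconds=3600, sleep=time.sleep):
    """Poll until the job reaches a terminal state, or raise TimeoutError.

    Returns the final describe response so callers can read the status,
    output location, or failure reason.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_ai_benchmark_job(AIBenchmarkJobName=job_name)
        if response["AIBenchmarkJobStatus"] in TERMINAL_STATES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{job_name} did not finish within {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Example usage (requires AWS credentials):
# import boto3
# client = boto3.client("sagemaker", region_name="us-west-2")
# final = wait_for_benchmark_job(client, "my-benchmark-job")
# print(final["AIBenchmarkJobStatus"])
```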

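Step 3: Review benchmark results

Benchmark results are written to the Amazon S3 output location that you specified. The results include the following key metrics:

request_throughput: Requests per second.
request_latency: End-to-end request latency, with a percentile breakdown (P50, P90, P99).
time_to_first_token: Time from submitting the request to receiving the first token.
inter_token_latency: Time between consecutive output tokens.
output_token_throughput: Output tokens generated per second.

Each metric includes a statistical summary: mean, minimum, maximum, P50, P90, P99, and standard deviation.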
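
To clarify what such a summary contains, the following sketch computes the same seven statistics from a list of samples with Python's standard library. The latency values are made up for illustration and are not benchmark results. The percentile interpolation method and the use of the sample standard deviation are assumptions; the service may compute them differently.

```python
import statistics


def summarize(samples):
    """Compute mean, min, max, P50, P90, P99, and standard deviation."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the kth percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "min": min(samples),
        "max": max(samples),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "stddev": statistics.stdev(samples),
    }


# Made-up request latencies in milliseconds, for illustration only.
latencies_ms = [112.0, 98.5, 130.2, 101.7, 250.9, 99.3, 105.4, 118.8]

for name, value in summarize(latencies_ms).items():
    print(f"{name}: {value:.1f}")
```

A single slow request pulls P99 and the maximum well above the median, which is why the percentile breakdown is usually more informative than the mean alone.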

Manage benchmark resources

Use the following operations to manage your benchmark jobs and workload configurations.



# List benchmark jobs
response = client.list_ai_benchmark_jobs(MaxResults=10)
for job in response["AIBenchmarkJobs"]:
    print(f"{job['AIBenchmarkJobName']} - {job['AIBenchmarkJobStatus']}")

# Stop a running job
client.stop_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# Delete a job (must be in a terminal state)
client.delete_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# List workload configurations
response = client.list_ai_workload_configs(MaxResults=10)
for config in response["AIWorkloadConfigs"]:
    print(f"{config['AIWorkloadConfigName']} - {config['AIWorkloadConfigArn']}")

# Delete a workload configuration
client.delete_ai_workload_config(
    AIWorkloadConfigName="my-benchmark-config"
)
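
Because a job must be in a terminal state before it can be deleted, cleanup code usually has to stop the job and wait first. The following sketch combines the operations above in that order. The helper name, the polling interval, and the injectable `sleep` argument are assumptions, and the sketch has no timeout, so add one for unattended use.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def stop_and_delete_job(client, job_name, poll_seconds=10, sleep=time.sleep):
    """Stop the job if it is still running, wait for a terminal state, then delete it.

    Returns the terminal status that the job reached before deletion.
    """
    def current_status():
        response = client.describe_ai_benchmark_job(AIBenchmarkJobName=job_name)
        return response["AIBenchmarkJobStatus"]

    status = current_status()
    if status not in TERMINAL_STATES:
        client.stop_ai_benchmark_job(AIBenchmarkJobName=job_name)
        while status not in TERMINAL_STATES:
            sleep(poll_seconds)
            status = current_status()

    client.delete_ai_benchmark_job(AIBenchmarkJobName=job_name)
    return status


# Example usage (requires AWS credentials):
# import boto3
# client = boto3.client("sagemaker", region_name="us-west-2")
# stop_and_delete_job(client, "my-benchmark-job")
```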
