
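Benchmark generative AI inference endpoints

The SageMaker AI benchmarking service measures the performance of large language models (LLMs) hosted on SageMaker AI endpoints. It runs benchmarks with NVIDIA AIPerf and produces metrics such as request latency, throughput, time to first token, and inter-token latency.

Prerequisites

Before you create a benchmark job, you need the following:

A SageMaker AI endpoint in the InService state that hosts an LLM supporting the OpenAI-compatible chat completions API
An Amazon S3 bucket for benchmark output
An IAM execution role that grants SageMaker AI permission to access your endpoint and output bucket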
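
The first prerequisite can be checked before a job is submitted. The following is a minimal sketch that relies on the standard DescribeEndpoint operation. The helper name endpoint_is_in_service is hypothetical, and the client is passed in as an argument so that the function can be exercised without AWS credentials.

```python
def endpoint_is_in_service(sagemaker_client, endpoint_name):
    """Return True if the endpoint currently reports the InService status."""
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"] == "InService"


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker", region_name="us-west-2")
# print(endpoint_is_in_service(client, "my-sagemaker-endpoint"))
```

This check covers only the endpoint state. It does not confirm that the hosted model supports the chat completions API.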

Step 1: Create a benchmark job

A benchmark job targets a specific SageMaker AI endpoint and references a workload configuration.

Python (boto3)



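import boto3

# Create a SageMaker client in the Region that hosts the endpoint
client = boto3.client("sagemaker", region_name="us-west-2")

response = client.create_ai_benchmark_job(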
    AIBenchmarkJobName="my-benchmark-job",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-sagemaker-endpoint"
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"
    },
    AIWorkloadConfigIdentifier="my-benchmark-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)
print(response["AIBenchmarkJobArn"])

AWS CLI



aws sagemaker create-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --benchmark-target '{"Endpoint": {"Identifier": "my-sagemaker-endpoint"}}' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"}' \
  --ai-workload-config-identifier "my-benchmark-config" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2

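If your endpoint hosts multiple models through inference components, you can specify them in the InferenceComponents parameter of BenchmarkTarget.

If your endpoint is in a VPC, pass the NetworkConfig parameter with your VpcConfig settings, including security group IDs and subnets.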
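
As a rough sketch of the VPC case, the request can be assembled as a dictionary before it is passed to create_ai_benchmark_job. The helper build_benchmark_job_request is hypothetical. The field names SecurityGroupIds and Subnets are an assumption based on the VpcConfig shape used elsewhere in SageMaker, and the nesting of VpcConfig under NetworkConfig is inferred from the sentence above, so check both against the API reference.

```python
def build_benchmark_job_request(job_name, endpoint_name, output_s3_uri,
                                workload_config, role_arn,
                                security_group_ids=None, subnets=None):
    """Assemble keyword arguments for create_ai_benchmark_job."""
    request = {
        "AIBenchmarkJobName": job_name,
        "BenchmarkTarget": {"Endpoint": {"Identifier": endpoint_name}},
        "OutputConfig": {"S3OutputLocation": output_s3_uri},
        "AIWorkloadConfigIdentifier": workload_config,
        "RoleArn": role_arn,
    }
    # Attach NetworkConfig only when both VPC settings are supplied.
    if security_group_ids and subnets:
        request["NetworkConfig"] = {
            # Assumed field names, following the usual SageMaker VpcConfig shape.
            "VpcConfig": {
                "SecurityGroupIds": list(security_group_ids),
                "Subnets": list(subnets),
            }
        }
    return request


request = build_benchmark_job_request(
    job_name="my-vpc-benchmark",
    endpoint_name="my-sagemaker-endpoint",
    output_s3_uri="s3://DOC-EXAMPLE-BUCKET/benchmark-results/",
    workload_config="my-benchmark-config",
    role_arn="arn:aws:iam::111122223333:role/ExampleRole",
    security_group_ids=["sg-0123456789abcdef0"],
    subnets=["subnet-0123456789abcdef0"],
)
# The result could then be passed as client.create_ai_benchmark_job(**request).
```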

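Benchmark inference components

If your endpoint uses inference components instead of a directly deployed model, you must specify the inference components to benchmark in BenchmarkTarget. When you specify inference components, the benchmarking service routes requests to those components rather than to the endpoint's default model.

Pass one or more inference component names or ARNs in the InferenceComponents list: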

Python (boto3)



response = client.create_ai_benchmark_job(
    AIBenchmarkJobName="my-ic-benchmark",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-multi-model-endpoint",
            "InferenceComponents": [
                {"Identifier": "my-inference-component-llama"}
            ]
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"
    },
    AIWorkloadConfigIdentifier="my-benchmark-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

AWS CLI



aws sagemaker create-ai-benchmark-job \
  --ai-benchmark-job-name "my-ic-benchmark" \
  --benchmark-target '{
    "Endpoint": {
      "Identifier": "my-multi-model-endpoint",
      "InferenceComponents": [
        {"Identifier": "my-inference-component-llama"}
      ]
    }
  }' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"}' \
  --ai-workload-config-identifier "my-benchmark-config" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2

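Note

If your endpoint is configured for inference components but you don't specify InferenceComponents in the benchmark target, the job fails with an error message indicating that no model is deployed directly on the endpoint. Always include the InferenceComponents parameter when you benchmark an endpoint that is based on inference components.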
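
One way to avoid this failure is to look up the endpoint's inference components first and build the InferenceComponents list from the result. The sketch below uses the ListInferenceComponents operation with its EndpointNameEquals filter. The two helper names are hypothetical, and the client is passed in so that the logic can be tested with a stub.

```python
def list_inference_component_names(sagemaker_client, endpoint_name):
    """Return the names of all inference components on an endpoint."""
    names = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = sagemaker_client.list_inference_components(**kwargs)
        for component in page.get("InferenceComponents", []):
            names.append(component["InferenceComponentName"])
        # Follow pagination until no NextToken is returned.
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def to_benchmark_components(names):
    """Convert names into the InferenceComponents list used in BenchmarkTarget."""
    return [{"Identifier": name} for name in names]
```

An empty result suggests that the model is deployed directly on the endpoint, in which case InferenceComponents can be omitted.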

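Benchmark multi-LoRA endpoints

To benchmark an endpoint that serves multiple LoRA adapters, specify each adapter as an inference component in BenchmarkTarget. You can optionally use the model_selection_strategy workload parameter to control how the benchmark distributes requests across adapters. If you don't specify a strategy, it defaults to round_robin.

First, create a workload configuration. The following example includes the optional model_selection_strategy parameter: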



import json

# Create a workload config for multi-LoRA benchmarking
workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 550,
        "output_tokens_mean": 150,
        "concurrency": 10,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B",
        "model_selection_strategy": "round_robin"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    },
    "tooling": {"api_standard": "openai"}
}

client.create_ai_workload_config(
    AIWorkloadConfigName="multi-lora-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

Then create a benchmark job that targets all of the LoRA adapter inference components:



response = client.create_ai_benchmark_job(
    AIBenchmarkJobName="multi-lora-benchmark",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-lora-endpoint",
            "InferenceComponents": [
                {"Identifier": "lora-adapter-customer-support"},
                {"Identifier": "lora-adapter-code-generation"},
                {"Identifier": "lora-adapter-summarization"}
            ]
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/multi-lora-results/"
    },
    AIWorkloadConfigIdentifier="multi-lora-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

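The model_selection_strategy parameter is optional and determines how the benchmarking tool distributes requests across the specified inference components. Valid values are:

round_robin (default): Each adapter receives requests in turn. The nth request is sent to adapter number (n mod number of models).
random: Each request is assigned to an adapter uniformly at random.

If you don't specify model_selection_strategy, the benchmark uses round_robin by default.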
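
The two strategies can be illustrated with a small sketch. This is not the benchmarking tool's own code: the function select_adapter is hypothetical, and zero-based request numbering is an assumption.

```python
import random


def select_adapter(adapters, request_index, strategy="round_robin", rng=None):
    """Pick the adapter for a request under the given selection strategy."""
    if not adapters:
        raise ValueError("adapters must not be empty")
    if strategy == "round_robin":
        # Request n goes to adapter (n mod number of adapters).
        return adapters[request_index % len(adapters)]
    if strategy == "random":
        # Each adapter is equally likely, independent of the request index.
        return (rng or random).choice(adapters)
    raise ValueError(f"unknown strategy: {strategy}")


adapters = [
    "lora-adapter-customer-support",
    "lora-adapter-code-generation",
    "lora-adapter-summarization",
]
order = [select_adapter(adapters, n) for n in range(5)]
```

With three adapters, round_robin wraps back to the first adapter on the fourth request, so every adapter receives an almost equal share of any request count. With random, the shares converge only over a large number of requests.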

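Benchmark multimodal endpoints with synthetic images

You can benchmark vision language models by generating synthetic images as part of the workload configuration. The benchmarking service uses AIPerf to create images with configurable dimensions and format, and then sends them to your endpoint as base64-encoded payloads.

The following example creates a workload configuration for benchmarking a vision language model with synthetic images: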



import json

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "image_width_mean": 640,
        "image_height_mean": 480,
        "prompt_input_tokens_mean": 100,
        "output_tokens_mean": 150,
        "concurrency": 8,
        "request_count": 100,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    }
}

client.create_ai_workload_config(
    AIWorkloadConfigName="image-benchmark-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

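The following parameters control synthetic image generation:

Parameter	Type	Default	Description
`image_width_mean`	float	None	Mean image width, in pixels.
`image_width_stddev`	float	None	Standard deviation of the image width. Set this to vary image dimensions across requests.
`image_height_mean`	float	None	Mean image height, in pixels.
`image_height_stddev`	float	None	Standard deviation of the image height.
`image_batch_size`	int	1	Number of images per request.
`image_format`	string	png	Image format. Valid values: `png` (lossless), `jpeg` (lossy, smaller files), `random` (randomly selects PNG or JPEG for each image).

Variable-size images

Use the standard deviation parameters to generate images of varying dimensions, which simulates real-world workloads where image sizes differ: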



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "image_width_mean": 800,
        "image_width_stddev": 200,
        "image_height_mean": 600,
        "image_height_stddev": 150,
        "image_batch_size": 2,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}
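
To get a feel for the spread that the mean and standard deviation settings above imply, the dimensions can be simulated locally. This sketch assumes a normal distribution and a floor of one pixel. Neither is stated in this document, so treat the output as an illustration rather than a description of how the benchmarking tool samples sizes. The function name sample_image_sizes is hypothetical.

```python
import random


def sample_image_sizes(width_mean, width_stddev, height_mean, height_stddev,
                       count, seed=0):
    """Draw (width, height) pairs from independent normal distributions."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(count):
        # Round to whole pixels and keep every dimension at least 1.
        width = max(1, round(rng.gauss(width_mean, width_stddev)))
        height = max(1, round(rng.gauss(height_mean, height_stddev)))
        sizes.append((width, height))
    return sizes


sizes = sample_image_sizes(800, 200, 600, 150, count=1000)
mean_width = sum(w for w, _ in sizes) / len(sizes)
mean_height = sum(h for _, h in sizes) / len(sizes)
print(f"mean size: {mean_width:.0f} x {mean_height:.0f}")
print(f"largest width: {max(w for w, _ in sizes)}")
```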

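Benchmark multimodal endpoints with synthetic video

You can benchmark multimodal models that process video input by generating synthetic video as part of the workload configuration. The benchmarking service uses AIPerf's synthetic video generation to create videos with configurable resolution, frame rate, duration, and encoding, and then sends them to your endpoint as base64-encoded payloads.

Note

Video generation is disabled by default. You must specify both video_width and video_height in the workload configuration to enable it.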
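
Because video generation depends on two parameters being set together, a local check before the configuration is created can catch a half-specified resolution. The helper below is a hypothetical sketch. Raising an error when only one dimension is present is a choice made for this example, not documented service behavior.

```python
def video_generation_enabled(parameters):
    """Return True if both video dimensions are set, False if neither is."""
    has_width = parameters.get("video_width") is not None
    has_height = parameters.get("video_height") is not None
    if has_width != has_height:
        # One dimension without the other is almost certainly a mistake.
        raise ValueError("video_width and video_height must be set together")
    return has_width


print(video_generation_enabled({"video_width": 640, "video_height": 480}))
print(video_generation_enabled({"concurrency": 4}))
```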

The following example creates a workload configuration for benchmarking a multimodal model with synthetic video at 640×480 resolution:



import json

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "output_tokens_mean": 150,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    },
    "secrets": {
        "hf_token": "arn:aws:secretsmanager:us-west-2:111122223333:secret:my-hf-token-AbCdEf"
    }
}

client.create_ai_workload_config(
    AIWorkloadConfigName="video-benchmark-config",
    WorkloadSpec={"Inline": json.dumps(workload_spec)}
)

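Video parameters

The following parameters control synthetic video generation:

Parameter	Type	Default	Description
`video_width`	int	None	Frame width, in pixels. Must be set together with `video_height` to enable video generation.
`video_height`	int	None	Frame height, in pixels. Must be set together with `video_width` to enable video generation.
`video_fps`	int	4	Frames per second.
`video_duration`	float	5.0	Clip duration, in seconds.
`video_batch_size`	int	1	Number of videos per request.
`video_synth_type`	string	moving_shapes	Synthesis mode. Valid values: `moving_shapes` (animated geometric shapes), `grid_clock` (grid with a clock animation), `noise` (random pixel noise).
`video_format`	string	webm	Container format. Valid values: `webm`.
`video_codec`	string	libvpx-vp9	Video codec. Supported values: `libvpx-vp9` (VP9, WebM).

Note

The benchmarking service supports only VP9 encoding in the WebM format.

Embedded audio track

For models that process both video and audio, you can embed a synthetic audio track in the generated video. Audio is disabled by default. Set video_audio_num_channels to 1 (mono) or 2 (stereo) to enable it.

Parameter	Type	Default	Description
`video_audio_num_channels`	int	0	0 = disabled, 1 = mono, 2 = stereo.
`video_audio_sample_rate`	int	44100	Sample rate, in Hz (8000 to 96000).
`video_audio_codec`	string	auto	Audio codec. Auto-selects `libvorbis` for WebM and `aac` for MP4. Can be overridden with `aac`, `libvorbis`, or `libopus`.
`video_audio_depth`	int	16	Bit depth per sample (8, 16, 24, or 32).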
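
The ranges in the audio table can be turned into a local pre-flight check. The following sketch is hypothetical and applies only the limits listed above. It treats the sample rate bounds as inclusive, which is an assumption, and it uses the table's defaults for any parameter that is absent.

```python
ALLOWED_CHANNELS = {0, 1, 2}
ALLOWED_DEPTHS = {8, 16, 24, 32}
ALLOWED_CODECS = {"aac", "libvorbis", "libopus"}


def validate_audio_parameters(parameters):
    """Return a list of problems found in the audio-related parameters."""
    problems = []
    channels = parameters.get("video_audio_num_channels", 0)
    if channels not in ALLOWED_CHANNELS:
        problems.append(f"video_audio_num_channels must be 0, 1, or 2, got {channels}")
    sample_rate = parameters.get("video_audio_sample_rate", 44100)
    if not 8000 <= sample_rate <= 96000:
        problems.append(f"video_audio_sample_rate out of range: {sample_rate}")
    depth = parameters.get("video_audio_depth", 16)
    if depth not in ALLOWED_DEPTHS:
        problems.append(f"video_audio_depth must be 8, 16, 24, or 32, got {depth}")
    # An absent codec means automatic selection, so only check explicit values.
    codec = parameters.get("video_audio_codec")
    if codec is not None and codec not in ALLOWED_CODECS:
        problems.append(f"unsupported video_audio_codec: {codec}")
    return problems


print(validate_audio_parameters({
    "video_audio_num_channels": 1,
    "video_audio_sample_rate": 16000,
}))
```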

Video benchmarking examples

Low-resolution video understanding



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 320,
        "video_height": 240,
        "video_fps": 2,
        "video_duration": 3.0,
        "video_synth_type": "moving_shapes",
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

High-resolution video benchmarking



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 1920,
        "video_height": 1080,
        "video_fps": 8,
        "video_duration": 10.0,
        "concurrency": 2,
        "request_count": 20,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

Video with audio for multimodal models



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "video_audio_num_channels": 1,
        "video_audio_sample_rate": 16000,
        "concurrency": 4,
        "request_count": 50,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

Mixed text and video

Combine video with a text prompt for video question answering or captioning workloads:



workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "video_width": 640,
        "video_height": 480,
        "video_fps": 4,
        "video_duration": 5.0,
        "prompt_input_tokens_mean": 100,
        "output_tokens_mean": 50,
        "concurrency": 8,
        "request_count": 100,
        "streaming": True,
        "tokenizer": "meta-llama/Llama-3.2-1B"
    }
}

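Performance considerations

Higher resolutions and frame rates increase video encoding time and payload size. For high-throughput tests, use lower resolutions (320×240 or 640×480).
VP9 (libvpx-vp9) in the WebM format is the only supported codec and provides good compression for load benchmarking.
Audio adds minimal overhead compared with the video stream. Use mono (1) at 16 kHz for speech-focused workloads.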
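
The effect of resolution, frame rate, and duration can be compared with simple arithmetic. The sketch below counts frames and uncompressed RGB bytes for the low-resolution and high-resolution examples. These raw figures are not the payload sizes that reach the endpoint, because VP9 compresses the video and the result depends on content. They only indicate the relative encoding work. The function names are hypothetical.

```python
import math


def video_frame_count(fps, duration_seconds):
    """Number of frames in a clip, rounded up to a whole frame."""
    return math.ceil(fps * duration_seconds)


def raw_video_bytes(width, height, fps, duration_seconds, bytes_per_pixel=3):
    """Uncompressed size of the clip, assuming 3 bytes per RGB pixel."""
    frames = video_frame_count(fps, duration_seconds)
    return width * height * bytes_per_pixel * frames


def base64_length(num_bytes):
    """Length of the padded base64 text that encodes num_bytes of data."""
    return 4 * math.ceil(num_bytes / 3)


low = raw_video_bytes(320, 240, fps=2, duration_seconds=3.0)
high = raw_video_bytes(1920, 1080, fps=8, duration_seconds=10.0)
print(f"low-resolution clip:  {low:,} raw bytes")
print(f"high-resolution clip: {high:,} raw bytes ({high // low}x)")
```

Base64 encoding adds roughly one third to whatever the encoded file size turns out to be, which base64_length makes explicit.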

Step 2: Monitor job status

Poll the job status until it reaches a terminal state.

Python (boto3)



import time

while True:
    response = client.describe_ai_benchmark_job(
        AIBenchmarkJobName="my-benchmark-job"
    )
    status = response["AIBenchmarkJobStatus"]
    print(f"Status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)

if status == "Completed":
    print(f"Results at: {response['OutputConfig']['S3OutputLocation']}")
elif status == "Failed":
    print(f"Job failed: {response.get('FailureReason', 'unknown')}")

AWS CLI



aws sagemaker describe-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --region us-west-2
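
For unattended runs, polling can be wrapped in a function with a timeout so that a stuck job doesn't block the caller indefinitely. This is a sketch built on the describe call and the terminal states named in this document. The function name and the injectable sleep and clock arguments are hypothetical conveniences for testing.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_benchmark_job(client, job_name, poll_seconds=30,
                           timeout_seconds=3600, sleep=time.sleep,
                           clock=time.monotonic):
    """Poll until the job reaches a terminal state, then return the response."""
    deadline = clock() + timeout_seconds
    while True:
        response = client.describe_ai_benchmark_job(AIBenchmarkJobName=job_name)
        if response["AIBenchmarkJobStatus"] in TERMINAL_STATES:
            return response
        # Give up if the job is still running after the deadline.
        if clock() >= deadline:
            raise TimeoutError(f"{job_name} did not finish in {timeout_seconds}s")
        sleep(poll_seconds)


# Example usage (requires AWS credentials, so it is left commented out):
# final = wait_for_benchmark_job(client, "my-benchmark-job")
# print(final["AIBenchmarkJobStatus"])
```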

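Step 3: View benchmark results

Benchmark results are written to the Amazon S3 output location that you specified. The results include the following key metrics:

request_throughput: Requests per second.
request_latency: End-to-end request latency with percentile breakdowns (P50, P90, P99).
time_to_first_token: Time from submitting the request to receiving the first token.
inter_token_latency: Time between consecutive output tokens.
output_token_throughput: Output tokens generated per second.

Each metric includes a statistical summary: mean, minimum, maximum, P50, P90, P99, and standard deviation.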
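
This document doesn't list the names of the result files, so a reasonable first step is to enumerate whatever the job wrote under the output prefix. The sketch below splits the S3 URI and lists object keys with the standard list_objects_v2 paginator. The helper names are hypothetical, and the S3 client is passed in so that the code can be tested with a stub.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_result_keys(s3_client, output_location):
    """Return every object key stored under the benchmark output location."""
    bucket, prefix = parse_s3_uri(output_location)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# keys = list_result_keys(boto3.client("s3"), "s3://DOC-EXAMPLE-BUCKET/benchmark-results/")
```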

Manage benchmarking resources

Use the following operations to manage your benchmark jobs and workload configurations.



# List benchmark jobs
response = client.list_ai_benchmark_jobs(MaxResults=10)
for job in response["AIBenchmarkJobs"]:
    print(f"{job['AIBenchmarkJobName']} - {job['AIBenchmarkJobStatus']}")

# Stop a running job
client.stop_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# Delete a job (must be in a terminal state)
client.delete_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# List workload configurations
response = client.list_ai_workload_configs(MaxResults=10)
for config in response["AIWorkloadConfigs"]:
    print(f"{config['AIWorkloadConfigName']} - {config['AIWorkloadConfigArn']}")

# Delete a workload configuration
client.delete_ai_workload_config(
    AIWorkloadConfigName="my-benchmark-config"
)
