

This content is a machine translation. If this translation conflicts with the original English version, the English version takes precedence.

# Model evaluation job submission
<a name="model-customize-open-weight-evaluation"></a>

This section covers evaluation for open-weight custom models. It walks you through the evaluation job submission workflow and provides additional resources for more advanced evaluation job submission use cases.

**Topics**
+ [Getting started](model-customize-evaluation-getting-started.md)
+ [Evaluation types and job submission](model-customize-evaluation-types.md)
+ [Evaluation metrics formats](model-customize-evaluation-metrics-formats.md)
+ [Supported dataset formats for bring-your-own-dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md)
+ [Evaluating with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md)

# Getting started
<a name="model-customize-evaluation-getting-started"></a>

## Submit your evaluation job through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to Evaluate from your model card
<a name="model-customize-evaluation-studio-step1"></a>

After customizing your model, navigate to the evaluation page from its model card.

For information about open-weight custom model training, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html).

SageMaker displays your custom models on the My Models tab:

![\[Registered model card page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, then choose Evaluate:

![\[Model customization page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-studio-step2"></a>

Choose the Submit button to submit your evaluation job. This submits a minimal MMLU benchmark job.

For information about supported evaluation task types, see [Evaluation types and job submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-studio-step3"></a>

Your evaluation job's progress is tracked in the evaluation Steps tab:

![\[Your evaluation job progress\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View your evaluation job results
<a name="model-customize-evaluation-studio-step4"></a>

Your evaluation job results appear in the Evaluation results tab:

![\[Your evaluation job metrics\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View your completed evaluation
<a name="model-customize-evaluation-studio-step5"></a>

Your completed evaluation job appears under Evaluations on the model card:

![\[Your completed evaluation job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit your evaluation job through the SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create your BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Initialize a `BenchMarkEvaluator` by passing in your registered trained model, an AWS S3 output location, and an MLflow resource ARN.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark  
  
evaluator = BenchMarkEvaluator(  
    benchmark=Benchmark.MMLU,  
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",  
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",  
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",  
    evaluate_base_model=False  
)
```

### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-sdk-step2"></a>

Call the `evaluate()` method to submit the evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-sdk-step3"></a>

Call the execution's `wait()` method for live updates on your evaluation job's progress.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View your evaluation job results
<a name="model-customize-evaluation-sdk-step4"></a>

Call the `show_results()` method to display your evaluation job's results.

```
execution.show_results()
```

# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the benchmark evaluation type to evaluate your model's quality on standardized benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Supports custom dataset | Modality | Description | Metric | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Massive multitask language understanding: tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU professional subset, focused on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks: a suite of challenging problems that test higher-order cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | General physics question answering: assesses understanding of physics concepts and related problem-solving skills. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical problem solving: measures mathematical reasoning across algebra, calculus, word problems, and more. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-control task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation: measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information about BYOD formats, see [Supported dataset formats for bring-your-own-dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

### Available subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the subtasks available for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), MATH, and StrongReject. These subtasks let you assess model performance in specific capabilities and knowledge areas.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**MATH subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", 
    "gcg_transfer_universal_attacks",
    "combination_3", 
    "combination_2", 
    "few_shot_json", 
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title", 
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only", 
    "evil_confidant", 
    "aim", 
    "rot_13",
    "disemvowel", 
    "auto_obfuscation", 
    "auto_payload_splitting", 
    "pair",
    "pap_authority_endorsement", 
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement", 
    "pap_logical_appeal", 
    "pap_misrepresentation"
]
```

### Submit your benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

## LLM-as-a-judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to score your target model's responses with another frontier model. Start the evaluation job, which uses an Amazon Bedrock model as the judge, by calling the `create_evaluation_job` API.

For more information about supported judge models, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)

You can define your evaluation using two different metric formats:
+ **Built-in metrics:** Use Amazon Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)
+ **Custom metrics:** Define your own custom metrics in the Bedrock evaluation custom-metric format, analyzing the quality of your model's inference responses with your own instructions. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

### Submit a built-in metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for LLMAJ benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

### Submit a custom metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric:

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\nConsider the following:\n- Does the response have a positive, encouraging tone?\n- Is the response helpful and constructive?\n- Does it avoid negative language or criticism?\n\nRate on this scale:\n- Good: Response has positive sentiment\n- Poor: Response lacks positive sentiment\n\nHere is the actual task:\nPrompt: {{prompt}}\nResponse: {{prediction}}",
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```

For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

------
#### [ SageMaker Studio ]

![\[Upload your custom metric through Custom metrics > Add custom metric\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

## Custom scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers, Prime Math and Prime Code, or you can bring your own scorer function. You can either paste in your scorer function code directly or bring your own Lambda function definition, referenced by its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements and contracts, see [Evaluating with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Bring your own dataset for custom scorers by registering it as a SageMaker Hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset from the dedicated Datasets page.

![\[Registered evaluation dataset in SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset as follows:

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Choose a built-in custom scorer from Code Execution or Math Answers\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Choose `BuiltInMetric.PRIME_MATH` or `BuiltInMetric.PRIME_CODE` as the built-in scorer.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define your custom reward function. For more information, see [Custom scorers (bring your own metric)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom).

**Register your custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit a custom scorer evaluation job under Custom Scorer > Custom Metrics, referencing the registered preset\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)
```

```
evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

# Evaluation metrics formats
<a name="model-customize-evaluation-metrics-formats"></a>

Evaluate your model's quality with the following metrics formats:
+ Model evaluation summary
+ MLflow
+ TensorBoard

## Model evaluation summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit an evaluation job, you specify an AWS S3 output location. SageMaker automatically uploads an evaluation summary `.json` file to that location. The benchmark summary S3 path is as follows:

```
s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```
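
As a small convenience sketch (the helper name and the example job names are hypothetical, not part of the SDK), the summary prefix can be assembled from its components:

```python
def eval_results_prefix(s3_location: str, training_job_name: str,
                        evaluation_job_name: str) -> str:
    """Build the S3 prefix where the evaluation summary lands, following the
    documented pattern. The repeated "output/output" segment is part of the
    documented path, not a typo."""
    base = s3_location.rstrip("/")
    return (f"{base}/{training_job_name}/output/output/"
            f"{evaluation_job_name}/eval_results/")

print(eval_results_prefix("s3://my-bucket/eval", "my-training-job", "my-eval-job"))
```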

**Pass your AWS S3 location**

------
#### [ SageMaker Studio ]

![\[Pass your output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the `.json` file directly from the AWS S3 location, or view it visualized automatically in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
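
Once downloaded, the summary is plain JSON; as a minimal sketch, the aggregate metrics can be pulled out of the `results.all` section (using an abbreviated stand-in for the file shown above):

```python
import json

# Abbreviated stand-in for the downloaded evaluation summary shown above.
summary_text = """
{
  "results": {
    "all": {
      "rouge1": 0.9152812653966208,
      "f1": 0.8428757602152095,
      "em": 0.6562150055991042
    }
  }
}
"""

summary = json.loads(summary_text)
aggregate = summary["results"]["all"]
for name in sorted(aggregate):
    print(f"{name}: {aggregate[name]:.4f}")
```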

![\[Sample performance metrics for a custom gen-qa benchmark visualized in SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLflow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide your SageMaker MLflow resource ARN**

When you first use model customization, SageMaker Studio uses the default MLflow app configured on each Studio domain. SageMaker Studio uses the ARN associated with that default MLflow app when you submit an evaluation job.

You can also provide an MLflow resource ARN explicitly when you submit your evaluation job, so that metrics stream to the associated tracking server/app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Visualize model-level and system-level metrics:

![\[Sample model-level error and accuracy for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Sample built-in metrics for an LLMAJ benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Sample system-level metrics for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Submit your evaluation job with an AWS S3 output location. SageMaker automatically uploads the TensorBoard files to that location.

SageMaker uploads the TensorBoard files to the following AWS S3 location:

```
s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pass your AWS S3 location as follows**

------
#### [ SageMaker Studio ]

![\[Pass your output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Sample model-level metrics**

![\[SageMaker TensorBoard displaying benchmark job results\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Supported dataset formats for bring-your-own-dataset (BYOD) tasks
<a name="model-customize-evaluation-dataset-formats"></a>

The custom scorer and LLM-as-a-judge evaluation types require a custom dataset as a JSONL file located in AWS S3. You must provide it as a JSON Lines file that conforms to one of the supported formats below. The examples in this document are expanded for clarity.

Each format has its own nuances, but at a minimum all formats require a user prompt.


**Required fields**  

| Field | Required | 
| --- | --- | 
| User prompt | Yes | 
| System prompt | No | 
| Ground truth | Custom scorers only | 
| Category | No | 
| 类别 | 否 | 
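As an illustration of the rule above, this hypothetical check (`detect_format` is not part of the SDK) classifies a record by its top-level keys and rejects records with no recognizable user prompt; the evaluation service performs its own validation:

```python
def detect_format(record: dict) -> str:
    """Classify a BYOD JSONL record by its top-level keys (illustrative only).
    The individual formats are detailed in the sections that follow."""
    if "messages" in record:
        return "openai"
    if "query" in record:
        return "sagemaker"
    if "chosen" in record and "rejected" in record:
        return "hf_preference"
    if "prompt" in record and "completion" in record:
        return "hf_prompt_completion"
    if "prompt" in record:
        return "verl"
    raise ValueError("record has no recognizable user-prompt field")

print(detect_format({"query": "What symbol ends a question?", "response": "?"}))
```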

**1. OpenAI format**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```

**2. SageMaker evaluation format**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace prompt-completion**

Both the standard and conversational formats are supported.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace preference**

Supports the standard format (strings) and the conversational format (message arrays).

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```

**5. VERL format**

The VERL format (both the current and the legacy variant) is supported for reinforcement learning use cases. VERL documentation for reference: [https://verl.readthedocs.io/en/latest/preparation/prepare_data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

VERL-format users typically do not provide a ground-truth response. To provide one anyway, use either the `extra_info.answer` or `reward_model.ground_truth` field; `extra_info.answer` takes precedence.

If present, SageMaker preserves the following VERL-specific fields as metadata:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```
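
The precedence described above can be sketched as a small helper (illustrative only; `resolve_ground_truth` is not part of the SDK):

```python
def resolve_ground_truth(entry: dict):
    """Return the ground truth for a VERL entry, preferring
    extra_info.answer over reward_model.ground_truth."""
    answer = entry.get("extra_info", {}).get("answer")
    if answer is not None:
        return answer
    return entry.get("reward_model", {}).get("ground_truth")

entry = {
    "reward_model": {"ground_truth": "72"},
    "extra_info": {"answer": "48 + 24 = 72\n#### 72"},
}
print(resolve_ground_truth(entry))  # extra_info.answer wins over reward_model
```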

# Evaluating with preset and custom scorers
<a name="model-customize-evaluation-preset-custom-scorers"></a>

When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also called reward functions), Prime Math and Prime Code, taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, or your own custom scorer implemented as a Lambda function.

## Built-in scorers
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

The Prime Math scorer expects a custom JSONL dataset whose entries contain a math problem as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats listed in [Supported dataset formats for bring-your-own-dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

Sample dataset entry (expanded for clarity):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```

**Prime Code**

The Prime Code scorer expects a custom JSONL dataset containing coding problems, with test cases specified in the `metadata` field. The test cases for each entry are constructed from the expected function name, sample inputs, and expected outputs.

Sample dataset entry (expanded for clarity):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```
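
To make the `metadata` contract concrete, the sketch below checks a candidate solution against an entry's test cases. This is a simplified illustration, not the actual Prime Code harness, and `factorialNumbers` here is one hypothetical correct answer to the problem above:

```python
from ast import literal_eval

# A hypothetical candidate solution to the problem in the entry above.
def factorialNumbers(n: int) -> list:
    result, factorial, i = [], 1, 1
    while factorial <= n:
        result.append(factorial)
        i += 1
        factorial *= i
    return result

# Test cases as they appear in the entry's metadata field.
metadata = {"fn_name": "factorialNumbers", "inputs": ["5"], "outputs": ["[1, 2]"]}

for raw_input, raw_output in zip(metadata["inputs"], metadata["outputs"]):
    actual = factorialNumbers(int(raw_input))
    expected = literal_eval(raw_output)
    assert actual == expected, f"{actual} != {expected}"
print("all test cases passed")
```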

## Custom scorers (bring your own metric)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Fully customize your model evaluation workflow with your own post-processing logic, computing custom metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Sample Lambda input payload
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

Your custom AWS Lambda function requires input in OpenAI format. Example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Sample Lambda output payload
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

The SageMaker evaluation container expects your Lambda response to follow this format:

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### Custom Lambda definition
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

For an example of a fully implemented custom scorer with sample input and expected output, see: [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example)

Use the following skeleton as a starting point for your own function.

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```
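
As a minimal, self-contained illustration of this contract, the grader below scores each sample by exact match between the assistant's reply and the sample's `reference_answer.explanation` field; any real scorer would substitute its own logic:

```python
def lambda_grader(samples: list) -> list:
    results = []
    for sample in samples:
        # Treat the last assistant message as the model's prediction.
        prediction = next(
            (m["content"] for m in reversed(sample["messages"])
             if m["role"] == "assistant"), "")
        reference = sample.get("reference_answer", {}).get("explanation", "")
        score = 1.0 if prediction == reference else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "exact_match", "value": score, "type": "Metric"},
            ],
        })
    return results

def lambda_handler(event, context):
    return lambda_grader(event)
```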

### Input and output fields
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**Input fields**


| Field | Description | Notes | 
| --- | --- | --- | 
| id | Unique identifier for the sample | Echoed in the output; string | 
| messages | Ordered chat history in OpenAI format | Array of message objects | 
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" | 
| messages[].content | Text content of the message | Plain string | 
| metadata | Free-form information useful for scoring | Object; optional field passed through from the training data | 

**Output fields**  

| Field | Description | Notes | 
| --- | --- | --- | 
| id | Same identifier as the input sample | Must match the input | 
| aggregate_reward_score | Overall score for the sample | Float (for example, 0.0–1.0 or a task-defined range) | 
| metrics_list | Component scores that make up the aggregate | Array of metric objects | 

### Required permissions
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

Make sure that the SageMaker execution role you use to run the evaluation has AWS Lambda invoke permissions.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

Make sure that your AWS Lambda function's execution role has basic Lambda execution permissions, plus any additional permissions needed for downstream AWS calls.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```