# Evaluate with preset and custom scorers

When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also called "reward functions"), Prime Math and Prime Code, taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, or your own custom scorer implemented as a Lambda function.

## Built-in scorers

**Prime math**

The Prime math scorer expects a custom JSONL dataset in which each entry contains a math question as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats listed in [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

Example dataset entry (expanded for clarity):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```

**Prime code**

The Prime code scorer expects a custom JSONL dataset in which each entry contains a coding problem and test cases specified in the `metadata` field. Structure the test cases with the expected function name, sample inputs, and expected outputs for each entry.

Example dataset entry (expanded for clarity):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task: \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```

## Custom scorers (bring your own metrics)

Fully customize your model evaluation workflow with custom post-processing logic that lets you compute metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Sample Lambda input payload

Your custom AWS Lambda function expects input in the OpenAI format. Example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Sample Lambda output payload

The SageMaker evaluation container expects your Lambda response to follow this format:

```
{
    "id": str,                       # Same id as input sample
    "aggregate_reward_score": float, # Overall score for the sample
    "metrics_list": [                # OPTIONAL: Component scores
        {
            "name": str,             # Name of the component score
            "value": float,          # Value of the component score
            "type": str              # "Reward" or "Metric"
        }
    ]
}
```

### Custom Lambda definition

Find a fully implemented custom scorer with sample input and expected output at [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example).

Use the following skeleton as a starting point for your own function.

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format

        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }

    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                       # Same id as input sample
            "aggregate_reward_score": float, # Overall score for the sample
            "metrics_list": [                # OPTIONAL: Component scores
                {
                    "name": str,             # Name of the component score
                    "value": float,          # Value of the component score
                    "type": str              # "Reward" or "Metric"
                }
            ]
        }
    """
```

### Input and output fields

**Input fields**

| Field | Description | Additional notes |
| --- | --- | --- |
| id | Unique identifier for the sample | Echoed back unchanged in the output. String format |
| messages | Ordered chat history in OpenAI format | Array of message objects |
| messages[].role | Sender of the message | Common values: user, assistant, system |
| messages[].content | Text content of the message | Plain string |
| metadata | Free-form information that helps with scoring | Object; optional fields passed through from the training data |

**Output fields**

| Field | Description | Additional notes |
| --- | --- | --- |
| id | Same identifier as the input sample | Must match the input |
| aggregate\_reward\_score | Overall score for the sample | Float (for example, 0.0 to 1.0 or a task-defined range) |
| metrics\_list | Individual scores that make up the aggregate score | Array of metric objects |

### Required permissions

Make sure that the SageMaker execution role you use to run the evaluation has AWS Lambda permissions.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

Make sure that your AWS Lambda function's execution role has basic Lambda execution permissions, plus any additional permissions that downstream AWS calls might require.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
```
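
To make the Lambda input and output contract from the custom scorer section more concrete, here is a minimal sketch of one way to fill in the skeleton. It is an illustration under stated assumptions, not the reference implementation. The helper `score_sample`, the crude prefix-match rule, and the metric names `prefix_match` and `response_length` are all hypothetical. The sketch also assumes the event may arrive as either a single sample or a list of samples, and that `reference_answer` is either a plain string or an object with a `compliant` field, as in the sample payload.

```python
from typing import Any


def score_sample(sample: dict[str, Any]) -> dict[str, Any]:
    """Score one OpenAI-format sample with a hypothetical prefix-match rule."""
    # Take the last assistant message as the model response.
    response = next(
        (
            m.get("content", "")
            for m in reversed(sample.get("messages", []))
            if m.get("role") == "assistant"
        ),
        "",
    )

    # Assumption: the reference is a plain string or an object that carries
    # the expected label in a "compliant" field.
    reference = sample.get("reference_answer", "")
    if isinstance(reference, dict):
        reference = reference.get("compliant", "")
    expected = str(reference).strip().lower()
    answer = response.strip().lower()

    # Toy rule: reward 1.0 when the response starts with the expected label.
    prefix_match = 1.0 if expected and answer.startswith(expected) else 0.0

    return {
        "id": sample["id"],  # must match the input id
        "aggregate_reward_score": prefix_match,
        "metrics_list": [
            {"name": "prefix_match", "value": prefix_match, "type": "Reward"},
            {"name": "response_length", "value": float(len(response)), "type": "Metric"},
        ],
    }


def lambda_grader(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one score dictionary per input sample."""
    return [score_sample(sample) for sample in samples]


def lambda_handler(event, context):
    # Assumption: accept either a single sample or a list of samples.
    samples = event if isinstance(event, list) else [event]
    return lambda_grader(samples)
```

A prefix match is deliberately simplistic (a response that begins with "Not sure" would also match the label "No"). In practice you would replace the rule inside `score_sample` with logic suited to your task, while keeping the `id`, `aggregate_reward_score`, and optional `metrics_list` fields in the output.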