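Evaluate with preset and custom scorers
When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also called "reward functions"), Prime Math and Prime Code, taken from the volcengine/verl RL library.
Built-in scorers
Prime Math
The Prime Math scorer expects a custom JSONL dataset in which each entry contains a math problem as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats listed in Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks.
Example dataset entry (expanded for clarity):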
{
  "system":"You are a math expert: ",
  "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
  "response":"2" # Ground truth aka correct answer
}
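The entry above is shown expanded and annotated; in an actual JSONL file each entry occupies one line and cannot carry `#` comments, because JSON has no comment syntax. The following is a minimal sketch of how such a file could be produced. The helper name `to_jsonl` is hypothetical, and treating `query` and `response` as the required keys is an assumption based on the example rather than a documented rule.

```python
import json


def to_jsonl(entries):
    """Serialize dataset entries as JSONL: one compact JSON object per line."""
    lines = []
    for entry in entries:
        # Assumption: "query" (the problem) and "response" (the ground truth)
        # are the keys the scorer needs; "system" is treated as optional here.
        missing = [key for key in ("query", "response") if key not in entry]
        if missing:
            raise ValueError(f"entry is missing required keys: {missing}")
        lines.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(lines) + "\n"


entries = [
    {
        "system": "You are a math expert: ",
        "query": "How many vertical asymptotes does the graph of "
                 "$y=\\frac{2}{x^2+x-6}$ have?",
        "response": "2",  # ground truth, stored as a string
    }
]

jsonl_text = to_jsonl(entries)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file with a `.jsonl` extension gives one self-contained record per line, which is what the expanded example represents.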
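Prime Code
The Prime Code scorer expects a custom JSONL dataset in which each entry contains a coding problem and test cases specified in the metadata field. Structure the test cases with the expected function name, sample inputs, and expected outputs for each entry.
Example dataset entry (expanded for clarity):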
{
  "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
  "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task: \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
  "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
  ### Define test cases in metadata field
  "metadata": {
    "fn_name": "factorialNumbers",
    "inputs": ["5"],
    "outputs": ["[1, 2]"]
  }
}
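To make the role of `fn_name`, `inputs`, and `outputs` concrete, here is a rough sketch of how test cases shaped like the `metadata` field above could be checked against a candidate solution. It is not the Prime Code implementation. The helper `run_metadata_tests` is hypothetical, and it assumes that each input string is a Python literal passed as a single positional argument and that each output string is a Python literal compared for equality.

```python
import ast


def run_metadata_tests(solution_source, metadata):
    """Run each (input, expected output) pair against the named function.

    Returns a list of booleans, one per test case.
    """
    namespace = {}
    # Assumption: the solution is trusted. A real scorer should run
    # model-generated code in a sandbox, not with a bare exec().
    exec(solution_source, namespace)
    fn = namespace[metadata["fn_name"]]

    results = []
    for raw_input, raw_output in zip(metadata["inputs"], metadata["outputs"]):
        argument = ast.literal_eval(raw_input)    # "5" becomes the int 5
        expected = ast.literal_eval(raw_output)   # "[1, 2]" becomes a list
        try:
            results.append(fn(argument) == expected)
        except Exception:
            # A crashing solution counts as a failed test case.
            results.append(False)
    return results


candidate = '''
def factorialNumbers(N):
    result, fact, i = [], 1, 1
    while fact <= N:
        result.append(fact)
        i += 1
        fact *= i
    return result
'''

metadata = {"fn_name": "factorialNumbers", "inputs": ["5"], "outputs": ["[1, 2]"]}
print(run_metadata_tests(candidate, metadata))
```

Running the same cases locally before submitting a dataset is a cheap way to confirm that the expected outputs in `metadata` are themselves correct.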
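Custom scorers (bring your own metrics)
Fully customize your model evaluation workflow with custom post-processing logic that lets you compute metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.
Sample Lambda input payload
Your custom AWS Lambda function expects inputs in the OpenAI format. Example: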
{
  "id": "123",
  "messages": [
    {
      "role": "user",
      "content": "Do you have a dedicated security team?"
    },
    {
      "role": "assistant",
      "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
    }
  ],
  "reference_answer": {
    "compliant": "No",
    "explanation": "As an AI developed by Company, I do not have a traditional security team..."
  }
}
Sample Lambda output payload
The SageMaker evaluation container expects your Lambda responses to follow this format:
{
  "id": str,                         # Same id as input sample
  "aggregate_reward_score": float,   # Overall score for the sample
  "metrics_list": [                  # OPTIONAL: Component scores
    {
      "name": str,                   # Name of the component score
      "value": float,                # Value of the component score
      "type": str                    # "Reward" or "Metric"
    }
  ]
}
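Custom Lambda definition
Find a fully implemented custom scorer, with sample input and expected output, in the nova-reward-llm-judge example at: https://docs.aws.amazon.com/sagemaker/latest/dg/nova-.html#-implementing-reward-functions
Use the following skeleton as a starting point for your own function.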
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format

        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }

    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                         # Same id as input sample
            "aggregate_reward_score": float,   # Overall score for the sample
            "metrics_list": [                  # OPTIONAL: Component scores
                {
                    "name": str,               # Name of the component score
                    "value": float,            # Value of the component score
                    "type": str                # "Reward" or "Metric"
                }
            ]
        }
    """
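The skeleton leaves the scoring logic open. As one possible way to fill it in, the sketch below rewards a sample when the last assistant message begins with the label stored in `reference_answer["compliant"]`. The scoring rule, the `label_match` metric name, and the `last_assistant_text` helper are all hypothetical and chosen only for illustration; like the skeleton, it assumes the event is a list of samples.

```python
def last_assistant_text(messages):
    """Return the content of the most recent assistant message, or ""."""
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""


def lambda_grader(samples):
    """Score each sample and return records in the documented output shape."""
    results = []
    for sample in samples:
        answer = last_assistant_text(sample.get("messages", []))
        reference = sample.get("reference_answer", {})
        expected = str(reference.get("compliant", "")).strip().lower()

        # Hypothetical rule: full reward when the response starts with the
        # expected label (case-insensitive), otherwise no reward.
        matched = bool(expected) and answer.strip().lower().startswith(expected)
        score = 1.0 if matched else 0.0

        results.append({
            "id": sample["id"],                  # echo the input id
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "label_match", "value": score, "type": "Metric"},
            ],
        })
    return results


def lambda_handler(event, context):
    # As in the skeleton, the event is passed straight to the grader.
    return lambda_grader(event)
```

A prefix match is deliberately simplistic. For free-form answers, a stricter parser or an LLM-as-judge approach, as in the linked example, is usually more appropriate.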
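Input and output fields
Input fields
| Field | Description | Additional notes |
|---|---|---|
| id | Unique identifier for the sample | Echoed back in the output. String format |
| messages | Ordered chat history in OpenAI format | Array of message objects |
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" |
| messages[].content | Text content of the message | Plain string |
| metadata | Free-form information to aid grading | Object; optional fields passed from the training data |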
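Output fields
| Field | Description | Additional notes |
|---|---|---|
| id | Same identifier as the input sample | Must match the input |
| aggregate_reward_score | Overall score for the sample | Float (for example, 0.0 to 1.0 or a task-defined range) |
| metrics_list | Component scores that make up the aggregate | Array of metric objects |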
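Because the evaluation container expects a specific response shape, it can help to check grader output locally before deploying. The following is a minimal sketch of such a check based only on the fields described above; the `validate_result` helper is hypothetical and is not part of any SageMaker or Lambda API.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_result(sample_id, result):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    if result.get("id") != sample_id:
        problems.append("id does not match the input sample")
    if not _is_number(result.get("aggregate_reward_score")):
        problems.append("aggregate_reward_score must be a number")

    # metrics_list is optional, so a missing key is not an error.
    for index, metric in enumerate(result.get("metrics_list", [])):
        if not isinstance(metric.get("name"), str):
            problems.append(f"metrics_list[{index}].name must be a string")
        if not _is_number(metric.get("value")):
            problems.append(f"metrics_list[{index}].value must be a number")
        if metric.get("type") not in ("Reward", "Metric"):
            problems.append(f'metrics_list[{index}].type must be "Reward" or "Metric"')
    return problems


good = {
    "id": "123",
    "aggregate_reward_score": 0.5,
    "metrics_list": [{"name": "accuracy", "value": 0.5, "type": "Metric"}],
}
print(validate_result("123", good))
```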
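Required permissions
Ensure that the SageMaker execution role you use to run the evaluation has AWS Lambda permissions, specifically permission to invoke your scorer function.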
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"
    }
  ]
}
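The `region`, `account-id`, and `function-name` segments in the policy above are placeholders. As a small sketch, the policy document could be rendered for a specific function as follows. The helper name `build_invoke_policy` and the region, account ID, and function name passed to it are made-up illustration values.

```python
import json


def build_invoke_policy(region, account_id, function_name):
    """Fill the placeholder ARN segments of the invoke policy shown above."""
    arn = f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": arn,
            }
        ],
    }


# Hypothetical values for illustration only.
policy = build_invoke_policy("us-east-1", "111122223333", "my-custom-scorer")
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to a single function ARN, rather than a wildcard, limits the role to invoking only the scorer it needs.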
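Ensure that your AWS Lambda function's execution role has basic Lambda execution permissions, as well as any additional permissions it may need for downstream AWS calls.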
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}