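Evaluate with preset and custom scorers

When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also called "reward functions"), Prime Math and Prime Code, which are taken from the volcengine/verl RL training library. Alternatively, you can supply your own custom scorer implemented as a Lambda function.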

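Built-in scorers

Prime Math

The Prime Math scorer expects a custom JSONL dataset in which each entry contains a math question as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats described in Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks.

Example dataset entry (expanded for clarity):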


{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
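The entry above is expanded and annotated for readability; the `#` comment is not valid JSON, and a JSONL file holds one complete JSON object per line. The following is a minimal sketch of how such a line could be produced with the Python standard library. The helper name `to_jsonl_line` is hypothetical and not part of any SageMaker API.

```python
import json


def to_jsonl_line(system: str, query: str, response: str) -> str:
    """Serialize one Prime Math style entry as a single JSONL line."""
    entry = {"system": system, "query": query, "response": response}
    # json.dumps escapes backslashes and newlines, so the result stays on one line.
    return json.dumps(entry, ensure_ascii=False)


line = to_jsonl_line(
    "You are a math expert: ",
    "How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "2",  # ground truth, stored as a string as in the example entry
)
print(line)
```

Writing each returned line, followed by a newline character, to a file yields a dataset in the shape shown above.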

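Prime Code

The Prime Code scorer expects a custom JSONL dataset in which each entry contains a coding problem and test cases specified in the metadata field. Construct the test cases from the expected function name, sample inputs, and expected outputs for each entry.

Example dataset entry (expanded for clarity):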


{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
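Before submitting a dataset, it can help to confirm that the test cases in metadata agree with a known-good solution. The sketch below does this locally for the factorialNumbers entry. It assumes that each string in inputs and outputs is a JSON-encoded value, which matches the example but is not confirmed here as the scorer's exact behavior. The helper `check_metadata` and the reference solution are illustrative only.

```python
import json


def factorialNumbers(n: int) -> list[int]:
    """Reference solution: all factorial numbers less than or equal to n."""
    result, fact, k = [], 1, 1
    while fact <= n:
        result.append(fact)
        k += 1
        fact *= k
    return result


def check_metadata(fn, metadata: dict) -> bool:
    """Return True if fn reproduces every expected output in metadata."""
    inputs, outputs = metadata["inputs"], metadata["outputs"]
    if len(inputs) != len(outputs):
        return False
    # Assumption: inputs and outputs are JSON-encoded strings, one per test case.
    return all(
        fn(json.loads(raw_in)) == json.loads(raw_out)
        for raw_in, raw_out in zip(inputs, outputs)
    )


metadata = {"fn_name": "factorialNumbers", "inputs": ["5"], "outputs": ["[1, 2]"]}
print(check_metadata(factorialNumbers, metadata))
```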

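Custom scorers (bring your own metrics)

Fully customize your model evaluation workflow with custom post-processing logic that lets you compute metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

Example Lambda input payload

Your custom AWS Lambda function expects input in OpenAI format. Example: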


{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}

Example Lambda output payload

The SageMaker evaluation container expects your Lambda response to follow this format:


{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
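A quick local check of a scorer's return value against the format above can catch shape errors before an evaluation job runs. The following is a minimal sketch; `validate_result` is a hypothetical helper, not part of SageMaker, and it assumes that any real number (int or float) is acceptable where the format says float.

```python
from numbers import Real

ALLOWED_METRIC_TYPES = {"Reward", "Metric"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_result(result: dict, sample_id: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    if result.get("id") != sample_id:
        problems.append("id must equal the id of the input sample")
    if not _is_number(result.get("aggregate_reward_score")):
        problems.append("aggregate_reward_score must be a number")
    # metrics_list is optional, so a missing key is not an error.
    for i, metric in enumerate(result.get("metrics_list", [])):
        if not isinstance(metric.get("name"), str):
            problems.append(f"metrics_list[{i}].name must be a string")
        if not _is_number(metric.get("value")):
            problems.append(f"metrics_list[{i}].value must be a number")
        if metric.get("type") not in ALLOWED_METRIC_TYPES:
            problems.append(f"metrics_list[{i}].type must be 'Reward' or 'Metric'")
    return problems
```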

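Custom Lambda definition

Find a fully implemented example of a custom scorer, with sample input and expected output, at: https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example

Use the following skeleton as a starting point for your own function.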


def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
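The skeleton above leaves the scoring logic to you. As an illustration only, the sketch below fills it in with a toy rule: the score is the fraction of words in reference_answer.explanation that also appear in the last assistant message. The rule, the helper names, and the metric name explanation_word_overlap are assumptions made for this example, not a recommended or official scoring method.

```python
def _last_assistant_text(messages: list[dict]) -> str:
    """Return the content of the most recent assistant message, or ""."""
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""


def _word_overlap(candidate: str, reference: str) -> float:
    """Fraction of distinct reference words that also occur in candidate."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    return len(ref_words & set(candidate.lower().split())) / len(ref_words)


def lambda_grader(samples: list[dict]) -> list[dict]:
    results = []
    for sample in samples:
        reply = _last_assistant_text(sample.get("messages", []))
        reference = sample.get("reference_answer") or {}
        # reference_answer is free-form; fall back to its string form if not a dict.
        if isinstance(reference, dict):
            explanation = reference.get("explanation", "")
        else:
            explanation = str(reference)
        score = _word_overlap(reply, explanation)
        results.append({
            "id": sample["id"],  # must match the input sample id
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "explanation_word_overlap", "value": score, "type": "Metric"}
            ],
        })
    return results


def lambda_handler(event, context):
    # As in the skeleton, the event is passed through as the list of samples.
    return lambda_grader(event)
```

Calling lambda_handler with a list containing the example input returns one result dictionary per sample, each carrying the same id as its input.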

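Input and output fields

Input fields

Field	Description	Additional notes
id	Unique identifier for the sample	Echoed back in the output. String format
messages	Ordered chat history in OpenAI format	Array of message objects
messages[].role	Speaker of the message	Common values: user, assistant, system
messages[].content	Text content of the message	Plain string
metadata	Free-form information to aid grading	Object; optional fields passed from the training data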

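Output fields

Field	Description	Additional notes
id	Same identifier as the input sample	Must match the input
aggregate_reward_score	Overall score for the sample	Float (for example, 0.0 to 1.0 or a task-defined range)
metrics_list	Component scores that make up the aggregate score	Array of metric objects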

Required permissions

Ensure that the SageMaker execution role you use to run the evaluation has permission to invoke your AWS Lambda function.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
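In the policy above, region, account-id, and function-name in the Resource ARN are placeholders that you replace with your own values. As a small sketch, the hypothetical helper below fills them in and prints the resulting policy document. The region, account ID, and function name in the usage lines are made-up example values.

```python
import json


def build_invoke_policy(region: str, account_id: str, function_name: str) -> str:
    """Return the invoke policy above with the ARN placeholders filled in."""
    arn = f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": arn,
            }
        ],
    }
    return json.dumps(policy, indent=4)


# Example values only; substitute your own region, account ID, and function name.
print(build_invoke_policy("us-east-1", "111122223333", "my-custom-scorer"))
```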

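Ensure that the execution role of your AWS Lambda function has basic Lambda execution permissions, plus any additional permissions that downstream AWS calls might require.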


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
