# Evaluate with preset and custom scorers

When using the custom scorer evaluation type, SageMaker Evaluation supports two built-in scorers (also called "reward functions"), Prime Math and Prime Code, taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, or your own custom scorer implemented as a Lambda function.

## Built-in scorers

**Prime Math**

The Prime Math scorer expects a custom JSONL dataset whose entries contain a math question as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats described in [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md). The `response` field holds the ground truth, that is, the correct answer.

Example dataset entry (expanded for clarity):

```
{
  "system":"You are a math expert: ",
  "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
  "response":"2"
}
```

**Prime Code**

The Prime Code scorer expects a custom JSONL dataset whose entries contain a coding problem and test cases specified in the `metadata` field. Structure the test cases with the expected function name, sample inputs, and expected outputs for each entry. The `response` field is a dummy ground-truth string; provide a value only if you want NLP metrics such as ROUGE, BLEU, and F1.

Example dataset entry (expanded for clarity):

```
{
  "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
  "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task: \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
  "response": "",
  "metadata": {
    "fn_name": "factorialNumbers",
    "inputs": ["5"],
    "outputs": ["[1, 2]"]
  }
}
```

## Custom scorers (bring your own metrics)

Fully customize your model evaluation workflow with custom post-processing logic, which lets you compute metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Sample Lambda input payload

Your custom AWS Lambda function receives input in the OpenAI format. Example:

```
{
  "id": "123",
  "messages": [
    {
      "role": "user",
      "content": "Do you have a dedicated security team?"
    },
    {
      "role": "assistant",
      "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
    }
  ],
  "reference_answer": {
    "compliant": "No",
    "explanation": "As an AI developed by Company, I do not have a traditional security team..."
  }
}
```

### Sample Lambda output payload

The SageMaker evaluation container expects your Lambda response to follow this format:

```
{
  "id": str,                        # Same id as input sample
  "aggregate_reward_score": float,  # Overall score for the sample
  "metrics_list": [                 # OPTIONAL: Component scores
    {
      "name": str,                  # Name of the component score
      "value": float,               # Value of the component score
      "type": str                   # "Reward" or "Metric"
    }
  ]
}
```

### Custom Lambda definition

Find a fully implemented custom scorer example, with sample input and expected output, at [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example).

Use the following skeleton as a starting point for your own function.

```
def lambda_handler(event, context):
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format

        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }

    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                        # Same id as input sample
            "aggregate_reward_score": float,  # Overall score for the sample
            "metrics_list": [                 # OPTIONAL: Component scores
                {
                    "name": str,              # Name of the component score
                    "value": float,           # Value of the component score
                    "type": str               # "Reward" or "Metric"
                }
            ]
        }
    """
```

### Input and output fields

**Input fields**

| Field | Description | Additional notes |
| --- | --- | --- |
| id | Unique identifier for the sample | Echoed back in the output. String format |
| messages | Ordered chat history in OpenAI format | Array of message objects |
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" |
| messages[].content | Text content of the message | Plain string |
| metadata | Free-form information to aid grading | Object; optional fields passed through from the training data |

**Output fields**

| Field | Description | Additional notes |
| --- | --- | --- |
| id | Same identifier as the input sample | Must match the input |
| aggregate_reward_score | Overall score for the sample | Float (for example, 0.0 to 1.0 or a task-defined range) |
| metrics_list | Component scores that make up the aggregate | Array of metric objects |

### Required permissions

Make sure that the SageMaker execution role you use to run the evaluation has permission to invoke your AWS Lambda function.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"
    }
  ]
}
```

Make sure that the execution role of your AWS Lambda function has the basic Lambda execution permissions, plus any additional permissions that its downstream AWS calls might need.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```
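As an illustration of how the `lambda_grader` skeleton and the output format described earlier might fit together, here is a minimal sketch of a custom scorer. The scoring rule (token overlap with `reference_answer.explanation`), the metric names, and the helper functions `last_assistant_text` and `score_sample` are all hypothetical choices made for this example; they are not part of SageMaker. The handler also accepts either a single sample or a list, which is an assumption rather than documented behavior.

```python
from typing import Any


def last_assistant_text(messages: list[dict[str, Any]]) -> str:
    """Return the content of the final assistant message, or "" if there is none."""
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return str(message.get("content", ""))
    return ""


def score_sample(sample: dict[str, Any]) -> dict[str, Any]:
    """Score one sample with a toy token-overlap rule (hypothetical, illustration only)."""
    answer = last_assistant_text(sample.get("messages", []))

    # Assumption: reference_answer is an object with an "explanation" string,
    # as in the sample payload. Other datasets may use different keys.
    reference = sample.get("reference_answer") or {}
    expected = str(reference.get("explanation", "")) if isinstance(reference, dict) else str(reference)

    answer_tokens = set(answer.lower().split())
    expected_tokens = set(expected.lower().split())

    # Fraction of reference tokens that also appear in the model answer.
    overlap = len(answer_tokens & expected_tokens) / len(expected_tokens) if expected_tokens else 0.0

    return {
        "id": sample.get("id"),            # echo the input id back unchanged
        "aggregate_reward_score": overlap,
        "metrics_list": [
            {"name": "reference_token_overlap", "value": overlap, "type": "Reward"},
            {"name": "response_length_tokens", "value": float(len(answer.split())), "type": "Metric"},
        ],
    }


def lambda_grader(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Score every sample and return one result dictionary per input sample."""
    return [score_sample(sample) for sample in samples]


def lambda_handler(event, context):
    # Assumption: tolerate a single sample object as well as a list of samples.
    samples = event if isinstance(event, list) else [event]
    return lambda_grader(samples)
```

You can exercise the handler locally by calling `lambda_handler(sample, None)` with the sample input payload before deploying it. A production scorer would replace the overlap rule with task-specific logic, such as an LLM judge or a rubric check.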