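Evaluate with preset and custom scorers

When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also called "reward functions"), Prime Math and Prime Code, both taken from the volcengine/verl RL training library, or your own custom scorer implemented as a Lambda function.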

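Built-in scorers

Prime Math

The Prime Math scorer expects a custom JSONL dataset in which each entry contains a math question as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats described in Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks.

Example dataset entry (expanded for clarity):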


{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
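As a minimal sketch, an entry like the one above could be produced with Python's standard library. The helper names `make_math_entry` and `to_jsonl` and the output file name are hypothetical and not part of SageMaker. Note that JSON does not allow comments, so the `#` annotation in the example above is for illustration only and must not appear in the real file.

```python
import json

def make_math_entry(system: str, query: str, answer: str) -> dict:
    """Build one dataset entry using the system/query/response keys shown above."""
    return {"system": system, "query": query, "response": answer}

def to_jsonl(entries: list[dict]) -> str:
    """Serialize entries as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries) + "\n"

entry = make_math_entry(
    "You are a math expert: ",
    # Raw string so the LaTeX backslash is kept literally; json.dumps escapes it.
    r"How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6}$ have?",
    "2",
)
jsonl_text = to_jsonl([entry])

# Hypothetical output path; uncomment to write the file.
# with open("prime_math_eval.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```

Letting `json.dumps` handle the escaping avoids hand-writing backslashes in LaTeX-heavy questions.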

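Prime Code

The Prime Code scorer expects a custom JSONL dataset in which each entry contains a coding problem and test cases specified in the metadata field. For each entry, structure the test cases with the expected function name, sample inputs, and expected outputs.

Example dataset entry (expanded for clarity):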


{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
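Before submitting a dataset, it may help to check that each entry's metadata has the shape shown above. The following is a sketch only: `validate_code_metadata` is a hypothetical helper, and the rule that `inputs` and `outputs` pair up by position (and so must have equal length) is an assumption drawn from the example, not a documented requirement.

```python
def validate_code_metadata(entry: dict) -> list[str]:
    """Return a list of problems found in the entry's metadata (empty if none)."""
    meta = entry.get("metadata")
    if not isinstance(meta, dict):
        return ["metadata must be an object"]

    problems = []
    fn_name = meta.get("fn_name")
    if not isinstance(fn_name, str) or not fn_name:
        problems.append("metadata.fn_name must be a non-empty string")

    inputs, outputs = meta.get("inputs"), meta.get("outputs")
    if not isinstance(inputs, list) or not isinstance(outputs, list):
        problems.append("metadata.inputs and metadata.outputs must be lists")
    elif len(inputs) != len(outputs):
        # Assumption: test cases are paired by position.
        problems.append("metadata.inputs and metadata.outputs must have the same length")
    return problems

example = {
    "query": "...",
    "response": "",
    "metadata": {"fn_name": "factorialNumbers", "inputs": ["5"], "outputs": ["[1, 2]"]},
}
print(validate_code_metadata(example))
```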

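Custom scorer (bring your own metrics)

Fully customize your model evaluation workflow with custom post-processing logic that computes metrics tailored to your needs. You must implement the custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

Sample Lambda input payload

The custom AWS Lambda function expects input in OpenAI format. For example: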


{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}

Sample Lambda output payload

The SageMaker evaluation container expects the Lambda response to follow this format:


{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
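The format above can be checked locally before a scorer is deployed. The sketch below uses a hypothetical helper, `validate_scorer_result`, which is not part of any AWS SDK. It accepts integers where a float is expected, which is a leniency chosen for this example rather than documented container behavior.

```python
def validate_scorer_result(sample: dict, result: dict) -> list[str]:
    """Return a list of format problems in one scorer result (empty if none)."""
    problems = []

    def is_number(value) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if result.get("id") != sample.get("id"):
        problems.append("id must match the input sample")
    if not is_number(result.get("aggregate_reward_score")):
        problems.append("aggregate_reward_score must be a number")

    # metrics_list is optional, so a missing key is fine.
    for i, metric in enumerate(result.get("metrics_list", [])):
        if not isinstance(metric.get("name"), str):
            problems.append(f"metrics_list[{i}].name must be a string")
        if not is_number(metric.get("value")):
            problems.append(f"metrics_list[{i}].value must be a number")
        if metric.get("type") not in ("Reward", "Metric"):
            problems.append(f'metrics_list[{i}].type must be "Reward" or "Metric"')
    return problems
```

Running a check like this against a few hand-written samples in a unit test can catch a mismatched `id` or a misspelled `type` before an evaluation job is started.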

Custom Lambda definition

For a fully implemented custom scorer example with sample input and expected output, see https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example.

Use the following skeleton as a starting point for your own function.


def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """

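Input and output fields

Input fields

Field	Description	Additional notes
id	Unique identifier for the sample	Echoed back in the output. String format
messages	Ordered chat history in OpenAI format	Array of message objects
messages[].role	Speaker of the message	Common values: "user", "assistant", "system"
messages[].content	Text content of the message	Plain string
metadata	Free-form information to aid grading	Object; optional fields passed from the training data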

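Output fields

Field	Description	Additional notes
id	Same identifier as the input sample	Must match the input
aggregate_reward_score	Overall score for the sample	Float (for example, 0.0 to 1.0, or a task-defined range)
metrics_list	Component scores that make up the aggregate	Array of metric objects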
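Putting the input and output fields together, a filled-in scorer might look like the following sketch. It assumes that the Lambda event is a list of samples and that `reference_answer` is an object with an `explanation` string, as in the sample payload. The word-overlap metric and the helper `last_assistant_text` are illustrative choices, not a recommended scoring method.

```python
def lambda_handler(event, context):
    # Assumption: the event is a list of samples in the input format above.
    return lambda_grader(event)

def last_assistant_text(sample: dict) -> str:
    """Return the content of the last assistant message, or '' if there is none."""
    for message in reversed(sample.get("messages", [])):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""

def lambda_grader(samples: list[dict]) -> list[dict]:
    results = []
    for sample in samples:
        response = last_assistant_text(sample)
        # Assumption: reference_answer carries an "explanation" string.
        reference = sample.get("reference_answer", {}).get("explanation", "")

        # Toy metric: fraction of reference words that also appear in the response.
        ref_words = set(reference.lower().split())
        resp_words = set(response.lower().split())
        overlap = len(ref_words & resp_words) / len(ref_words) if ref_words else 0.0

        results.append({
            "id": sample["id"],  # echo the input id back unchanged
            "aggregate_reward_score": overlap,
            "metrics_list": [
                {"name": "word_overlap", "value": overlap, "type": "Reward"},
                {"name": "response_length", "value": float(len(response)), "type": "Metric"},
            ],
        })
    return results
```

Because the handler is a plain function, it can be exercised locally by calling `lambda_handler(samples, None)` with a hand-built list of samples before it is deployed.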

Required permissions

Make sure that the SageMaker execution role you use to run the evaluation has permission to invoke the AWS Lambda function.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
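If the policy is generated from code, the placeholders in the `Resource` ARN can be filled in programmatically. This is a sketch only: `build_invoke_policy` is a hypothetical helper, and the Region, account ID, and function name below are made-up example values.

```python
import json

def build_invoke_policy(region: str, account_id: str, function_name: str) -> dict:
    """Build the invoke policy shown above for one specific Lambda function."""
    arn = f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": arn,
            }
        ],
    }

# Example values only; replace with your own Region, account ID, and function name.
policy = build_invoke_policy("us-east-1", "111122223333", "my-custom-scorer")
print(json.dumps(policy, indent=4))
```

Scoping `Resource` to a single function ARN, as the policy above does, keeps the role from invoking unrelated functions.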

Make sure that the execution role of the AWS Lambda function has basic Lambda execution permissions, plus any additional permissions required for downstream AWS calls.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
