

This content was produced by machine translation. If there is any discrepancy, inconsistency, or conflict between the translated content and the English version, the English version takes precedence.

# Customizing models with Amazon SageMaker AI
<a name="customize-model"></a>

Amazon SageMaker AI model customization transforms the traditionally complex, time-consuming process of customizing AI models from a months-long effort into a streamlined workflow that can be completed in days. The capability addresses a key challenge faced by AI developers who need to customize foundation models with their own data to create highly differentiated customer experiences. Detailed customization documentation, including step-by-step guides and advanced configuration options, is provided in this SageMaker AI guide. For an overview of Nova model customization, see [Customization and fine-tuning using SageMaker](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-sagemaker.html) in the Amazon Nova User Guide.

The capability includes a comprehensive suite of advanced model customization techniques, with a new guided user interface that understands natural-language requirements and serverless infrastructure that eliminates the operational overhead of managing compute resources. Whether you are building legal research applications, enhancing customer service chatbots, or developing domain-specific AI agents, the capability accelerates the journey from proof of concept to production deployment.

Model customization with Amazon Bedrock Evaluations might securely transfer your data across geographic AWS Regions for processing. For more information, see the [Amazon Bedrock Evaluations documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html).

## Key concepts
<a name="model-customize-concepts"></a>

**Serverless training**

Fully managed compute infrastructure that abstracts away all infrastructure complexity so you can focus on model development. This includes automatic provisioning of GPU instances (P5, P4de, P4d, G5) based on model size and training requirements, pre-optimized training recipes that incorporate best practices for each customization technique, real-time monitoring with live metrics and logs accessible from the UI, and automatic cleanup of resources after training completes to optimize costs.

**Model customization techniques**

A comprehensive set of advanced methods, including supervised fine-tuning (SFT), direct preference optimization (DPO), reinforcement learning with verifiable rewards (RLVR), and reinforcement learning from AI feedback (RLAIF).
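For a sense of what SFT consumes, the following is a minimal sketch of preparing a prompt-response dataset in JSONL form. The field names here are illustrative assumptions, not the exact schema SageMaker AI expects; check the **Required data input format** reference in the UI for the documented format.

```python
import json

# Illustrative prompt-response pairs for supervised fine-tuning (SFT).
# The "prompt"/"response" field names are assumptions for illustration.
examples = [
    {"prompt": "Summarize the indemnification clause in one sentence.",
     "response": "The supplier indemnifies the buyer against third-party IP claims."},
    {"prompt": "Define 'force majeure' in plain language.",
     "response": "An unforeseeable event that prevents a party from fulfilling a contract."},
]

# Write one JSON object per line (JSONL), a common training-data layout.
with open("sft-dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Round-trip check: every line parses back with both fields present.
with open("sft-dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]
```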

**Custom model**

A specialized version of a base foundation model, adapted to a specific use case by training on your own data. The result is an AI model that retains the general capabilities of the original foundation model while adding domain-specific knowledge, terminology, style, and behaviors tailored to your requirements.

**AI model customization assets**

The resources and artifacts used to train, refine, and evaluate custom models during the model customization process. These assets include **datasets**, which are training examples (prompt-response pairs, domain-specific text, or labeled data) used to fine-tune a foundation model to learn specific behaviors, knowledge, or styles, and **evaluators**, which are either ***reward functions*** (code-based logic that scores model outputs against specific criteria, used in RLVR training and custom scorer evaluation) or ***reward prompts*** (natural-language instructions that guide an LLM in judging the quality of model responses, used in RLAIF training and LLM-as-a-judge evaluation).
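As an illustration of a code-based reward function, the following sketch scores a response against a verifiable criterion in the Lambda handler style. The event fields (`response`, `ground_truth`) and the returned shape are assumptions for illustration, not the documented SageMaker AI contract; the UI's format guidance describes the actual requirements.

```python
def lambda_handler(event, context):
    """Score a model response against a verifiable criterion.

    The event fields used here are illustrative assumptions, not the
    documented SageMaker AI reward-function contract.
    """
    response = event.get("response", "")
    ground_truth = event.get("ground_truth", "")
    # A verifiable criterion: exact match after whitespace/case normalization.
    matched = response.strip().lower() == ground_truth.strip().lower()
    return {"score": 1.0 if matched else 0.0}
```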

**Model package group**

A collection container that tracks all models logged from training jobs, providing a centralized location for model versions and their lineage.

**Logged model**

The output created by SageMaker AI when a serverless training job runs. This can be a fine-tuned model (successful jobs), checkpoints (failed jobs with checkpoints), or associated metadata (failed jobs without checkpoints).

**Registered model**

A logged model that has been marked for formal tracking and governance purposes, enabling full lineage and lifecycle management.

**Lineage**

Automatic capture of the relationships between training jobs, input datasets, output models, evaluation jobs, and deployments across SageMaker AI and Amazon Bedrock.

**Cross-account sharing**

The ability to share models, datasets, and evaluators across AWS accounts using AWS Resource Access Manager (RAM) while maintaining full lineage visibility.

# Customizing Amazon Nova models in Amazon SageMaker AI
<a name="nova-model"></a>

This topic has moved. For the latest information, see Customizing Amazon Nova models in Amazon SageMaker AI in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model.html).

# Amazon Nova recipes
<a name="nova-model-recipes"></a>

This topic has moved. For the latest information, see Amazon Nova recipes in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-recipes.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-recipes.html).

# Customizing Amazon Nova on SageMaker training jobs
<a name="nova-model-training-job"></a>

This topic has moved. For the latest information, see Customizing Amazon Nova on SageMaker training jobs in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-training-job.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-training-job.html).

**Topics**
+ [Amazon Nova distillation](nova-distillation.md)
+ [Nova Customization SDK](nova-customization-sdk.md)
+ [Fine-tuning Amazon Nova models using SageMaker training jobs](nova-fine-tuning-training-job.md)
+ [Monitoring progress across iterations](nova-model-monitor.md)
+ [Evaluating your SageMaker AI-trained model](nova-model-evaluation.md)

# Amazon Nova distillation
<a name="nova-distillation"></a>

This topic has moved. For the latest information, see Amazon Nova distillation in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-distillation.html).

# Nova Customization SDK
<a name="nova-customization-sdk"></a>

The Amazon Nova Customization SDK is a comprehensive Python SDK for customizing Amazon Nova models across the entire lifecycle, from training and evaluation to deployment and inference.

This topic has moved. For the latest information, see Nova Customization SDK in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-customization-sdk.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-customization-sdk.html).

# Fine-tuning Amazon Nova models using SageMaker training jobs
<a name="nova-fine-tuning-training-job"></a>

This topic has moved. For the latest information, see Fine-tuning Amazon Nova models using SageMaker training jobs in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-fine-tune-1.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/smtj-training.html).

# Monitoring progress across iterations
<a name="nova-model-monitor"></a>

This topic has moved. For the latest information, see Monitoring progress across iterations in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-monitor.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-monitor.html).

# Evaluating your SageMaker AI-trained model
<a name="nova-model-evaluation"></a>

This topic has moved. For the latest information, see Evaluating your SageMaker AI-trained model in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-evaluation.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-evaluation.html).

# Customizing Amazon Nova in Amazon SageMaker HyperPod
<a name="nova-hp"></a>

This topic has moved. For the latest information, see Customizing Amazon Nova in Amazon SageMaker HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

# Nova Customization SDK
<a name="nova-hp-customization-sdk"></a>

This topic has moved. For the latest information, see Nova Customization SDK in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-customization-sdk.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-customization-sdk.html).

# SageMaker HyperPod essential commands guide
<a name="nova-hp-essential-commands-guide"></a>

This topic has moved. For the latest information, see SageMaker HyperPod essential commands guide in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-essential-commands-guide.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-essential-commands-guide.html).

# Creating a HyperPod EKS cluster with a restricted instance group (RIG)
<a name="nova-hp-cluster"></a>

This topic has moved. For the latest information, see Creating a SageMaker HyperPod EKS cluster with a RIG in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-cluster.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html).

# Nova Forge access and setup for SageMaker AI HyperPod
<a name="nova-forge-hp-access"></a>

This topic has moved. For the latest information, see Nova Forge access and setup for SageMaker AI HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-forge-hp-access.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-forge-hp-access.html).

# Training Amazon Nova models
<a name="nova-hp-training"></a>

This topic has moved. For the latest information, see Training Amazon Nova models in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-training.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-training.html).

# Fine-tuning Amazon Nova models in Amazon SageMaker HyperPod
<a name="nova-hp-fine-tune"></a>

This topic has moved. For the latest information, see Fine-tuning Amazon Nova models in Amazon SageMaker HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-fine-tune.html).

# Evaluating your trained models
<a name="nova-hp-evaluate"></a>

This topic has moved. For the latest information, see Evaluating your trained models in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-evaluate.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-evaluate.html).

# Iterative training
<a name="nova-iterative-training"></a>

Iterative training lets you improve model performance through multiple training cycles, systematically addressing failure modes by building on previous checkpoints and adapting to evolving requirements.

This topic has moved. For the latest information, see Iterative training in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-iterative-training.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-iterative-training.html).

# Amazon Bedrock inference
<a name="nova-model-bedrock-inference"></a>

This topic has moved. For the latest information, see Amazon Bedrock inference in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-bedrock-inference.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-bedrock-inference.html).

# Limitations when customizing Amazon Nova models
<a name="nova-model-limitations"></a>

This topic has moved. For the latest information, see Limitations when customizing Amazon Nova models in the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-limitations.html).

# Customizing open weight models
<a name="model-customize-open-weight"></a>

This section walks you through the process of getting started with open weight model customization.

**Topics**
+ [Prerequisites](model-customize-open-weight-prereq.md)
+ [Creating assets for model customization in the UI](model-customize-open-weight-create-assets-ui.md)
+ [Submitting an AI model customization job](model-customize-open-weight-job.md)
+ [Submitting a model evaluation job](model-customize-open-weight-evaluation.md)
+ [Deploying your model](model-customize-open-weight-deployment.md)
+ [Sample datasets and evaluators](model-customize-open-weight-samples.md)

# Prerequisites
<a name="model-customize-open-weight-prereq"></a>

Before you begin, complete the following prerequisites:
+ Onboard to a SageMaker AI domain with permissions to access Studio. If you don't have permissions to set Studio as the default experience for your domain, contact your administrator. For more information, see [Amazon SageMaker domain overview](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ Run `aws configure` on your local machine and enter your AWS credentials. For more information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).

## Required IAM permissions
<a name="model-customize-open-weight-iam"></a>

SageMaker AI model customization requires you to add the appropriate permissions to your SageMaker AI domain execution role. To do this, create an inline IAM permissions policy and attach it to the IAM role. For details on adding policies, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "AllowNonAdminStudioActions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:DescribeUserProfile",
                "sagemaker:DescribeSpace",
                "sagemaker:ListSpaces",
                "sagemaker:DescribeApp",
                "sagemaker:ListApps"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:app/*",
                "arn:aws:sagemaker:*:*:space/*"
             ]
        },
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "SageMakerPublicHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubContents"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:aws:hub/SageMakerPublicHub"
            ]
        },
        {
            "Sid": "SageMakerHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubs",
                "sagemaker:ListHubContents",
                "sagemaker:DescribeHubContent",
                "sagemaker:DeleteHubContent",
                "sagemaker:ListHubContentVersions",
                "sagemaker:Search"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "JumpStartAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "ListMLFlowOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListMlflowApps",
                "sagemaker:ListMlflowTrackingServers"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "MLFlowAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateMlflowApp",
                "sagemaker:DescribeMlflowApp",
                "sagemaker:CreatePresignedMlflowAppUrl",
                "sagemaker:CallMlflowAppApi",
                "sagemaker-mlflow:*"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:mlflow-app/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BYODataSetS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Sid": "AllowHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ImportHubContent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForSageMaker",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForAWSLambda",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "lambda.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "bedrock.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TrainingJobRun",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:ListTrainingJobs"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:training-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ModelPackageAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModelPackage",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:ListModelPackageGroups",
                "sagemaker:CreateModel"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TagsPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:ListTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:model/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:action/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LogAccess",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group*",
                "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs:log-stream:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob", 
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "SageMakerPipelinesAndLineage",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListActions",
                "sagemaker:ListArtifacts",
                "sagemaker:QueryLineage",
                "sagemaker:ListAssociations",
                "sagemaker:AddAssociation",
                "sagemaker:DescribeAction",
                "sagemaker:AddAssociation",
                "sagemaker:CreateAction",
                "sagemaker:CreateContext",
                "sagemaker:DescribeTrialComponent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:artifact/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:experiment-trial-component/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ListOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListInferenceComponents",
                "sagemaker:ListWorkforces"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerInference",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeInferenceComponent",
                "sagemaker:CreateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:ListEndpoints"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerPipelines",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribePipelineExecution",
                "sagemaker:ListPipelineExecutions",
                "sagemaker:ListPipelineExecutionSteps",
                "sagemaker:CreatePipeline",
                "sagemaker:UpdatePipeline",
                "sagemaker:StartPipelineExecution"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        }
    ]
}
```

If you have the [AmazonSageMakerFullAccessPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) attached to your execution role, you can add this reduced policy.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "PassRoleForSageMakerAndLambdaAndBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": { 
                "StringEquals": { 
                    "iam:PassedToService": [ 
                        "lambda.amazonaws.com", 
                        "bedrock.amazonaws.com"
                     ],
                     "aws:ResourceAccount": "${aws:PrincipalAccount}" 
                 } 
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob",
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

Next, choose **Edit trust policy**, replace the contents with the following policy, and then choose **Update policy**.

```
{
    "Version": "2012-10-17",		 	 	                    
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                 "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                   "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                  "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

# Creating assets for model customization in the UI
<a name="model-customize-open-weight-create-assets-ui"></a>

You can create and manage the dataset and evaluator assets that are available for model customization in the UI.

## Assets
<a name="model-customize-open-weight-assets"></a>

In the Amazon SageMaker Studio UI, choose **Assets** in the left panel, and then choose **Datasets**.

![\[モデルのカスタマイズへのアクセスを含むイメージ。\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-16.png)


Choose **Upload Dataset** to add a dataset for use in model customization jobs. Choose **Required data input format** to access a reference for the dataset format to use.

![\[Image showing access to model customization.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-15.png)
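The required format depends on the customization technique. As a sketch, a supervised fine-tuning (SFT) dataset is commonly a JSON Lines file of prompt/response pairs; the field names below are illustrative assumptions, so follow the **Required data input format** reference in the UI for the exact schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical SFT training examples as prompt/response pairs.
examples = [
    {"prompt": "Summarize: The quarterly report shows revenue grew 12%.",
     "response": "Revenue grew 12% this quarter."},
    {"prompt": "Translate to French: Good morning.",
     "response": "Bonjour."},
]

# Write one JSON object per line (JSON Lines), the shape expected for dataset uploads.
path = Path(tempfile.gettempdir()) / "sft-dataset-sample.jsonl"
with path.open("w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read it back to confirm every line is valid JSON.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))
```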


## Evaluators
<a name="model-customize-open-weight-evaluators"></a>

You can also add **reward functions** and **reward prompts** to reinforcement learning customization jobs.

![\[Image showing access to model customization.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-14.png)


The UI also provides guidance on the format required for reward functions and reward prompts.

![\[Image showing access to model customization.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-13.png)
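As an illustration of what a reward function can look like, the following is a minimal Lambda-style handler sketch. The exact event and return contract is defined by the format guidance in the UI; the `response` and `score` field names here are assumptions for illustration only.

```python
# A minimal reward-function sketch in AWS Lambda handler style.
# The event/return field names ("response", "score") are illustrative assumptions;
# follow the UI's format guidance for the real contract.
def lambda_handler(event, context):
    response_text = event.get("response", "")
    # Toy criteria: reward responses that end with terminal punctuation
    # and stay within a length budget.
    score = 0.0
    if response_text.strip().endswith((".", "!", "?")):
        score += 0.5
    if len(response_text.split()) <= 100:
        score += 0.5
    return {"score": score}

print(lambda_handler({"response": "This is a concise, complete answer."}, None))
```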


## Assets for model customization using the AWS SDK
<a name="model-customize-open-weight-create-assets-sdk"></a>

You can also create assets using the SageMaker AI Python SDK. See the following sample code snippets.

```
from pprint import pprint

from sagemaker.ai_registry.air_constants import REWARD_FUNCTION, REWARD_PROMPT
from sagemaker.ai_registry.dataset import DataSet, CustomizationTechnique
from sagemaker.ai_registry.evaluator import Evaluator

# Creating a dataset example
dataset = DataSet.create(
    name="sdkv3-gen-ds2",
    source="s3://sample-test-bucket/datasets/training-data/jamjee-sft-ds1.jsonl",  # or use a local file path as the source
    customization_technique=CustomizationTechnique.SFT
)

# Refresh status from the hub
dataset.refresh()
pprint(dataset.__dict__)

# Creating an evaluator. Method: Lambda function ARN
evaluator = Evaluator.create(
    name="sdk-new-rf11",
    source="arn:aws:lambda:us-west-2:<>:function:<function-name>",
    type=REWARD_FUNCTION
)

# Creating an evaluator. Method: bring your own code
evaluator = Evaluator.create(
    name="eval-lambda-test",
    source="/path_to_local/eval_lambda_1.py",
    type=REWARD_FUNCTION
)

# Optional wait; wait=True by default during the create call.
evaluator.wait()

evaluator.refresh()
pprint(evaluator)
```

# Submitting an AI model customization job
<a name="model-customize-open-weight-job"></a>

You can access the SageMaker AI model customization feature from the Models page in the left panel of Amazon SageMaker Studio. There is also an Assets page where you can create and manage model customization datasets and evaluators.

![\[Image showing access to model customization.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-12.png)


To start submitting a model customization job, choose the model option to access the JumpStart base models tab.

![\[Image showing how to choose a base model.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-11.png)


You can choose Customize model directly on a model card, or search for the model from Meta that you want to customize.

![\[Image showing model cards and how to choose a model to customize.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-10.png)


Choosing a model card takes you to the model details page, where you can choose Customize model to launch a customization job, and choose Customize in UI to start configuring an RLVR job.

![\[Image showing how to launch a customization job.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-9.png)


You can then enter a custom model name, choose the model customization technique to use, and configure the job hyperparameters.

![\[Image showing the selection of a model customization technique.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-8.png)


![\[Image showing the selection of a model customization technique.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-7.png)


## Submitting an AI model customization job using the SDK
<a name="model-customize-open-weight-job-sdk"></a>

You can also submit model customization jobs using the SageMaker AI Python SDK.

```
# Submit a DPO model customization job

from sagemaker.modules.train.dpo_trainer import DPOTrainer
from sagemaker.modules.train.common import TrainingType

trainer = DPOTrainer(
    model=BASE_MODEL,
    training_type=TrainingType.LORA,
    model_package_group_name=MODEL_PACKAGE_GROUP_NAME,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    sagemaker_session=sagemaker_session,
    role=ROLE_ARN
)
```

## Monitoring the customization job
<a name="model-customize-open-weight-monitor"></a>

As soon as you submit a job, you are redirected to the model customization training jobs page.

![\[Image showing the selection of a model customization technique.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-6.png)


When the job completes, choose the Go to custom model button in the upper-right corner to navigate to the **custom model** details page.

![\[Image showing the selection of a model customization technique.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-5.png)


From the custom model details page, you can interact further with your custom model in the following ways:

1. Review information about performance, generated artifact locations, training configuration hyperparameters, and training logs.

1. Launch evaluation jobs on other datasets (continuous customization).

1. Deploy the model using a SageMaker AI inference endpoint or Amazon Bedrock Custom Model Import.  
![\[Image showing the selection of a model customization technique.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-4.png)

# Submitting a model evaluation job
<a name="model-customize-open-weight-evaluation"></a>

This section covers open-weight custom model evaluation. It starts with a walkthrough of the evaluation job submission process. Additional resources are available for more advanced evaluation job submission use cases.

**Topics**
+ [Getting started](model-customize-evaluation-getting-started.md)
+ [Evaluation types and job submission](model-customize-evaluation-types.md)
+ [Evaluation metric formats](model-customize-evaluation-metrics-formats.md)
+ [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md)
+ [Evaluate with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md)

# Getting started
<a name="model-customize-evaluation-getting-started"></a>

## Submit an evaluation job through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to evaluation from the model card
<a name="model-customize-evaluation-studio-step1"></a>

After you customize a model, navigate to the evaluation page from its model card.

For more information about open-weight custom model training, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html).

SageMaker displays your customized models on the My Models tab.

![\[Registered model card page\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, and then choose Evaluate.

![\[Model customization page\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit the evaluation job
<a name="model-customize-evaluation-studio-step2"></a>

Choose the Submit button to submit the evaluation job. This submits a minimal MMLU benchmark job.

For supported evaluation job types, see [Evaluation types and job submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track evaluation job progress
<a name="model-customize-evaluation-studio-step3"></a>

The progress of the evaluation job is tracked on the evaluation steps tab.

![\[Evaluation job progress\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View evaluation job results
<a name="model-customize-evaluation-studio-step4"></a>

The results of the evaluation job are visualized on the evaluation results tab.

![\[Evaluation job metrics\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View completed evaluations
<a name="model-customize-evaluation-studio-step5"></a>

Completed evaluation jobs appear under Evaluations on the model card.

![\[Completed evaluation jobs\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit an evaluation job through the SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create a BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Initialize a `BenchMarkEvaluator` by passing your registered trained model, an Amazon S3 output location, and an MLflow resource ARN.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark

evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)
```

### Step 2: Submit the evaluation job
<a name="model-customize-evaluation-sdk-step2"></a>

Call the `evaluate()` method to submit the evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track evaluation job progress
<a name="model-customize-evaluation-sdk-step3"></a>

Call the execution's `wait()` method to get live updates on the evaluation job's progress.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View evaluation job results
<a name="model-customize-evaluation-sdk-step4"></a>

Call the `show_results()` method to view the results of the evaluation job.

```
execution.show_results()
```

# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the benchmark evaluation type to assess model quality across standardized benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Custom dataset supported | Modality | Description | Metrics | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Massive multitask language understanding – tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU – professional subset – focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks – a collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | General physics question answering – assesses understanding of physics concepts and related problem-solving ability. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical problem solving – measures mathematical reasoning across multiple topics, including algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-control task – tests a model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation – assesses how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information about BYOD formats, see [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

### Available subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following are the subtasks available for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), StrongReject, and MATH. These subtasks let you assess model performance on specific capabilities and knowledge areas.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**Math subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", 
    "gcg_transfer_universal_attacks",
    "combination_3", 
    "combination_2", 
    "few_shot_json", 
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title", 
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only", 
    "evil_confidant", 
    "aim", 
    "rot_13",
    "disemvowel", 
    "auto_obfuscation", 
    "auto_payload_splitting", 
    "pair",
    "pap_authority_endorsement", 
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement", 
    "pap_logical_appeal", 
    "pap_misrepresentation"
]
```

### Submit a benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[Minimal benchmark configuration through SageMaker Studio\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs with the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

## Large Language Model as a Judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to assess your target model's responses with another frontier model. You can use Amazon Bedrock models as judges by calling the `create_evaluation_job` API to launch an evaluation job.

For more information about supported judge models, see [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).

You can define your evaluation using two different metric formats:
+ **Built-in metrics:** Use the Amazon Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html).
+ **Custom metrics:** Define your own custom metrics in the Bedrock Evaluations custom metric format to analyze the quality of your model's inference responses using your own instructions. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html).

### Submit a built-in metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Minimal LLMAJ benchmark configuration through SageMaker Studio\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs with the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

### Submit a custom metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define one or more custom metrics:

```
custom_metric_dict = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```

For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html).

------
#### [ SageMaker Studio ]

![\[Upload custom metrics via Custom metrics > Add custom metric\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

## Custom scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer function and launch an evaluation job. The system provides two built-in scorers, Prime Math and Prime Code. You can also use your own scorer function: either copy the scorer function code directly, or bring your own Lambda function definition by its associated ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements/contracts, see [Evaluate with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md).
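As background on one of the standard metrics mentioned above, token-overlap F1 is commonly computed as follows. This is an illustrative sketch of the conventional definition, not necessarily the exact implementation the scorers use:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 as conventionally used in QA evaluation."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    if not pred_tokens or not gt_tokens:
        # Both empty -> perfect match; one empty -> no match.
        return float(pred_tokens == gt_tokens)
    # Count overlapping tokens (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the sky is blue", "the sky is bright blue"), 3))
```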

### Register a dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Bring your own dataset for custom scorers by registering it as a SageMaker hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset using the dedicated datasets page.

![\[Evaluation dataset registered in SageMaker Studio\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

With the SageMaker Python SDK, register your dataset as follows:

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Choose between code execution or math answers for built-in custom scoring\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Choose either `BuiltInMetric.PRIME_MATH` or `BuiltInMetric.PRIME_CODE` for built-in scoring.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define a custom reward function. For more information, see [Custom scorers (bring your own metrics)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom).

**Register a custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit a custom scorer evaluation job referencing the registered preset reward function via Custom scorer > Custom metrics\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)
```

```
from sagemaker.train.evaluate import CustomScorerEvaluator

custom_scorer_evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = custom_scorer_evaluator.evaluate()
```

------

# Evaluation metric formats
<a name="model-customize-evaluation-metrics-formats"></a>

Evaluate model quality with the following metric formats:
+ Model evaluation summary
+ MLflow
+ TensorBoard

## Model evaluation summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit an evaluation job, you provide an Amazon S3 output location. SageMaker automatically uploads an evaluation summary `.json` file to that location. The S3 path of the benchmark summary is:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```

**Pass the Amazon S3 output location**

------
#### [ SageMaker Studio ]

![\[Pass the output artifact location (Amazon S3 URI)\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the summary `.json` directly from the Amazon S3 location, or view it automatically visualized in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
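The summary file can be read directly once downloaded. The following sketch flattens a summary like the one above into (task, metric, value, stderr) rows for reporting, assuming the structure shown:

```python
import json

# Sample mirroring the structure of the evaluation summary shown above.
summary = json.loads("""
{
  "results": {
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584
    }
  }
}
""")

# Pair each metric with its standard error, skipping the *_stderr keys themselves.
rows = []
for task, metrics in summary["results"].items():
    for name, value in metrics.items():
        if name.endswith("_stderr"):
            continue
        rows.append((task, name, value, metrics.get(name + "_stderr")))

for task, name, value, stderr in rows:
    print(f"{task} {name}: {value:.4f} +/- {stderr:.4f}")
```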

![\[Sample performance metrics for a custom gen-qa benchmark visualized in SageMaker Studio\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLflow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide a SageMaker MLflow resource ARN**

When you use the model customization feature for the first time, SageMaker Studio uses the default MLflow app provisioned in each Studio domain. SageMaker Studio uses the ARN associated with the default MLflow app for evaluation job submission.

You can also submit an evaluation job with an explicitly specified MLflow resource ARN to stream metrics to the associated tracking server or app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Model-level and system-level metric visualizations:

![\[Example model-level errors and accuracy for an MMLU benchmark task\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Sample built-in metrics for an LLMAJ benchmark task\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Example system-level metrics for an MMLU benchmark task\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Submit an evaluation job with an Amazon S3 output location. SageMaker automatically uploads the TensorBoard files to Amazon S3 at the following location:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pass the Amazon S3 location as follows:**

------
#### [ SageMaker Studio ]

![\[Pass the output artifact location (Amazon S3 URI)\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Sample model-level metrics**

![\[SageMaker TensorBoard showing benchmark job results\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks
<a name="model-customize-evaluation-dataset-formats"></a>

The Custom Scorer and LLM-as-a-judge evaluation types require a custom dataset JSONL file located in Amazon S3. You must provide the file as a JSON Lines file conforming to one of the following supported formats. The examples in this document are expanded for readability.

Each format has its own nuances, but at a minimum all of them require a user prompt.


**Required fields**  

| Field | Required | 
| --- | --- | 
| User prompt | Yes | 
| System prompt | No | 
| Ground truth | Custom scorer only | 
| Category | No | 
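The minimum-field rule above can be sketched as a small helper that pulls the user prompt out of any of the supported shapes. This is illustrative only; the evaluation service performs its own parsing:

```python
def extract_user_prompt(entry: dict):
    """Return the user prompt from a BYOD entry in any supported shape.

    Handles the OpenAI/conversational message-list shapes, the SageMaker
    evaluation "query" field, and the HuggingFace "prompt" string.
    """
    # Message-list shapes: OpenAI "messages", HF conversational "prompt".
    for key in ("messages", "prompt"):
        value = entry.get(key)
        if isinstance(value, list):
            for message in value:
                if message.get("role") == "user":
                    return message["content"]
    # SageMaker evaluation shape.
    if isinstance(entry.get("query"), str):
        return entry["query"]
    # HF standard shape: "prompt" as a plain string.
    if isinstance(entry.get("prompt"), str):
        return entry["prompt"]
    return None

print(extract_user_prompt({"query": "What ends a question?"}))
print(extract_user_prompt({"messages": [{"role": "user", "content": "Hello!"}]}))
```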

**1. OpenAI format**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```

**2. SageMaker evaluation format**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace prompt completion**

Both the standard and conversational formats are supported.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace preference**

Supports both the standard format (strings) and the conversational format (message arrays).

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```

**5. VERL format**

The VERL format (both the current and legacy formats) is supported for reinforcement learning use cases. VERL documentation for reference: [https://verl.readthedocs.io/en/latest/preparation/prepare_data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

VERL-format users typically do not provide a ground-truth response. If you do provide one, use either the `extra_info.answer` field or `reward_model.ground_truth`; `extra_info` takes precedence.

SageMaker preserves the following VERL-specific fields as metadata when present:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

# Evaluating with preset and custom scorers
<a name="model-customize-evaluation-preset-custom-scorers"></a>

When you use the custom-scorer evaluation type, SageMaker evaluation supports two built-in scorers (also known as "reward functions") taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, Prime Math and Prime Code, as well as your own custom scorer implemented as a Lambda function.

## Built-in scorers
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

The Prime Math scorer expects a custom JSONL dataset whose entries contain math questions as the prompt/query and the correct answers as the ground truth. The dataset can be in any of the supported formats described in [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

Example dataset entry (expanded for readability):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```
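
A dataset in this shape is plain JSONL: one compact JSON object per line. The following sketch shows how such a file can be produced (the file name and entries are illustrative):

```python
import json

# Write Prime Math-style entries to a JSONL file, one JSON object per line.
# The field names mirror the sample entry above; the file name is arbitrary.
entries = [
    {
        "system": "You are a math expert: ",
        "query": "How many vertical asymptotes does the graph of "
                 "$y=\\frac{2}{x^2+x-6}$ have?",
        "response": "2",  # ground truth (the correct answer)
    }
]

with open("prime_math_dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```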

**Prime Code**

The Prime Code scorer expects a custom JSONL dataset whose entries contain coding problems, with test cases specified in the `metadata` field. Structure each entry's test cases with the expected function name, sample inputs, and expected outputs.

Example dataset entry (expanded for readability):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```
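
To illustrate how a test case defined in the `metadata` field is exercised, the sketch below looks up a candidate function by the name given in `fn_name`, calls it with each input, and compares the result against the expected outputs. This is a simplification for illustration only; the actual scorer evaluates the model-generated code itself:

```python
import json

# Simplified illustration of a Prime Code-style check: call the function
# named in metadata["fn_name"] with each input and compare the result with
# the expected output (expected outputs are encoded as strings).
def factorialNumbers(N):
    # Example candidate solution for the problem above.
    result, fact, i = [], 1, 1
    while fact <= N:
        result.append(fact)
        i += 1
        fact *= i
    return result

metadata = {"fn_name": "factorialNumbers", "inputs": ["5"], "outputs": ["[1, 2]"]}

fn = globals()[metadata["fn_name"]]
passed = all(
    fn(int(inp)) == json.loads(expected)
    for inp, expected in zip(metadata["inputs"], metadata["outputs"])
)
```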

## Custom scorers (bring your own metrics)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Fully customize your model evaluation workflow with custom post-processing logic that computes metrics tailored to your needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Sample Lambda input payload
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

The custom AWS Lambda function expects input in the OpenAI format. For example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Sample Lambda output payload
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

The SageMaker evaluation container expects the Lambda response to follow this format:

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### Custom Lambda definition
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

For a fully implemented example of a custom scorer with sample input and expected output, see [https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example).

Use the following skeleton as a starting point for your own function:

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```
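
As a minimal, hedged illustration of this skeleton, the grader below rewards a response when it contains the expected compliance label as a whole word. The matching heuristic is an assumption based on the sample payloads above, not the official example:

```python
import re

# Minimal illustrative grader (an assumption, not the official sample): reward
# 1.0 when the assistant's response contains the expected compliance label
# from reference_answer["compliant"] as a whole word, else 0.0.
def lambda_grader(samples):
    results = []
    for sample in samples:
        # The last assistant message is the model response being graded.
        response = next(
            (m["content"] for m in reversed(sample.get("messages", []))
             if m.get("role") == "assistant"),
            "",
        )
        expected = sample.get("reference_answer", {}).get("compliant", "")
        matched = bool(expected) and bool(
            re.search(r"\b" + re.escape(expected) + r"\b", response, re.IGNORECASE)
        )
        score = 1.0 if matched else 0.0
        results.append({
            "id": sample["id"],
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "compliance_match", "value": score, "type": "Reward"}
            ],
        })
    return results

def lambda_handler(event, context):
    # Assumes the evaluation container passes a list of samples as the event.
    return lambda_grader(event)
```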

### Input and output fields
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**Input fields**


| Field | Description | Notes | 
| --- | --- | --- | 
| id | Unique identifier for the sample | Echoed back in the output; string format | 
| messages | Ordered chat history in OpenAI format | Array of message objects | 
| messages[].role | Speaker of the message | Common values: "user", "assistant", "system" | 
| messages[].content | Text content of the message | Plain string | 
| metadata | Free-form information useful for grading | Object; optional field passed through from the training data | 

**Output fields**

| Field | Description | Notes | 
| --- | --- | --- | 
| id | Same identifier as the input sample | Must match the input | 
| aggregate_reward_score | Overall score for the sample | Float (for example, 0.0–1.0 or a task-defined range) | 
| metrics_list | Component scores that make up the aggregate | Array of metric objects | 

### Required permissions
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

Make sure that the SageMaker execution role used to run the evaluation has permission to invoke your AWS Lambda function:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

Also make sure that the AWS Lambda function's execution role has basic Lambda execution permissions, plus any additional permissions required for downstream AWS calls:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

# Deploying the model
<a name="model-customize-open-weight-deployment"></a>

From the custom model's details page, you can also deploy the custom model using a SageMaker AI inference endpoint or Amazon Bedrock.

![\[Image showing the model customization technique selection.\]](http://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/images/screenshot-open-model-1.png)


# Sample datasets and evaluators
<a name="model-customize-open-weight-samples"></a>

## Supervised fine-tuning (SFT)
<a name="model-customize-open-weight-samples-sft"></a>
+ Name: TAT-QA
+ License: CC-BY-4.0
+ Link: [https://huggingface.co/datasets/next-tat/TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
+ Preprocessing: formatting

**One sample**

```
{
    "prompt": "Given a table and relevant text descriptions, answer the following question.\n\nTable:\n<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td></td>\n      <td>2019</td>\n      <td>2018</td>\n    </tr>\n    <tr>\n      <td></td>\n      <td>$'000</td>\n      <td>$'000</td>\n    </tr>\n    <tr>\n      <td>Revenue from external customers</td>\n      <td></td>\n      <td></td>\n    </tr>\n    <tr>\n      <td>Australia</td>\n      <td>144,621</td>\n      <td>129,431</td>\n    </tr>\n    <tr>\n      <td>New Zealand</td>\n      <td>13,036</td>\n      <td>8,912</td>\n    </tr>\n    <tr>\n      <td>Total</td>\n      <td>157,657</td>\n      <td>138,343</td>\n    </tr>\n  </tbody>\n</table>\n\nParagraphs:\n    4. SEGMENT INFORMATION\n\n    During the 2019 and 2018 financial years, the Group operated wholly within one business segment being the operation and management of storage centres in Australia and New Zealand.\n\n    The Managing Director is the Group\u2019s chief operating decision maker and monitors the operating results on a portfolio wide basis. Monthly management reports are evaluated based upon the overall performance of NSR consistent with the presentation within the consolidated financial statements. The Group\u2019s financing (including finance costs and finance income) are managed on a Group basis and not allocated to operating segments.\n\n    The operating results presented in the statement of profit or loss represent the same segment information as reported in internal management information.\n\n    The revenue information above excludes interest income and is based on the location of storage centres.\n\nQuestion: How many business segments were present in 2019 and 2018?\n\nAnswer:",
    "completion": "one",
    "data_idx": "2951"
}
```

## Direct Preference Optimization (DPO)
<a name="model-customize-open-weight-samples-dpo"></a>
+ Name: UltraFeedback
+ License: MIT
+ Link: [https://huggingface.co/datasets/openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
+ Preprocessing steps: formatting and subsampling

**One sample**

```
{
    "source": "evol_instruct",
    "prompt": "Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:\n#include <iostream>\n#include <string>\nusing namespace std;\nint main() {\n    string country;\n    // prompt user for input\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    // check if country borders the Mediterranean Sea\n    // [C++ code]\n    return 0;\n}",
    "chosen": "Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:\n\n#include <iostream>\n#include <string>\n#include <set>\n#include <map>\n#include <algorithm>\n\nusing namespace std;\n\nint main() {\n    // store countries and their bordering seas in a map\n    map<string, set<string>> countries;\n    countries[\"Algeria\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"France\"] = {\"Mediterranean Sea\", \"English Channel\"};\n    countries[\"Italy\"] = {\"Mediterranean Sea\", \"Adriatic Sea\"};\n    countries[\"Malta\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"Spain\"] = {\"Mediterranean Sea\", \"Atlantic Ocean\"};\n    countries[\"Tunisia\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n\n    // check if country is in the map\n    if (countries.find(country)!= countries.end()) {\n        // check if country borders the Mediterranean Sea\n        set<string> borders = countries.at(country);\n        if (borders.find(\"Mediterranean Sea\")!= borders.end()) {\n            cout << country << \" borders the Mediterranean Sea.\" << endl;\n        } else {\n            cout << country << \" does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << country << \" is not a recognized country.\" << endl;\n    }\n\n    return 0;\n}\n\nIn this program, we first define a map called countries that stores each country's name as a key and a set of its surrounding seas as the value. We use a set to store each sea name so that we can easily check if a country borders a certain sea.\n\nWe then present the user with a prompt to enter the name of a country. We use a while loop to ensure that the user enters a valid country name. We check if the country is in the map, and if so, we retrieve its bordering seas from the map.\n\nFinally, we check if the Mediterranean Sea is among the set of borders. If it is, we inform the user that the country borders the Mediterranean Sea. If not, we inform the user that the country does not border the Mediterranean Sea.",
    "chosen-rating": 5.0,
    "chosen-model": "starchat",
    "rejected": "Sure, here is the program using the C++11 algorithm \"cds::algorithm::GreaterEqual\":\n#include <iostream>\n#include <string>\n#include <algorithm>\n#include <vector>\n#include <cctype>\n\nusing namespace std;\n\nint main() {\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    std::vector<string> vec;\n    vec.push_back(country);\n    size_t index = std::find_if(vec.begin(), vec.end(), [](const string& s) {\n        return std::any_of(s.begin(), s.end(), [](const char& c) {\n            return c == '}}';\n    });\n    if (index != vec.end()) {\n        if ((*index)[0] == 'M') {\n            cout << \"Country is bordered by the Mediterranean Sea.\" << endl;\n        } else {\n            cout << \"Country does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << \"Country is not found.\" << endl;\n    }\n    return 0;\n}",
    "rejected-rating": 1.25,
    "rejected-model": "pythia-12b"
}
```

## Reinforcement Learning from AI Feedback (RLAIF) - pairwise judging
<a name="model-customize-open-weight-samples-rlaif"></a>

**Input dataset**

Source dataset: [https://github.com/WeOpenML/PandaLM](https://github.com/WeOpenML/PandaLM)

**One sample**

```
{
    "data_source": "WeOpenML/PandaLM",
    "prompt": [
        {
            "role": "user",
            "content": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n"
        }
    ],
    "ability": "pairwise-judging",
    "reward_model": {
        "style": "llmj",
        "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "raw_output_sequence": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n",
        "llmj": {
            "question": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n",
            "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.",
            "document_in_context": null
        },
        "sample_size": 1980
    }
}
```

## RLAIF - chain of thought
<a name="model-customize-open-weight-samples-rlaif2"></a>

**Input dataset**

Source data: [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**One sample**

```
{
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "system",
            "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n"
        },
        {
            "role": "user",
            "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?"
        }
    ],
    "ability": "chain-of-thought",
    "reward_model": {
        "style": "llmj-cot",
        "ground_truth": "Thus, there were 36 - 12 - 9 = <<36-12-9=15>>15 sales in the stationery section."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "question": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "short_answer": "15",
        "model_output": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>"
    }
}
```

## RLAIF - faithfulness
<a name="model-customize-open-weight-samples-rlaif3"></a>

**Input dataset**

Source: [https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/squad_v2/train-00000-of-00001.parquet](https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/squad_v2/train-00000-of-00001.parquet)

**One sample**

```
{
    "data_source": "squad_v2",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that answers questions based on the provided context. Only use information from the context."
        },
        {
            "role": "user",
            "content": "Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\".\n\nQuestion: When did Beyonce start becoming popular?"
        }
    ],
    "ability": "faithfulness",
    "reward_model": {
        "style": "llmj-faithfulness",
        "ground_truth": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\"."
    },
    "extra_info": {
        "question": "When did Beyonce start becoming popular?",
        "split": "train",
        "index": 0
    }
}
```

## RLAIF - summarization
<a name="model-customize-open-weight-samples-rlaif4"></a>

**Input dataset**

Source: cleaned gsm8k dataset [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**One sample**

```
{
    "data_source": "cnn_dailymail",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that creates concise, accurate summaries of news articles. Focus on the key facts and main points."
        },
        {
            "role": "user",
            "content": "Summarize the following article:\n\nLONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed."
        }
    ],
    "ability": "summarization",
    "reward_model": {
        "style": "llmj-summarization",
        "ground_truth": "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."
    },
    "extra_info": {
        "question": "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.",
        "split": "train",
        "index": 0,
        "source_id": "42c027e4ff9730fbb3de84c1af0d2c50"
    }
}
```

## RLAIF - カスタムプロンプト
<a name="model-customize-open-weight-samples-rlaif5"></a>

この例では、 [RLAIF - 思考の連鎖](#model-customize-open-weight-samples-rlaif2)を使用して、カスタム jinja プロンプトがプリセットプロンプトの 1 つを置き換える方法について説明します。

**以下は、CoT のカスタムプロンプトの例です。**

```
You are an expert logical reasoning evaluator specializing in Chain-of-Thought (CoT) analysis. 

Given: A problem prompt and a model's reasoning-based response. 

Goal: Assess the quality and structure of logical reasoning, especially for specialized domains (law, medicine, finance, etc.).

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Structural Completeness (0.3 max)
- Clear problem statement: +0.05
- Defined variables/terminology: +0.05
- Organized given information: +0.05
- Explicit proof target: +0.05
- Step-by-step reasoning: +0.05
- Clear conclusion: +0.05

Logical Quality (0.4 max)
- Valid logical flow: +0.1
- Proper use of if-then relationships: +0.1
- Correct application of domain principles: +0.1
- No logical fallacies: +0.1

Technical Accuracy (0.3 max)
- Correct use of domain terminology: +0.1
- Accurate application of domain rules: +0.1
- Proper citation of relevant principles: +0.1

Critical Deductions:
A. Invalid logical leap: -0.3
B. Missing critical steps: -0.2
C. Incorrect domain application: -0.2
D. Unclear/ambiguous reasoning: -0.1

Additional Instructions:
- Verify domain-specific terminology and principles
- Check for logical consistency throughout
- Ensure conclusions follow from premises
- Flag potential domain-specific compliance issues
- Consider regulatory/professional standards where applicable

Return EXACTLY this JSON (no extra text):
{
    "score": <numerical score 0.0-1.0>,
    "component_scores": {
        "structural_completeness": <score>,
        "logical_quality": <score>,
        "technical_accuracy": <score>
    },
    "steps_present": {
        "problem_statement": <true/false>,
        "variable_definitions": <true/false>,
        "given_information": <true/false>,
        "proof_target": <true/false>,
        "step_reasoning": <true/false>,
        "conclusion": <true/false>
    },
    "reasoning": "<explain scoring decisions and identify any logical gaps>",
    "domain_flags": ["<any domain-specific concerns or compliance issues>"]
}

### (Prompt field from dataset)
Problem Prompt: {{ prompt }}

Model's Response: {{ model_output }}

### Ground truth (if applicable):
{{ ground_truth }}
```

## Verifiable Rewards (RLVR) からの強化学習 - 完全一致
<a name="model-customize-open-weight-samples-RLVR"></a>

**入力データセット**

ソース: [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

**サンプル**

```
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let\'s think step by step and output the final answer after \\"####\\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "reward_model": {
    "ground_truth": "72",
    "style": "rule"
  },
  "extra_info": {
    "answer": "Natalia sold 48\\/2 = <<48\\/2=24>>24 clips in May.\\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

## RLVR - コード実行
<a name="model-customize-open-weight-samples-RLVR2"></a>

**入力データセット**

ソース: [https://huggingface.co/datasets/open-r1/codeforces](https://huggingface.co/datasets/open-r1/codeforces)

**サンプル**

```
{
  "data_source": "codeforces",
  "prompt": [
    {
      "content": "\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n",
      "role": "system"
    },
    {
      "content": "Title: Zebras\n\nTime Limit: None seconds\n\nMemory Limit: None megabytes\n\nProblem Description:\nOleg writes down the history of the days he lived. For each day he decides if it was good or bad. Oleg calls a non-empty sequence of days a zebra, if it starts with a bad day, ends with a bad day, and good and bad days are alternating in it. Let us denote bad days as 0 and good days as 1. Then, for example, sequences of days 0, 010, 01010 are zebras, while sequences 1, 0110, 0101 are not.\n\nOleg tells you the story of days he lived in chronological order in form of string consisting of 0 and 1. Now you are interested if it is possible to divide Oleg's life history into several subsequences, each of which is a zebra, and the way it can be done. Each day must belong to exactly one of the subsequences. For each of the subsequences, days forming it must be ordered chronologically. Note that subsequence does not have to be a group of consecutive days.\n\nInput Specification:\nIn the only line of input data there is a non-empty string *s* consisting of characters 0 and 1, which describes the history of Oleg's life. Its length (denoted as |*s*|) does not exceed 200<=000 characters.\n\nOutput Specification:\nIf there is a way to divide history into zebra subsequences, in the first line of output you should print an integer *k* (1<=\u2264<=*k*<=\u2264<=|*s*|), the resulting number of subsequences. In the *i*-th of following *k* lines first print the integer *l**i* (1<=\u2264<=*l**i*<=\u2264<=|*s*|), which is the length of the *i*-th subsequence, and then *l**i* indices of days forming the subsequence. Indices must follow in ascending order. Days are numbered starting from 1. Each index from 1 to *n* must belong to exactly one subsequence. If there is no way to divide day history into zebra subsequences, print -1.\n\nSubsequences may be printed in any order. If there are several solutions, you may print any of them. 
You do not have to minimize nor maximize the value of *k*.\n\nDemo Input:\n['0010100\\n', '111\\n']\n\nDemo Output:\n['3\\n3 1 3 4\\n3 2 5 6\\n1 7\\n', '-1\\n']\n\nNote:\nnone\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.",
      "role": "user"
    }
  ],
  "ability": "code",
  "reward_model": {
    "ground_truth": "{\"inputs\": [\"0010100\", \"111\", \"0\", \"1\", \"0101010101\", \"010100001\", \"000111000\", \"0101001000\", \"0000001000\", \"0101\", \"000101110\", \"010101010\", \"0101001010\", \"0100101100\", \"0110100000\", \"0000000000\", \"1111111111\", \"0010101100\", \"1010000\", \"0001110\", \"0000000000011001100011110101000101000010010111000100110110000011010011110110001100100001001001010010\", \"01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010\", \"0010011100000000\"], \"outputs\": [\"3\\n1 1\\n5 2 3 4 5 6\\n1 7\", \"-1\", \"1\\n1 1\", \"-1\", \"-1\", \"-1\", \"3\\n3 1 6 7\\n3 2 5 8\\n3 3 4 9\", \"4\\n5 1 2 3 4 5\\n3 6 7 8\\n1 9\\n1 10\", \"8\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n3 6 7 8\\n1 9\\n1 10\", \"-1\", \"-1\", \"1\\n9 1 2 3 4 5 6 7 8 9\", \"2\\n5 1 2 3 4 5\\n5 6 7 8 9 10\", \"2\\n5 1 2 3 8 9\\n5 4 5 6 7 10\", \"-1\", \"10\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n1 9\\n1 10\", \"-1\", \"2\\n3 1 8 9\\n7 2 3 4 5 6 7 10\", \"-1\", \"-1\", \"22\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n7 9 24 25 26 27 28 29\\n7 10 13 14 17 18 23 30\\n11 11 12 15 16 19 22 31 32 33 34 35\\n3 20 21 36\\n3 37 46 47\\n9 38 39 40 45 48 57 58 75 76\\n17 41 42 43 44 49 50 51 54 55 56 59 72 73 74 77 80 81\\n9 52 53 60 71 78 79 82 83 84\\n7 61 64 65 66 67 70 85\\n5 62 63 68 69 86\\n3 87 88 89\\n3 90 91 92\\n5 93 94 95 96 97\\n3 98 99 100\", \"1\\n245 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 ...\", \"8\\n3 1 8 9\\n5 2 3 4 7 10\\n3 5 6 11\\n1 12\\n1 13\\n1 14\\n1 15\\n1 16\"]}",
    "style": "rule"
  },
  "extra_info": {
    "index": 49,
    "split": "train"
  }
}
```

**報酬関数**

報酬関数: [https://github.com/volcengine/verl/tree/main/verl/utils/reward\$1score/prime\$1code](https://github.com/volcengine/verl/tree/main/verl/utils/reward_score/prime_code)

## RLVR - 数学の回答
<a name="model-customize-open-weight-samples-RLVR3"></a>

**入力データセット**

ソース: Cleaned gsm8k データセット [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**サンプル**

```
[
    {
        "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries...",
        "role": "system"
    },
    {
        "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "role": "user"
    },
    {
        "content": "\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections...\n\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n<reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct...\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct...\n</adjustment>\n\n<output>\n15\n</output>",
        "role": "assistant"
    }
]
```

**報酬の計算**

報酬関数: [https://github.com/volcengine/verl/blob/main/verl/utils/reward\$1score/gsm8k.py](https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py)

## RLVR - カスタム Lambda
<a name="model-customize-open-weight-samples-RLVR4"></a>

**入力データセット**

ソース: Cleaned gsm8k データセット [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**サンプル**

```
[
  {
    "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n",
    "role": "system"
  },
  {
    "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
    "role": "user"
  },
  {
    "content": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>",
    "role": "assistant"
  }
]
```

**報酬の計算例**

```
# RLVR Evaluator for OSS

# lambda_grader.py
import json
import re
import uuid
from typing import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**サンプル報酬関数コード**

```
# RLVR Evaluator for OSS
# lambda_grader.py

import json
import re
import uuid from typing 
import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**サンプル報酬プロンプト**

```
You are an expert RAG response evaluator specializing in faithfulness and relevance assessment.
Given: Context documents, a question, and response statements.
Goal: Evaluate both statement-level faithfulness and overall response relevance to the question.

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Faithfulness Assessment (0.6 max)
Per statement evaluation:
- Direct support in context: +0.2
- Accurate inference from context: +0.2
- No contradictions with context: +0.2
Deductions:
- Hallucination: -0.3
- Misrepresentation of context: -0.2
- Unsupported inference: -0.1

Question Relevance (0.4 max)
- Direct answer to question: +0.2
- Appropriate scope/detail: +0.1
- Proper context usage: +0.1
Deductions:
- Off-topic content: -0.2
- Implicit/meta responses: -0.2
- Missing key information: -0.1

Critical Flags:
A. Complete hallucination
B. Context misalignment
C. Question misinterpretation
D. Implicit-only responses

Additional Instructions:
- Evaluate each statement independently
- Check for direct textual support
- Verify logical inferences
- Assess answer completeness
- Flag any unsupported claims

Return EXACTLY this JSON (no extra text):
{
    "statements_evaluation": [
        {
            "statement": "<statement_text>",
            "verdict": <0 or 1>,
            "reason": "<detailed explanation>",
            "context_support": "<relevant context quote or 'None'>"
        }
    ],
    "overall_assessment": {
        "question_addressed": <0 or 1>,
        "reasoning": "<explanation>",
        "faithfulness_score": <0.0-1.0>,
        "relevance_score": <0.0-1.0>
    },
    "flags": ["<any critical issues>"]
}

## Current Evaluation Task

### Context
{{ ground_truth }}

### Question
{{ extra_info.question }}

### Model's Response
{{ model_output }}
```

# リリースノート
<a name="model-customize-release-note"></a>

**SageMaker AI モデルのカスタマイズイメージ**

**サポートプラン**
+ メジャーバージョン: 次のメジャーリリースから 12 か月後
+ マイナーバージョン: 次のマイナーリリースから 6 か月後
+ パッチバージョン: サポートの保証なし (最新のパッチへのアップグレード)

Amazon EKS (EKS) および SageMaker AI トレーニングジョブ (SMTJ) 用の Base Deep Learning Containers のリリースノートを以下に示します。


****  

| バージョン | タイプ | サービス | イメージ URL | 
| --- | --- | --- | --- | 
|  1.0.0  | CUDA | EKS |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-eks-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-smtj-v1.0.0`  | 
|  1.0.0  | CUDA | SMJT |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:v1-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:llama-90b-v1.0.0`  | 

**AWS リージョン ** のサポート


****  

| リージョン | コード | サーバーレス SMTJ サポート | 
| --- | --- | --- | 
| アジアパシフィック (ムンバイ) | ap-south-1 | なし | 
| アジアパシフィック (シンガポール) | ap-southeast-1 | なし | 
| アジアパシフィック (シドニー) | ap-southeast-2 | なし | 
| アジアパシフィック (東京) | ap-northeast-1 | はい | 
| 欧州 (フランクフルト) | eu-central-1 | なし | 
| 欧州 (アイルランド) | eu-west-1 | はい | 
| 欧州 (ストックホルム) | eu-north-1 | なし | 
| 南米 (サンパウロ） | sa-east-1 | なし | 
| 米国東部 (バージニア北部) | us-east-1 | はい | 
| 米国東部 (オハイオ) | us-east-2 | なし | 
| 米国西部（北カリフォルニア) | us-west-1 | なし | 
| 米国西部 (オレゴン) | us-west-2 | はい | 